RAG Agent Evaluation

This example shows a common pattern: evaluating a RAG agent across a dataset of question-answer pairs, checking for both hallucination and correctness.

The Setup

You have a RAG agent that:
  1. Takes a user question
  2. Retrieves relevant documents
  3. Generates an answer grounded in those documents
You want to evaluate:
  • Hallucination — Is the answer grounded in the retrieved documents?
  • Correctness — Does the answer match the expected reference?
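
Concretely, the agent side only needs to hand back the generated answer and the documents it retrieved. Here is a minimal sketch of such an agent, assuming a RAGResult return shape and a toy keyword lookup in place of a real vector store (neither is part of twevals):

from dataclasses import dataclass, field

# Toy in-memory "knowledge base" standing in for a real vector store.
DOCS = {
    "refund": "policy.md: We offer a 30-day money-back guarantee on all purchases.",
    "password": "help.md: Click 'Forgot Password' on the login page to reset it.",
}

@dataclass
class RAGResult:
    response: str
    metadata: dict = field(default_factory=dict)

async def run_rag_agent(question: str) -> RAGResult:
    # 1. Retrieve: naive keyword match; swap in your own retriever.
    docs = [text for key, text in DOCS.items() if key in question.lower()]
    # 2. Generate: a real agent calls an LLM here; we echo the matched doc.
    answer = docs[0].split(": ", 1)[1] if docs else "I don't know."
    return RAGResult(response=answer, metadata={"source_docs": docs})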

The Eval

from twevals import eval, parametrize, EvalContext

async def run_agent(ctx: EvalContext):
    # run_rag_agent is your application code; it returns the generated
    # response plus metadata about the documents it retrieved.
    results = await run_rag_agent(ctx.input)
    ctx.output = results.response
    ctx.metadata["source_docs"] = results.metadata["source_docs"]


@eval(target=run_agent, dataset="rag_qa")
@parametrize("input,reference", [
    ("What is our refund policy?", "30-day money-back guarantee"),
    ("How do I reset my password?", "Click 'Forgot Password' on the login page"),
    ("What payment methods do you accept?", "Visa, Mastercard, and PayPal"),
    ("How long does shipping take?", "3-5 business days for standard shipping"),
    ("Can I change my order after placing it?", "Within 1 hour of placing the order"),
    # ... hundreds more rows
])
async def test_rag_agent(ctx: EvalContext):
    """Evaluate a RAG Agent for correctness and hallucinations"""
    # LLM-as-a-Judge for hallucination
    score, reasoning = await hallucination_judge(
        answer=ctx.output,
        sources=ctx.metadata["source_docs"]
    )
    ctx.add_score(score, reasoning, key="hallucination")

    # Check correctness against reference
    correctness_result = await correctness_judge(
        answer=ctx.output,
        reference=ctx.reference
    )
    assert correctness_result.is_correct, correctness_result.explanation

What’s Happening

  • Parametrized dataset — The @parametrize decorator creates one eval run per row. Each row sets ctx.input and ctx.reference automatically.
  • Target function — The run_agent target runs before the eval body, populating ctx.output and storing source docs in metadata.
  • Storing context for analysis — We save the retrieved documents to ctx.metadata. This shows up in the results JSON and Web UI, so you can debug retrieval issues.
  • Multiple scoring criteria — We use ctx.add_score() for hallucination (a named score) and assert for correctness (the default score). Both appear in your results.
  • LLM-as-judge — The hallucination_judge and correctness_judge are placeholder functions representing whatever LLM judge you’re using (OpenAI, Anthropic, your own prompts, etc.); a sketch of one follows below.
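
If you need a starting point for the hallucination judge, one structured LLM call is usually enough. Below is a sketch assuming the OpenAI Python SDK and a JSON verdict format; the model name, prompt, and (bool, str) return shape are illustrative choices, not requirements:

import json

from openai import AsyncOpenAI  # any provider with an async client works the same way

client = AsyncOpenAI()

async def hallucination_judge(answer: str, sources: list[str]) -> tuple[bool, str]:
    # Ask the judge model whether every claim in the answer is supported
    # by the retrieved documents, and to explain its verdict as JSON.
    joined_sources = "\n".join(sources)
    prompt = (
        "You are grading a RAG answer for hallucination.\n"
        f"Retrieved documents:\n{joined_sources}\n\n"
        f"Answer:\n{answer}\n\n"
        'Reply as JSON: {"grounded": true or false, "reasoning": "..."}'
    )
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: use whatever judge model you trust
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    verdict = json.loads(resp.choices[0].message.content)
    return verdict["grounded"], verdict["reasoning"]

The correctness_judge follows the same pattern, comparing ctx.output against ctx.reference and returning an object with is_correct and explanation fields, as used in the eval body above.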

Running It

# Run headlessly
twevals run evals/rag_agent.py

# Start the web UI
twevals serve evals/rag_agent.py

# Run verbosely to see each result
twevals run evals/rag_agent.py --verbose

Example Results

After running, you’ll have a JSON file with structured results:
{
  "dataset": "rag_qa",
  "results": [
    {
      "input": "What is our refund policy?",
      "output": "We offer a 30-day money-back guarantee on all purchases.",
      "reference": "30-day money-back guarantee",
      "metadata": {
        "retrieved_docs": ["policy.md: section 4.2..."],
        "sources": ["policy.md"]
      },
      "scores": {
        "hallucination": {
          "passed": true,
          "message": "Answer is fully grounded in retrieved documents"
        },
        "default": {
          "passed": true,
          "message": null
        }
      }
    }
  ]
}
Your coding agent can read this JSON, analyze patterns, identify which questions have retrieval issues, and suggest improvements—all without leaving the terminal.
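
For example, a few lines of Python are enough to pull the failing rows out of that file (the results.json path is an assumption; point it at wherever twevals wrote your run):

import json

# Assumption: adjust the path to the results file from your run.
with open("results.json") as f:
    data = json.load(f)

# Collect every row where any score failed, keyed by score name.
failures = [
    (row["input"], key, score.get("message"))
    for row in data["results"]
    for key, score in row["scores"].items()
    if not score["passed"]
]

for question, key, message in failures:
    print(f"[{key}] {question}: {message}")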