Every eval function receives an EvalContext that accumulates evaluation data and builds into an EvalResult.
from twevals import eval, EvalContext

@eval(input="What is 2 + 2?", dataset="demo")
async def my_eval(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == "4", f"Expected 4, got {ctx.output}"

Pre-populated Fields

Set context fields directly in the @eval decorator:
@eval(
    input="What is the capital of France?",
    reference="Paris",
    dataset="geography",
    metadata={"category": "capitals"}
)
async def test_capital(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference
With @parametrize, special parameter names (input, reference, metadata, run_data, latency) auto-populate context fields:
@eval(dataset="sentiment")
@parametrize("input,reference", [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
])
async def test_sentiment(ctx: EvalContext):
    ctx.output = await analyze_sentiment(ctx.input)
    assert ctx.output == ctx.reference
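
The same mechanism covers the other special names. A sketch (the import path for parametrize is assumed to be twevals, matching eval; my_agent stands in for your system under test as in the examples above) that also pre-populates metadata per case:
from twevals import eval, parametrize, EvalContext  # parametrize import path assumed

@eval(dataset="math")
@parametrize("input,reference,metadata", [
    ("What is 2 + 2?", "4", {"difficulty": "easy"}),
    ("What is 17 * 23?", "391", {"difficulty": "medium"}),
])
async def test_math(ctx: EvalContext):
    # ctx.input, ctx.reference, and ctx.metadata are already populated for each case
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference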

Setting Fields

Input, Output, and Reference

ctx.input = "Your test input"
ctx.output = "The model's response"
ctx.reference = "Expected output"  # Optional

Metadata

Store additional context for debugging and analysis:
ctx.metadata["model"] = "gpt-4"
ctx.metadata["temperature"] = 0.7

Scoring with Assertions

Use assert statements to record pass/fail scores:
ctx.output = await my_agent(ctx.input)

assert ctx.output is not None, "Got no output"
assert "expected" in ctx.output.lower(), "Missing expected content"
Failed assertions become failing scores with the message as notes.
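Conceptually, a failed assertion behaves like recording a failing score yourself with add_score(), described below (a rough equivalence for intuition, not the framework's exact internals):
# roughly what the failed assertion above records
if "expected" not in ctx.output.lower():
    ctx.add_score(False, "Missing expected content")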

Advanced Scoring with add_score()

Use add_score() when you need numeric scores or multiple named metrics:
# Numeric score (0-1 range)
ctx.add_score(0.85, "Confidence score")

# Multiple scores with different keys
ctx.add_score(True, "Format correct", key="format")
ctx.add_score(0.9, "Relevance score", key="relevance")
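
Putting this together, a sketch of an eval that records several named metrics for a single output (the relevance value here is a hypothetical placeholder, e.g. from a judge model; my_agent stands in for your system under test):
@eval(input="Summarize the plot of Hamlet in one sentence.", dataset="summaries")
async def test_summary(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    ctx.add_score(ctx.output.strip().endswith("."), "Ends with a full stop", key="format")
    ctx.add_score(0.9, "Relevance score", key="relevance")  # hypothetical judge score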

Auto-Return Behavior

You don’t need to explicitly return anything—the context automatically builds into an EvalResult when the function completes:
@eval(input="test", dataset="demo")
async def my_eval(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output, "Got empty response"
    # No return needed

Exception Safety

If your evaluation raises an exception, partial data is preserved:
@eval(input="test input", dataset="demo")
async def test_with_error(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    # If this fails, input and output are still recorded
    assert ctx.output == "expected"
The resulting EvalResult will have:
  • input and output preserved
  • error field with the exception message
  • A failing score automatically added
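
Because partial data survives exceptions, it can be useful to stash debugging details in run_data before a risky step. A sketch (my_agent again stands in for your system under test):
@eval(input="test input", dataset="demo")
async def test_with_trace(ctx: EvalContext):
    response = await my_agent(ctx.input)
    ctx.output = response
    ctx.run_data["raw_response"] = response  # kept even if the assertion below raises
    assert "expected" in response, "Missing expected content"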

Default Scoring

If no score is added and no assertions fail, Twevals auto-adds a passing score:
@eval(input="test", dataset="demo", default_score_key="correctness")
async def test_auto_pass(ctx: EvalContext):
    ctx.output = "result"
    # No explicit score - auto-passes with key "correctness"

Custom Parameters

For parametrized tests with custom parameter names, include them in your function signature:
@eval(dataset="demo")
@parametrize("prompt,expected_category", [
    ("I want a refund", "complaint"),
    ("Thank you!", "praise"),
])
async def test_classification(ctx: EvalContext, prompt, expected_category):
    ctx.input = prompt
    ctx.output = await classify(prompt)
    assert ctx.output == expected_category

API Reference

Method                          Description
add_score(value, notes, key)    Add a score (boolean or numeric)
build()                         Convert to an immutable EvalResult
Property     Type     Description
input        Any      The test input
output       Any      The system output
reference    Any      Expected output (optional)
metadata     dict     Custom metadata
run_data     dict     Debug/trace data
latency      float    Execution time
scores       list     List of Score objects
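
The run_data and latency fields from the table are set the same way as the other fields. A sketch that records a manual timing (the latency unit is assumed to be seconds, and the run_data entry is a hypothetical debug value):
import time

@eval(input="What is 2 + 2?", dataset="demo")
async def timed_eval(ctx: EvalContext):
    start = time.perf_counter()
    ctx.output = await my_agent(ctx.input)
    ctx.latency = time.perf_counter() - start   # assumed unit: seconds
    ctx.run_data["model_call_count"] = 1        # hypothetical debug value
    assert ctx.output == "4"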