Every eval function receives an EvalContext that accumulates evaluation data and builds into an EvalResult.
from twevals import eval, EvalContext

@eval(input="What is 2 + 2?", dataset="demo")
async def my_eval(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == "4", f"Expected 4, got {ctx.output}"

Pre-populated Fields

Set context fields directly in the @eval decorator:
@eval(
    input="What is the capital of France?",
    reference="Paris",
    dataset="geography",
    metadata={"category": "capitals"}
)
async def test_capital(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference
With @parametrize, special parameter names (input, reference, metadata, run_data, latency) auto-populate context fields:
@eval(dataset="sentiment")
@parametrize("input,reference", [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
])
async def test_sentiment(ctx: EvalContext):
    ctx.output = await analyze_sentiment(ctx.input)
    assert ctx.output == ctx.reference
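
The same mechanism covers the other special names. A sketch (the import path for parametrize is assumed to be twevals, matching eval; my_agent stands in for your system under test as in the examples above) that also pre-populates metadata per case:
from twevals import eval, parametrize, EvalContext  # parametrize import path assumed

@eval(dataset="math")
@parametrize("input,reference,metadata", [
    ("What is 2 + 2?", "4", {"difficulty": "easy"}),
    ("What is 17 * 23?", "391", {"difficulty": "medium"}),
])
async def test_math(ctx: EvalContext):
    # ctx.input, ctx.reference, and ctx.metadata are already populated for each case
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference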

Setting Fields

Input, Output, and Reference

ctx.input = "Your test input"
ctx.output = "The model's response"
ctx.reference = "Expected output"  # Optional

Metadata

Store additional context for debugging and analysis:
ctx.metadata["model"] = "gpt-4"
ctx.metadata["temperature"] = 0.7

Scoring with Assertions

Use assert statements to record pass/fail scores:
ctx.output = await my_agent(ctx.input)

assert ctx.output is not None, "Got no output"
assert "expected" in ctx.output.lower(), "Missing expected content"
Failed assertions become failing scores with the message as notes.
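Conceptually, a failed assertion behaves like recording a failing score yourself with add_score(), described below (a rough equivalence for intuition, not the framework's exact internals):
# roughly what the failed assertion above records
if "expected" not in ctx.output.lower():
    ctx.add_score(False, "Missing expected content")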

Advanced Scoring with add_score()

Use add_score() when you need numeric scores or multiple named metrics:
# Numeric score (0-1 range)
ctx.add_score(0.85, "Confidence score")

# Multiple scores with different keys
ctx.add_score(True, "Format correct", key="format")
ctx.add_score(0.9, "Relevance score", key="relevance")
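
Putting this together, a sketch of an eval that records several named metrics for a single output (the relevance value here is a hypothetical placeholder, e.g. from a judge model; my_agent stands in for your system under test):
@eval(input="Summarize the plot of Hamlet in one sentence.", dataset="summaries")
async def test_summary(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    ctx.add_score(ctx.output.strip().endswith("."), "Ends with a full stop", key="format")
    ctx.add_score(0.9, "Relevance score", key="relevance")  # hypothetical judge score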

Auto-Return Behavior

You don’t need to explicitly return anything—the context automatically builds into an EvalResult when the function completes:
@eval(input="test", dataset="demo")
async def my_eval(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output, "Got empty response"
    # No return needed

Exception Safety

If your evaluation raises an exception, partial data is preserved:
@eval(input="test input", dataset="demo")
async def test_with_error(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    # If this fails, input and output are still recorded
    assert ctx.output == "expected"
The resulting EvalResult will have:
  • input and output preserved
  • error field with the exception message
  • A failing score automatically added
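
Because partial data survives exceptions, it can be useful to stash debugging details in run_data before a risky step. A sketch (my_agent again stands in for your system under test):
@eval(input="test input", dataset="demo")
async def test_with_trace(ctx: EvalContext):
    response = await my_agent(ctx.input)
    ctx.output = response
    ctx.run_data["raw_response"] = response  # kept even if the assertion below raises
    assert "expected" in response, "Missing expected content"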

Default Scoring

If no score is added and no assertions fail, Twevals auto-adds a passing score:
@eval(input="test", dataset="demo", default_score_key="correctness")
async def test_auto_pass(ctx: EvalContext):
    ctx.output = "result"
    # No explicit score - auto-passes with key "correctness"

Custom Parameters

For parametrized tests with custom parameter names, include them in your function signature:
@eval(dataset="demo")
@parametrize("prompt,expected_category", [
    ("I want a refund", "complaint"),
    ("Thank you!", "praise"),
])
async def test_classification(ctx: EvalContext, prompt, expected_category):
    ctx.input = prompt
    ctx.output = await classify(prompt)
    assert ctx.output == expected_category

API Reference

Method                          Description
add_score(value, notes, key)    Add a score (boolean or numeric)
build()                         Convert to an immutable EvalResult
Property     Type     Description
input        Any      The test input
output       Any      The system output
reference    Any      Expected output (optional)
metadata     dict     Custom metadata
run_data     dict     Debug/trace data
latency      float    Execution time
scores       list     List of Score objects
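
The run_data and latency fields from the table are set the same way as the other fields. A sketch that records a manual timing (the latency unit is assumed to be seconds, and the run_data entry is a hypothetical debug value):
import time

@eval(input="What is 2 + 2?", dataset="demo")
async def timed_eval(ctx: EvalContext):
    start = time.perf_counter()
    ctx.output = await my_agent(ctx.input)
    ctx.latency = time.perf_counter() - start   # assumed unit: seconds
    ctx.run_data["model_call_count"] = 1        # hypothetical debug value
    assert ctx.output == "4"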