The @eval decorator marks functions as evaluations. Here’s a complete example:
from twevals import eval, parametrize, EvalContext

@eval(
    dataset="customer_service",
    labels=["production", "critical"],
    input="I want to cancel my subscription",
    metadata={"category": "cancellation"},
    timeout=10.0,
)
async def test_cancellation_handling(ctx: EvalContext):
    ctx.output = await customer_agent(ctx.input)

    assert "cancel" in ctx.output.lower(), "Should address cancellation"
    assert len(ctx.output) > 50, "Response too short"

Configuration Options

Dataset

Groups related evaluations together:
@eval(dataset="sentiment_analysis")
async def test_positive_sentiment(ctx: EvalContext):
    ...

@eval(dataset="sentiment_analysis")
async def test_negative_sentiment(ctx: EvalContext):
    ...
If not specified, dataset defaults to the filename (e.g., evals.py → evals).

Labels

Tags for filtering:
@eval(labels=["production", "fast"])
async def test_quick_response(ctx: EvalContext):
    ...

@eval(labels=["experimental", "slow"])
async def test_complex_reasoning(ctx: EvalContext):
    ...
Filter from the CLI:
twevals run evals.py --label production

Pre-populated Fields

Set context fields directly in the decorator:
@eval(
    input="What is 2 + 2?",
    reference="4",
    metadata={"difficulty": "easy"}
)
async def test_arithmetic(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference

Default Score Key

Specify the default key under which scores are recorded (used when assertions fail or when calling add_score()):
@eval(input="Classify this text", default_score_key="accuracy")
async def test_classification(ctx: EvalContext):
    ctx.output = await classifier(ctx.input)
    assert ctx.output in ["positive", "negative", "neutral"]

Metadata from Parameters

Auto-extract parametrized values to metadata:
@eval(metadata_from_params=["model", "temperature"])
@parametrize("model,temperature,prompt", [
    ("gpt-4", 0.0, "Hello"),
    ("gpt-3.5", 0.7, "Hello"),
])
async def test_models(ctx: EvalContext, model, temperature, prompt):
    # metadata automatically includes {"model": "gpt-4", "temperature": 0.0}
    ...

Timeout

Set a maximum execution time:
@eval(input="complex task", timeout=5.0)
async def test_with_timeout(ctx: EvalContext):
    ctx.output = await slow_agent(ctx.input)
On timeout, the evaluation fails with an error message.

Target Hook

Run a function before the evaluation body:
def call_agent(ctx: EvalContext):
    ctx.output = my_agent(ctx.input)

@eval(input="What's the weather?", target=call_agent)
async def test_weather(ctx: EvalContext):
    # ctx.output already populated by target
    assert "weather" in ctx.output.lower()
This separates agent invocation from assertion logic.

Evaluators

Post-processing functions that add scores:
def check_length(result):
    return {
        "key": "length",
        "passed": len(result.output) > 10,
        "notes": f"Output length: {len(result.output)}"
    }

@eval(input="test", evaluators=[check_length])
async def test_response(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    # check_length runs after, adds "length" score

Sync and Async

Both sync and async functions work—just use async def if your code uses await.
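For instance, a synchronous evaluation can call a blocking agent directly (a minimal sketch, assuming a plain def receives the same EvalContext and reusing the my_agent placeholder from the examples above):
@eval(dataset="sync_examples", input="Summarize this paragraph")
def test_sync_summary(ctx: EvalContext):
    # No await needed: a plain def works when the agent call is synchronous
    ctx.output = my_agent(ctx.input)
    assert ctx.output, "Expected a non-empty summary"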

Returning Multiple Results

Return a list of EvalResult objects for batch evaluations:
from twevals import eval, EvalResult

@eval(dataset="batch")
def test_batch():
    results = []
    for prompt in ["hello", "hi", "hey"]:
        output = my_agent(prompt)
        results.append(EvalResult(
            input=prompt,
            output=output,
            scores=[{"key": "valid", "passed": True}]
        ))
    return results

All Options Reference

Option                Type            Description
dataset               str             Group name for the evaluation
labels                list[str]       Tags for filtering
input                 Any             Pre-populate ctx.input
reference             Any             Pre-populate ctx.reference
metadata              dict            Pre-populate ctx.metadata
metadata_from_params  list[str]       Auto-extract params to metadata
default_score_key     str             Key for auto-added scores
timeout               float           Max execution time in seconds
target                callable        Pre-hook to run before evaluation
evaluators            list[callable]  Post-processing score functions