The @eval decorator marks functions as evaluations. Here’s a complete example:
from twevals import eval, parametrize, EvalContext

@eval(
    dataset="customer_service",
    labels=["production", "critical"],
    input="I want to cancel my subscription",
    metadata={"category": "cancellation"},
    timeout=10.0,
)
async def test_cancellation_handling(ctx: EvalContext):
    ctx.output = await customer_agent(ctx.input)

    assert "cancel" in ctx.output.lower(), "Should address cancellation"
    assert len(ctx.output) > 50, "Response too short"

Configuration Options

Dataset

Groups related evaluations together:
@eval(dataset="sentiment_analysis")
async def test_positive_sentiment(ctx: EvalContext):
    ...

@eval(dataset="sentiment_analysis")
async def test_negative_sentiment(ctx: EvalContext):
    ...
If not specified, dataset defaults to the filename (e.g., evals.py → evals).

Labels

Tags for filtering:
@eval(labels=["production", "fast"])
async def test_quick_response(ctx: EvalContext):
    ...

@eval(labels=["experimental", "slow"])
async def test_complex_reasoning(ctx: EvalContext):
    ...
Filter from the CLI:
twevals run evals.py --label production

Pre-populated Fields

Set context fields directly in the decorator:
@eval(
    input="What is 2 + 2?",
    reference="4",
    metadata={"difficulty": "easy"}
)
async def test_arithmetic(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference

Default Score Key

Specify the default key under which scores are recorded (used when assertions fail or when calling add_score()):
@eval(input="Classify this text", default_score_key="accuracy")
async def test_classification(ctx: EvalContext):
    ctx.output = await classifier(ctx.input)
    assert ctx.output in ["positive", "negative", "neutral"]

Metadata from Parameters

Auto-extract parametrized values to metadata:
@eval(metadata_from_params=["model", "temperature"])
@parametrize("model,temperature,prompt", [
    ("gpt-4", 0.0, "Hello"),
    ("gpt-3.5", 0.7, "Hello"),
])
async def test_models(ctx: EvalContext, model, temperature, prompt):
    # metadata automatically includes {"model": "gpt-4", "temperature": 0.0}
    ...

Timeout

Set a maximum execution time:
@eval(input="complex task", timeout=5.0)
async def test_with_timeout(ctx: EvalContext):
    ctx.output = await slow_agent(ctx.input)
On timeout, the evaluation fails with an error message.

Target Hook

Run a function before the evaluation body:
def call_agent(ctx: EvalContext):
    ctx.output = my_agent(ctx.input)

@eval(input="What's the weather?", target=call_agent)
async def test_weather(ctx: EvalContext):
    # ctx.output already populated by target
    assert "weather" in ctx.output.lower()
This separates agent invocation from assertion logic.

Evaluators

Post-processing functions that add scores:
def check_length(result):
    return {
        "key": "length",
        "passed": len(result.output) > 10,
        "notes": f"Output length: {len(result.output)}"
    }

@eval(input="test", evaluators=[check_length])
async def test_response(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    # check_length runs after, adds "length" score

Sync and Async

Both sync and async functions work—just use async def if your code uses await.
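For instance, a synchronous evaluation can call a blocking agent directly (a minimal sketch, assuming a plain def receives the same EvalContext and reusing the my_agent placeholder from the examples above):
@eval(dataset="sync_examples", input="Summarize this paragraph")
def test_sync_summary(ctx: EvalContext):
    # No await needed: a plain def works when the agent call is synchronous
    ctx.output = my_agent(ctx.input)
    assert ctx.output, "Expected a non-empty summary"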

Returning Multiple Results

Return a list of EvalResult objects for batch evaluations:
from twevals import eval, EvalResult

@eval(dataset="batch")
def test_batch():
    results = []
    for prompt in ["hello", "hi", "hey"]:
        output = my_agent(prompt)
        results.append(EvalResult(
            input=prompt,
            output=output,
            scores=[{"key": "valid", "passed": True}]
        ))
    return results

All Options Reference

Option                Type            Description
dataset               str             Group name for the evaluation
labels                list[str]       Tags for filtering
input                 Any             Pre-populate ctx.input
reference             Any             Pre-populate ctx.reference
metadata              dict            Pre-populate ctx.metadata
metadata_from_params  list[str]       Auto-extract params to metadata
default_score_key     str             Key for auto-added scores
timeout               float           Max execution time in seconds
target                callable        Pre-hook to run before evaluation
evaluators            list[callable]  Post-processing score functions