The `@eval` decorator marks functions as evaluations. Here's a complete example:

```python
from twevals import eval, parametrize, EvalContext


@eval(
    dataset="customer_service",
    labels=["production", "critical"],
    input="I want to cancel my subscription",
    metadata={"category": "cancellation"},
    timeout=10.0,
)
async def test_cancellation_handling(ctx: EvalContext):
    ctx.output = await customer_agent(ctx.input)
    assert "cancel" in ctx.output.lower(), "Should address cancellation"
    assert len(ctx.output) > 50, "Response too short"
```
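Assuming this lives in a file such as `evals.py`, it can be run with the CLI (filtering options are shown later on this page):

```bash
twevals run evals.py
```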
## Configuration Options
### Dataset
Groups related evaluations together:
```python
@eval(dataset="sentiment_analysis")
async def test_positive_sentiment(ctx: EvalContext):
    ...


@eval(dataset="sentiment_analysis")
async def test_negative_sentiment(ctx: EvalContext):
    ...
```
If not specified, `dataset` defaults to the filename (e.g., `evals.py` → `evals`).
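For example, in a file named `evals.py`, the following evaluation is recorded under the `evals` dataset (`my_agent` is a placeholder for your own agent):

```python
# evals.py: no dataset argument, so the dataset defaults to "evals"
@eval(input="ping", labels=["smoke"])
async def test_default_dataset(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output
```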
### Labels
Tags for filtering:
```python
@eval(labels=["production", "fast"])
async def test_quick_response(ctx: EvalContext):
    ...


@eval(labels=["experimental", "slow"])
async def test_complex_reasoning(ctx: EvalContext):
    ...
```
Filter with the CLI:

```bash
twevals run evals.py --label production
```
### Pre-populated Fields
Set context fields directly in the decorator:
```python
@eval(
    input="What is 2 + 2?",
    reference="4",
    metadata={"difficulty": "easy"},
)
async def test_arithmetic(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference
```
### Default Score Key

Specify the key used for automatically added scores (when an assertion fails or when `add_score()` is called):
```python
@eval(input="Classify this text", default_score_key="accuracy")
async def test_classification(ctx: EvalContext):
    ctx.output = await classifier(ctx.input)
    assert ctx.output in ["positive", "negative", "neutral"]
```
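A minimal sketch of pairing this with `add_score()`; the exact `add_score()` signature is assumed here to mirror the score-dict fields (`passed`, `notes`) used elsewhere on this page:

```python
@eval(input="Classify this text", default_score_key="accuracy")
async def test_classification_scored(ctx: EvalContext):
    ctx.output = await classifier(ctx.input)
    # Assumption: a score added without an explicit key is recorded under
    # the decorator's default_score_key ("accuracy").
    ctx.add_score(passed=ctx.output in ["positive", "negative", "neutral"])
```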
### Metadata from Params

Auto-extract parametrized values into metadata:
```python
@eval(metadata_from_params=["model", "temperature"])
@parametrize("model,temperature,prompt", [
    ("gpt-4", 0.0, "Hello"),
    ("gpt-3.5", 0.7, "Hello"),
])
async def test_models(ctx: EvalContext, model, temperature, prompt):
    # metadata automatically includes {"model": "gpt-4", "temperature": 0.0}
    ...
```
### Timeout
Set a maximum execution time:
```python
@eval(input="complex task", timeout=5.0)
async def test_with_timeout(ctx: EvalContext):
    ctx.output = await slow_agent(ctx.input)
```
On timeout, the evaluation fails with an error message.
### Target Hook
Run a function before the evaluation body:
```python
def call_agent(ctx: EvalContext):
    ctx.output = my_agent(ctx.input)


@eval(input="What's the weather?", target=call_agent)
async def test_weather(ctx: EvalContext):
    # ctx.output already populated by target
    assert "weather" in ctx.output.lower()
```
This separates agent invocation from assertion logic.
### Evaluators
Post-processing functions that add scores:
```python
def check_length(result):
    return {
        "key": "length",
        "passed": len(result.output) > 10,
        "notes": f"Output length: {len(result.output)}",
    }


@eval(input="test", evaluators=[check_length])
async def test_response(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    # check_length runs after and adds the "length" score
```
## Sync and Async

Both sync and async functions work; use `async def` if your code uses `await`.
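A sync version of an earlier example (`sync_agent` is a placeholder for a blocking agent call):

```python
@eval(input="What is 2 + 2?", reference="4")
def test_arithmetic_sync(ctx: EvalContext):
    ctx.output = sync_agent(ctx.input)
    assert ctx.output == ctx.reference
```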
## Returning Multiple Results
Return a list of EvalResult objects for batch evaluations:
```python
from twevals import eval, EvalResult


@eval(dataset="batch")
def test_batch():
    results = []
    for prompt in ["hello", "hi", "hey"]:
        output = my_agent(prompt)
        results.append(EvalResult(
            input=prompt,
            output=output,
            scores=[{"key": "valid", "passed": True}],
        ))
    return results
```
## All Options Reference
| Option | Type | Description |
|---|---|---|
| `dataset` | `str` | Group name for the evaluation |
| `labels` | `list[str]` | Tags for filtering |
| `input` | `Any` | Pre-populate `ctx.input` |
| `reference` | `Any` | Pre-populate `ctx.reference` |
| `metadata` | `dict` | Pre-populate `ctx.metadata` |
| `metadata_from_params` | `list[str]` | Auto-extract params to metadata |
| `default_score_key` | `str` | Key for auto-added scores |
| `timeout` | `float` | Max execution time in seconds |
| `target` | `callable` | Pre-hook to run before evaluation |
| `evaluators` | `list[callable]` | Post-processing score functions |