The @parametrize decorator lets you generate multiple evaluations from one function, similar to pytest’s parametrize.
Basic Usage
from twevals import eval, parametrize, EvalContext

@eval(dataset="math")
@parametrize("a,b,expected", [
    (2, 3, 5),
    (10, 20, 30),
    (0, 0, 0),
])
def test_addition(ctx: EvalContext, a, b, expected):
    result = a + b
    ctx.input = f"{a} + {b}"
    ctx.output = result
    assert result == expected, f"Expected {expected}, got {result}"
This generates three evaluations with numeric IDs:
test_addition[0]
test_addition[1]
test_addition[2]
Without custom ids, test variants are numbered sequentially. Use the ids parameter for readable names.
Tuple List
@parametrize("name,age", [
("Alice", 30),
("Bob", 25),
])
Dictionary List
More readable for complex cases:
@parametrize("operation,a,b,expected", [
{"operation": "add", "a": 2, "b": 3, "expected": 5},
{"operation": "multiply", "a": 4, "b": 5, "expected": 20},
{"operation": "subtract", "a": 10, "b": 3, "expected": 7},
])
Custom IDs
Name your test cases:
@parametrize("threshold", [0.2, 0.5, 0.8], ids=["low", "mid", "high"])
def test_thresholds(ctx: EvalContext, threshold):
    ...
Generates:
test_thresholds[low]
test_thresholds[mid]
test_thresholds[high]
Stacking Decorators (Cartesian Product)
Stack multiple @parametrize decorators for all combinations:
@eval(dataset="model_comparison")
@parametrize("model", ["gpt-4", "gpt-3.5", "claude"])
@parametrize("temperature", [0.0, 0.5, 1.0])
def test_models(ctx: EvalContext, model, temperature):
    ...
This generates 9 evaluations (3 models × 3 temperatures) with numeric IDs:
test_models[0]
test_models[1]
…
test_models[8]
For readable names, provide ids on each decorator.
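For example, a minimal sketch with ids supplied on both decorators (the label names here are illustrative):
@eval(dataset="model_comparison")
@parametrize("model", ["gpt-4", "gpt-3.5", "claude"], ids=["gpt4", "gpt35", "claude"])
@parametrize("temperature", [0.0, 0.5, 1.0], ids=["cold", "warm", "hot"])
def test_models(ctx: EvalContext, model, temperature):
    ...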
Auto-Mapping to Context
Only parameters with these special names automatically populate the context:
input → ctx.input
reference → ctx.reference
metadata → merged into ctx.metadata
run_data → ctx.run_data
latency → ctx.latency
@eval(dataset="qa")
@parametrize("input,reference", [
("What is 2+2?", "4"),
("What is the capital of France?", "Paris"),
])
async def test_qa(ctx: EvalContext):
# ctx.input and ctx.reference are auto-populated
ctx.output = await my_agent(ctx.input)
assert ctx.output == ctx.reference
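The same mapping applies to metadata, run_data, and latency. A minimal sketch using metadata (the function name and per-case values are illustrative):
@eval(dataset="qa")
@parametrize("input,reference,metadata", [
    ("What is 2+2?", "4", {"difficulty": "easy"}),
    ("What is 17 * 23?", "391", {"difficulty": "medium"}),
])
async def test_qa_tagged(ctx: EvalContext):
    # input -> ctx.input, reference -> ctx.reference,
    # and each case's metadata dict is merged into ctx.metadata
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference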
Accessing Non-Special Parameters
For parameters that aren’t special context fields, you must include them in your function signature:
@parametrize("prompt,expected", [...])
def test_example(ctx: EvalContext, prompt, expected):
# Parameters are passed as function arguments
ctx.input = prompt
ctx.output = process(prompt)
assert ctx.output == expected
Arbitrary parameters are not injected onto ctx. If you omit parameters from the function signature, you’ll get “unexpected keyword argument” errors.
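For example, this hypothetical function omits both parameters from its signature, so calling it fails:
@parametrize("prompt,expected", [("hi", "hello")])
def test_missing_params(ctx: EvalContext):
    # TypeError: got an unexpected keyword argument 'prompt'
    ...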
Running Specific Variants
Run a specific parametrized test:
# Run all variants
twevals run evals.py::test_math

# Run a specific variant by its ID (a numeric index, or a custom id if provided)
twevals run evals.py::test_math[0]
Combining with Other Options
@eval(
    dataset="sentiment",
    default_score_key="accuracy",
    metadata_from_params=["model"],
    timeout=5.0
)
@parametrize("model", ["gpt-4", "gpt-3.5"])
@parametrize("text,expected", [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
])
async def test_sentiment(ctx: EvalContext, model, text, expected):
    ctx.input = text
    ctx.metadata["text_length"] = len(text)
    ctx.output = await analyze_sentiment(text, model=model)
    assert ctx.output == expected
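Because the two @parametrize decorators stack, this produces 4 evaluations (2 models × 2 text cases).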
Data-Driven Testing
Load test cases from external sources:
import json

# Load from file
with open("test_cases.json") as f:
    test_cases = json.load(f)

@eval(dataset="qa")
@parametrize("question,answer", test_cases)
async def test_qa(ctx: EvalContext, question, answer):
    ctx.input = question
    ctx.reference = answer
    ctx.output = await my_agent(question)
    assert ctx.output == answer
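Here test_cases.json is assumed to contain a list of [question, answer] pairs; a list of dicts keyed by parameter name (as in Dictionary List above) should work the same way. A hypothetical file:
[
    ["What is 2+2?", "4"],
    ["What is the capital of France?", "Paris"]
]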
Example: Comprehensive Test Suite
from twevals import eval, parametrize, EvalContext

PROMPTS = [
    {"input": "Hello", "expected_intent": "greeting", "expected_sentiment": "positive"},
    {"input": "I need help", "expected_intent": "support", "expected_sentiment": "neutral"},
    {"input": "This is broken!", "expected_intent": "complaint", "expected_sentiment": "negative"},
]

@eval(dataset="intent_detection", default_score_key="accuracy")
@parametrize("input,expected_intent,expected_sentiment", PROMPTS)
async def test_intent(ctx: EvalContext, expected_intent, expected_sentiment):
    # ctx.input is auto-populated (special name)
    result = await detect_intent(ctx.input)
    ctx.output = result
    assert result["intent"] == expected_intent
    assert result["sentiment"] == expected_sentiment