The @parametrize decorator lets you generate multiple evaluations from one function, similar to pytest’s parametrize.

Basic Usage

from twevals import eval, parametrize, EvalContext

@eval(dataset="math")
@parametrize("a,b,expected", [
    (2, 3, 5),
    (10, 20, 30),
    (0, 0, 0),
])
def test_addition(ctx: EvalContext, a, b, expected):
    result = a + b
    ctx.input = f"{a} + {b}"
    ctx.output = result
    assert result == expected, f"Expected {expected}, got {result}"
This generates three evaluations with numeric IDs:
  • test_addition[0]
  • test_addition[1]
  • test_addition[2]
Without custom ids, test variants are numbered sequentially. Use the ids parameter for readable names.

Parameter Formats

Tuple List

@parametrize("name,age", [
    ("Alice", 30),
    ("Bob", 25),
])

Dictionary List

More readable for complex cases:
@parametrize("operation,a,b,expected", [
    {"operation": "add", "a": 2, "b": 3, "expected": 5},
    {"operation": "multiply", "a": 4, "b": 5, "expected": 20},
    {"operation": "subtract", "a": 10, "b": 3, "expected": 7},
])

Custom IDs

Name your test cases:
@parametrize("threshold", [0.2, 0.5, 0.8], ids=["low", "mid", "high"])
def test_thresholds(ctx: EvalContext, threshold):
    ...
Generates:
  • test_thresholds[low]
  • test_thresholds[mid]
  • test_thresholds[high]

Stacking Decorators (Cartesian Product)

Stack multiple @parametrize decorators for all combinations:
@eval(dataset="model_comparison")
@parametrize("model", ["gpt-4", "gpt-3.5", "claude"])
@parametrize("temperature", [0.0, 0.5, 1.0])
def test_models(ctx: EvalContext, model, temperature):
    ...
This generates 9 evaluations (3 models × 3 temperatures) with numeric IDs:
  • test_models[0]
  • test_models[1]
  • … through test_models[8]
For readable names, provide ids on each decorator.
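For example, a sketch with ids on both decorators (exactly how the per-decorator ids combine into a variant name is not shown here, so treat the resulting names as an assumption):

@eval(dataset="model_comparison")
@parametrize("model", ["gpt-4", "gpt-3.5", "claude"], ids=["gpt4", "gpt35", "claude"])
@parametrize("temperature", [0.0, 0.5, 1.0], ids=["cold", "warm", "hot"])
def test_models(ctx: EvalContext, model, temperature):
    # Each of the 9 combinations now gets a name built from the ids
    # above rather than a bare index.
    ...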

Auto-Mapping to Context

Only parameters with these special names automatically populate the context:
  • input → ctx.input
  • reference → ctx.reference
  • metadata → merged into ctx.metadata
  • run_data → ctx.run_data
  • latency → ctx.latency
@eval(dataset="qa")
@parametrize("input,reference", [
    ("What is 2+2?", "4"),
    ("What is the capital of France?", "Paris"),
])
async def test_qa(ctx: EvalContext):
    # ctx.input and ctx.reference are auto-populated
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference
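The example above only exercises input and reference. A minimal sketch of the metadata mapping, assuming it behaves as described (merged into ctx.metadata; the difficulty key is purely illustrative):
@eval(dataset="qa")
@parametrize("input,reference,metadata", [
    {"input": "What is 2+2?", "reference": "4", "metadata": {"difficulty": "easy"}},
    {"input": "What is the capital of France?", "reference": "Paris", "metadata": {"difficulty": "easy"}},
])
async def test_qa_tagged(ctx: EvalContext):
    # input and reference auto-populate ctx; metadata is merged into ctx.metadata
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference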

Accessing Non-Special Parameters

For parameters that aren’t special context fields, you must include them in your function signature:
@parametrize("prompt,expected", [...])
def test_example(ctx: EvalContext, prompt, expected):
    # Parameters are passed as function arguments
    ctx.input = prompt
    ctx.output = process(prompt)
    assert ctx.output == expected
Arbitrary parameters are not injected onto ctx. If you omit parameters from the function signature, you’ll get “unexpected keyword argument” errors.
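For example, this broken variant of the function above (the test_broken name is just for illustration) would fail, most likely with a TypeError, because expected is declared in @parametrize but missing from the signature:
@parametrize("prompt,expected", [...])
def test_broken(ctx: EvalContext, prompt):  # 'expected' is declared but not accepted
    ctx.input = prompt
    ctx.output = process(prompt)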

Running Specific Variants

Run a specific parametrized test:
# Run all variants
twevals run evals.py::test_math

# Run a specific variant by its ID
twevals run evals.py::test_math[0]

Combining with Other Options

@eval(
    dataset="sentiment",
    default_score_key="accuracy",
    metadata_from_params=["model"],
    timeout=5.0
)
@parametrize("model", ["gpt-4", "gpt-3.5"])
@parametrize("text,expected", [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
])
async def test_sentiment(ctx: EvalContext, model, text, expected):
    ctx.input = text
    ctx.metadata["text_length"] = len(text)
    ctx.output = await analyze_sentiment(text, model=model)
    assert ctx.output == expected

Data-Driven Testing

Load test cases from external sources:
import json

# Load from file
with open("test_cases.json") as f:
    test_cases = json.load(f)

@eval(dataset="qa")
@parametrize("question,answer", test_cases)
async def test_qa(ctx: EvalContext, question, answer):
    ctx.input = question
    ctx.reference = answer
    ctx.output = await my_agent(question)
    assert ctx.output == answer
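For this to work, test_cases.json has to decode into one of the parameter formats above: a list of [question, answer] pairs or a list of dicts keyed by question and answer (JSON arrays decode to Python lists, which are presumably accepted the same way as tuples). An equivalent inline definition of the data would be:
test_cases = [
    ["What is 2+2?", "4"],
    ["What is the capital of France?", "Paris"],
]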

Example: Comprehensive Test Suite

from twevals import eval, parametrize, EvalContext

PROMPTS = [
    {"input": "Hello", "expected_intent": "greeting", "expected_sentiment": "positive"},
    {"input": "I need help", "expected_intent": "support", "expected_sentiment": "neutral"},
    {"input": "This is broken!", "expected_intent": "complaint", "expected_sentiment": "negative"},
]

@eval(dataset="intent_detection", default_score_key="accuracy")
@parametrize("input,expected_intent,expected_sentiment", PROMPTS)
async def test_intent(ctx: EvalContext, expected_intent, expected_sentiment):
    # ctx.input is auto-populated (special name)
    result = await detect_intent(ctx.input)
    ctx.output = result

    assert result["intent"] == expected_intent
    assert result["sentiment"] == expected_sentiment