The @parametrize decorator lets you generate multiple evaluations from one function, similar to pytest’s parametrize.
Basic Usage
from twevals import eval, parametrize, EvalContext

@eval(dataset="math")
@parametrize("a,b,expected", [
    (2, 3, 5),
    (10, 20, 30),
    (0, 0, 0),
])
def test_addition(ctx: EvalContext, a, b, expected):
    result = a + b
    ctx.input = f"{a} + {b}"
    ctx.output = result
    assert result == expected, f"Expected {expected}, got {result}"
This generates three evaluations with numeric IDs:
test_addition[0]
test_addition[1]
test_addition[2]
Without custom ids, test variants are numbered sequentially. Use the ids parameter for readable names.
Tuple List
@parametrize("name,age", [
("Alice", 30),
("Bob", 25),
])
Dictionary List
More readable for complex cases:
@parametrize("operation,a,b,expected", [
{"operation": "add", "a": 2, "b": 3, "expected": 5},
{"operation": "multiply", "a": 4, "b": 5, "expected": 20},
{"operation": "subtract", "a": 10, "b": 3, "expected": 7},
])
Custom IDs
Name your test cases:
@parametrize("threshold", [0.2, 0.5, 0.8], ids=["low", "mid", "high"])
def test_thresholds(ctx: EvalContext, threshold):
    ...
Generates:
test_thresholds[low]
test_thresholds[mid]
test_thresholds[high]
Stacking Decorators (Cartesian Product)
Stack multiple @parametrize decorators for all combinations:
@eval(dataset="model_comparison")
@parametrize("model", ["gpt-4", "gpt-3.5", "claude"])
@parametrize("temperature", [0.0, 0.5, 1.0])
def test_models(ctx: EvalContext, model, temperature):
    ...
This generates 9 evaluations (3 models × 3 temperatures) with numeric IDs:
test_models[0]
test_models[1]
…
test_models[8]
For readable names, provide ids on each decorator.
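For example, a minimal sketch with ids supplied on both decorators (the label names here are illustrative):
@eval(dataset="model_comparison")
@parametrize("model", ["gpt-4", "gpt-3.5", "claude"], ids=["gpt4", "gpt35", "claude"])
@parametrize("temperature", [0.0, 0.5, 1.0], ids=["cold", "warm", "hot"])
def test_models(ctx: EvalContext, model, temperature):
    ...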
Auto-Mapping to Context
Only parameters with these special names automatically populate the context:
input → ctx.input
reference → ctx.reference
metadata → merged into ctx.metadata
run_data → ctx.run_data
latency → ctx.latency
@eval(dataset="qa")
@parametrize("input,reference", [
("What is 2+2?", "4"),
("What is the capital of France?", "Paris"),
])
async def test_qa(ctx: EvalContext):
# ctx.input and ctx.reference are auto-populated
ctx.output = await my_agent(ctx.input)
assert ctx.output == ctx.reference
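The same mapping applies to metadata, run_data, and latency. A minimal sketch using metadata (the function name and per-case values are illustrative):
@eval(dataset="qa")
@parametrize("input,reference,metadata", [
    ("What is 2+2?", "4", {"difficulty": "easy"}),
    ("What is 17 * 23?", "391", {"difficulty": "medium"}),
])
async def test_qa_tagged(ctx: EvalContext):
    # input -> ctx.input, reference -> ctx.reference,
    # and each case's metadata dict is merged into ctx.metadata
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference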
Accessing Non-Special Parameters
For parameters that aren’t special context fields, you must include them in your function signature:
@parametrize("prompt,expected", [...])
def test_example(ctx: EvalContext, prompt, expected):
# Parameters are passed as function arguments
ctx.input = prompt
ctx.output = process(prompt)
assert ctx.output == expected
Arbitrary parameters are not injected onto ctx. If you omit parameters from the function signature, you’ll get “unexpected keyword argument” errors.
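For example, this hypothetical function omits both parameters from its signature, so calling it fails:
@parametrize("prompt,expected", [("hi", "hello")])
def test_missing_params(ctx: EvalContext):
    # TypeError: got an unexpected keyword argument 'prompt'
    ...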
Running Specific Variants
Run a specific parametrized test:
# Run all variants
twevals run evals.py::test_math

# Run a specific variant by its ID (a numeric index, or a custom id if provided)
twevals run evals.py::test_math[0]
Combining with Other Options
@eval(
    dataset="sentiment",
    default_score_key="accuracy",
    metadata_from_params=["model"],
    timeout=5.0
)
@parametrize("model", ["gpt-4", "gpt-3.5"])
@parametrize("text,expected", [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
])
async def test_sentiment(ctx: EvalContext, model, text, expected):
    ctx.input = text
    ctx.metadata["text_length"] = len(text)
    ctx.output = await analyze_sentiment(text, model=model)
    assert ctx.output == expected
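Because the two @parametrize decorators stack, this produces 4 evaluations (2 models × 2 text cases).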
Data-Driven Testing
Load test cases from external sources:
import json

# Load from file
with open("test_cases.json") as f:
    test_cases = json.load(f)

@eval(dataset="qa")
@parametrize("question,answer", test_cases)
async def test_qa(ctx: EvalContext, question, answer):
    ctx.input = question
    ctx.reference = answer
    ctx.output = await my_agent(question)
    assert ctx.output == answer
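Here test_cases.json is assumed to contain a list of [question, answer] pairs; a list of dicts keyed by parameter name (as in Dictionary List above) should work the same way. A hypothetical file:
[
    ["What is 2+2?", "4"],
    ["What is the capital of France?", "Paris"]
]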
Example: Comprehensive Test Suite
from twevals import eval, parametrize, EvalContext

PROMPTS = [
    {"input": "Hello", "expected_intent": "greeting", "expected_sentiment": "positive"},
    {"input": "I need help", "expected_intent": "support", "expected_sentiment": "neutral"},
    {"input": "This is broken!", "expected_intent": "complaint", "expected_sentiment": "negative"},
]

@eval(dataset="intent_detection", default_score_key="accuracy")
@parametrize("input,expected_intent,expected_sentiment", PROMPTS)
async def test_intent(ctx: EvalContext, expected_intent, expected_sentiment):
    # ctx.input is auto-populated (special name)
    result = await detect_intent(ctx.input)
    ctx.output = result
    assert result["intent"] == expected_intent
    assert result["sentiment"] == expected_sentiment