Twevals uses assertions as the primary way to score evaluations. If you’ve written pytest tests, you already know how to score in Twevals.

Assertions

Use Python’s assert statement to validate your outputs:
@eval(input="What is the capital of France?", dataset="qa")
async def test_answer_quality(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert "paris" in ctx.output.lower(), "Should mention Paris"
When assertions pass, your eval passes. When they fail, the assertion message becomes the failure reason in your results.

Multiple Assertions

Chain assertions to check multiple conditions:
@eval(input="I want a refund", dataset="customer_service")
async def test_refund_response(ctx: EvalContext):
    ctx.output = await support_agent(ctx.input)

    assert len(ctx.output) > 20, "Response too short"
    assert "refund" in ctx.output.lower(), "Should acknowledge refund request"
    assert "sorry" in ctx.output.lower() or "apologize" in ctx.output.lower(), \
        "Should express empathy"

Comparing to Reference

Use ctx.reference for expected output comparisons:
@eval(input="2 + 2", reference="4", dataset="math")
async def test_arithmetic(ctx: EvalContext):
    ctx.output = await calculator(ctx.input)
    assert ctx.output == ctx.reference

With Parametrize

Assertions work naturally with parametrized tests:
@eval(dataset="sentiment")
@parametrize("input,reference", [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
    ("It's okay", "neutral"),
])
async def test_sentiment(ctx: EvalContext):
    ctx.output = await classify_sentiment(ctx.input)
    assert ctx.output == ctx.reference

How Assertions Become Scores

When an assertion fails:
  1. The eval doesn’t crash; the failure is caught gracefully
  2. A failing score is created with the assertion message as notes
  3. Your ctx.input and ctx.output are preserved for debugging
# If this assertion fails:
assert ctx.output == "Paris", "Wrong capital"

# It creates this score:
{
    "key": "correctness",
    "passed": False,
    "notes": "Wrong capital"
}
When all assertions pass, a passing score is automatically added.

add_score() (For Advanced Cases)

Use add_score() when you need more control than assertions provide:
  • Numeric scores (confidence values, similarity scores)
  • Multiple named metrics per evaluation
  • Non-binary scoring (partial credit)

Numeric Scores

@eval(input="Classify this text", dataset="classification")
async def test_with_confidence(ctx: EvalContext):
    result = await classifier(ctx.input)
    ctx.output = result["label"]
    ctx.add_score(result["confidence"], "Model confidence")
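Assuming the first positional argument becomes the score’s value and the second its notes (matching the Score Structure Reference below), a confidence of 0.93 would presumably be recorded roughly like this:
# Hypothetical resulting score (a sketch, not verified against Twevals internals):
{
    "key": "correctness",    # default key, since no key= was given
    "value": 0.93,
    "notes": "Model confidence"
}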

Multiple Named Metrics

@eval(input="Explain quantum computing", dataset="qa")
async def test_comprehensive(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)

    ctx.add_score("quantum" in ctx.output.lower(), "Mentions quantum", key="relevance")
    ctx.add_score(len(ctx.output) < 500, f"Length: {len(ctx.output)}", key="brevity")
    ctx.add_score(calculate_similarity(ctx.output, ctx.reference), "Semantic similarity", key="similarity")

Combining Assertions and add_score

You can use both in the same eval:
@eval(input="Generate a summary", dataset="demo")
async def test_mixed(ctx: EvalContext):
    ctx.output = await summarizer(ctx.input)

    # Must-pass requirements as assertions
    assert ctx.output is not None, "Got no output"
    assert len(ctx.output) > 10, "Output too short"

    # Numeric quality metric
    ctx.add_score(quality_score(ctx.output), "Quality score", key="quality")
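Presumably this eval ends up with two scores: the automatic correctness score from the passing assertions, plus the explicit quality metric. A rough sketch of the result, assuming quality_score returned 0.8:
# Hypothetical scores for the eval above:
[
    {"key": "correctness", "passed": True},
    {"key": "quality", "value": 0.8, "notes": "Quality score"},
]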

Evaluators (Post-Processing)

Evaluators are functions that run after the eval completes and add additional scores:
import json

def check_format(result):
    """Check if output is valid JSON."""
    try:
        json.loads(result.output)
        return {"key": "format", "passed": True}
    except (json.JSONDecodeError, TypeError):
        return {"key": "format", "passed": False, "notes": "Invalid JSON"}

@eval(input="Get user data", dataset="api", evaluators=[check_format])
async def test_json_response(ctx: EvalContext):
    ctx.output = await api_call(ctx.input)
    assert "user" in ctx.output, "Should contain user data"
Evaluators are useful for reusable checks across many evals.
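Because an evaluator is just a function, the same check can be attached to any number of evals. A minimal sketch reusing check_format from above (both evals and the api_call agent are illustrative placeholders):
@eval(input="Get user profile", dataset="api", evaluators=[check_format])
async def test_user_endpoint(ctx: EvalContext):
    ctx.output = await api_call(ctx.input)

@eval(input="List recent orders", dataset="api", evaluators=[check_format])
async def test_orders_endpoint(ctx: EvalContext):
    ctx.output = await api_call(ctx.input)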

Score Structure Reference

Each score has:
{
    "key": "metric_name",    # Identifier (default: "correctness")
    "value": 0.95,           # Optional: numeric score (0-1 range)
    "passed": True,          # Optional: boolean pass/fail
    "notes": "Explanation"   # Optional: human-readable notes
}
Every score must have at least one of value or passed.
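With add_score(), that translates into two basic forms, assuming a numeric first argument is stored as value and a boolean one as passed (as the examples above suggest):
# Numeric score -> stored as "value" (assumed)
ctx.add_score(0.85, "Cosine similarity", key="similarity")
# Boolean score -> stored as "passed" (assumed)
ctx.add_score(True, "Contains a citation", key="citations")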

When to Use What

Scenario | Use
Simple pass/fail checks | assert
Multiple conditions that must all pass | Multiple assert statements
Numeric confidence/similarity scores | add_score(0.85, ...)
Multiple independent metrics | add_score(..., key="metric_name")
Reusable checks across evals | Evaluators