Score represents a single metric or assessment within an evaluation result.

Schema

class Score:
    key: str              # Required: metric identifier
    value: float = None   # Optional: numeric score
    passed: bool = None   # Optional: pass/fail status
    notes: str = None     # Optional: explanation
At least one of value or passed must be provided.
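
As a rough sketch (not the Twevals source), this schema and its "at least one of value or passed" rule could be written as a dataclass:
from dataclasses import dataclass
from typing import Optional

@dataclass
class Score:
    key: str
    value: Optional[float] = None   # numeric score
    passed: Optional[bool] = None   # pass/fail status
    notes: Optional[str] = None     # explanation

    def __post_init__(self):
        # Enforce the rule above: a Score needs a value, a passed flag, or both
        if self.value is None and self.passed is None:
            raise ValueError("Score requires at least one of 'value' or 'passed'")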

Fields

key (required)

A string identifier for the metric. Use descriptive names:
{"key": "accuracy"}
{"key": "response_time"}
{"key": "format_valid"}
{"key": "contains_keywords"}

value (optional)

A numeric score, typically in the 0-1 range:
{"key": "confidence", "value": 0.95}
{"key": "similarity", "value": 0.82}
{"key": "quality", "value": 0.7}

passed (optional)

A boolean indicating pass/fail:
{"key": "format_valid", "passed": True}
{"key": "safety_check", "passed": False}

notes (optional)

Human-readable explanation:
{
    "key": "accuracy",
    "passed": True,
    "notes": "Output matches reference exactly"
}

{
    "key": "length",
    "passed": False,
    "notes": "Response too short: 15 chars, expected 50+"
}

Score Types

Boolean Score

Simple pass/fail:
{"key": "correct", "passed": True}
{"key": "format_valid", "passed": False, "notes": "Missing closing bracket"}

Numeric Score

Continuous metric:
{"key": "confidence", "value": 0.87}
{"key": "similarity", "value": 0.92, "notes": "Cosine similarity"}

Combined Score

Both numeric and boolean:
{
    "key": "quality",
    "value": 0.75,
    "passed": True,  # Passes threshold
    "notes": "Score: 0.75 (threshold: 0.6)"
}

Creating Scores

Via add_score()

The recommended way:
@eval(dataset="demo", default_score_key="accuracy")
async def my_eval(ctx: EvalContext):
    # Boolean with notes
    ctx.add_score(True, "Exact match")

    # Numeric
    ctx.add_score(0.85, "Confidence score", key="confidence")

    # Boolean with custom key
    ctx.add_score(output.is_valid, "Valid format", key="format")

Direct Construction

For evaluators or manual creation:
similarity = compute_similarity(output, reference)
score = {
    "key": "semantic_similarity",
    "value": similarity,
    "passed": similarity > 0.8,
    "notes": f"Similarity: {similarity:.2f}"
}

Multiple Scores

An evaluation can have multiple scores:
@eval(dataset="qa")
async def test_answer(ctx: EvalContext):
    ctx.input = "What is Python?"
    ctx.add_output(await agent(ctx.input))

    ctx.add_score(is_correct, "Factually correct", key="accuracy")
    ctx.add_score(is_concise, "Under 100 words", key="brevity")
    ctx.add_score(confidence, "Model confidence", key="confidence")
    ctx.add_score(is_safe, "No harmful content", key="safety")
Result:
{
    "scores": [
        {"key": "accuracy", "passed": true, "notes": "Factually correct"},
        {"key": "brevity", "passed": true, "notes": "Under 100 words"},
        {"key": "confidence", "value": 0.92, "notes": "Model confidence"},
        {"key": "safety", "passed": true, "notes": "No harmful content"}
    ]
}

Score Aggregation

The CLI and Web UI aggregate scores:
Aggregation    Description
Pass Rate      % of scores where passed=True
Average        Mean of all value scores
By Key         Group scores by key for analysis
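
A minimal sketch of how these aggregations could be computed over plain score dicts (it mirrors the shapes shown above, not the Twevals internals):
from collections import defaultdict

def aggregate(scores: list[dict]) -> dict:
    passes = [s["passed"] for s in scores if s.get("passed") is not None]
    values = [s["value"] for s in scores if s.get("value") is not None]

    # Group scores by key for per-metric analysis
    by_key = defaultdict(list)
    for s in scores:
        by_key[s["key"]].append(s)

    return {
        "pass_rate": sum(passes) / len(passes) if passes else None,  # % of passed=True
        "average": sum(values) / len(values) if values else None,    # mean of value scores
        "by_key": dict(by_key),
    }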

Default Score Key

Set default_score_key on the decorator to name scores that are added without an explicit key:
@eval(default_score_key="correctness")
async def my_eval(ctx: EvalContext):
    ctx.input = "test"
    ctx.add_output("result")
    ctx.add_score(True, "All checks passed")
    # Score will have key="correctness"

Auto-Scoring

If no score is added, Twevals creates one automatically:
@eval(dataset="demo", default_score_key="success")
async def test_no_explicit_score(ctx: EvalContext):
    ctx.input = "test"
    ctx.add_output("result")
    # Auto-adds: {"key": "success", "passed": True}

Best Practices

Use consistent key names across evaluations:

# Good - consistent naming
ctx.add_score(..., key="accuracy")
ctx.add_score(..., key="format_valid")
ctx.add_score(..., key="response_time")

# Avoid - inconsistent
ctx.add_score(..., key="Accuracy")
ctx.add_score(..., key="is_format_valid")
ctx.add_score(..., key="responseTime")

Write notes that explain the result:

# Good - explains the result
ctx.add_score(
    False,
    f"Expected '{expected}', got '{actual}'",
    key="accuracy"
)

# Less helpful
ctx.add_score(False, "Failed", key="accuracy")

Normalize numeric values to the 0-1 range:

# Good - 0-1 range
ctx.add_score(similarity / 100, key="similarity")

# Harder to interpret
ctx.add_score(similarity, key="similarity")  # 0-100 range

Document thresholds for numeric metrics:

# Useful for thresholded metrics
score = 0.75
ctx.add_score(
    score,
    f"Score: {score} (threshold: 0.7)",
    key="quality"
)
# Also set passed based on the threshold (see the sketch below)
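
One way to record both is a sketch that mirrors the Combined Score shape above (the threshold value here is illustrative):
THRESHOLD = 0.7  # illustrative threshold
score = 0.75
quality_score = {
    "key": "quality",
    "value": score,
    "passed": score >= THRESHOLD,
    "notes": f"Score: {score} (threshold: {THRESHOLD})",
}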