Score represents a single metric or assessment within an evaluation result.

Schema

class Score:
    key: str              # Required: metric identifier
    value: float = None   # Optional: numeric score
    passed: bool = None   # Optional: pass/fail status
    notes: str = None     # Optional: explanation
At least one of value or passed must be provided.
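
As a rough sketch (not the Twevals source), this schema and its "at least one of value or passed" rule could be written as a dataclass:
from dataclasses import dataclass
from typing import Optional

@dataclass
class Score:
    key: str
    value: Optional[float] = None   # numeric score
    passed: Optional[bool] = None   # pass/fail status
    notes: Optional[str] = None     # explanation

    def __post_init__(self):
        # Enforce the rule above: a Score needs a value, a passed flag, or both
        if self.value is None and self.passed is None:
            raise ValueError("Score requires at least one of 'value' or 'passed'")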

Fields

key (required)

A string identifier for the metric. Use descriptive names:
{"key": "accuracy"}
{"key": "response_time"}
{"key": "format_valid"}
{"key": "contains_keywords"}

value (optional)

A numeric score, typically in the 0-1 range:
{"key": "confidence", "value": 0.95}
{"key": "similarity", "value": 0.82}
{"key": "quality", "value": 0.7}

passed (optional)

A boolean indicating pass/fail:
{"key": "format_valid", "passed": True}
{"key": "safety_check", "passed": False}

notes (optional)

Human-readable explanation:
{
    "key": "accuracy",
    "passed": True,
    "notes": "Output matches reference exactly"
}

{
    "key": "length",
    "passed": False,
    "notes": "Response too short: 15 chars, expected 50+"
}

Score Types

Boolean Score

Simple pass/fail:
{"key": "correct", "passed": True}
{"key": "format_valid", "passed": False, "notes": "Missing closing bracket"}

Numeric Score

Continuous metric:
{"key": "confidence", "value": 0.87}
{"key": "similarity", "value": 0.92, "notes": "Cosine similarity"}

Combined Score

Both numeric and boolean:
{
    "key": "quality",
    "value": 0.75,
    "passed": True,  # Passes threshold
    "notes": "Score: 0.75 (threshold: 0.6)"
}

Creating Scores

Via add_score()

The recommended way:
@eval(dataset="demo", default_score_key="accuracy")
async def my_eval(ctx: EvalContext):
    # Boolean with notes
    ctx.add_score(True, "Exact match")

    # Numeric
    ctx.add_score(0.85, "Confidence score", key="confidence")

    # Boolean with custom key
    ctx.add_score(output.is_valid, "Valid format", key="format")

Direct Construction

For evaluators or manual creation:
similarity = compute_similarity(output, reference)
score = {
    "key": "semantic_similarity",
    "value": similarity,
    "passed": similarity > 0.8,
    "notes": f"Similarity: {similarity:.2f}"
}

Multiple Scores

An evaluation can have multiple scores:
@eval(dataset="qa")
async def test_answer(ctx: EvalContext):
    ctx.input = "What is Python?"
    ctx.add_output(await agent(ctx.input))

    ctx.add_score(is_correct, "Factually correct", key="accuracy")
    ctx.add_score(is_concise, "Under 100 words", key="brevity")
    ctx.add_score(confidence, "Model confidence", key="confidence")
    ctx.add_score(is_safe, "No harmful content", key="safety")
Result:
{
    "scores": [
        {"key": "accuracy", "passed": true, "notes": "Factually correct"},
        {"key": "brevity", "passed": true, "notes": "Under 100 words"},
        {"key": "confidence", "value": 0.92, "notes": "Model confidence"},
        {"key": "safety", "passed": true, "notes": "No harmful content"}
    ]
}

Score Aggregation

The CLI and Web UI aggregate scores:
Aggregation    Description
Pass Rate      % of scores where passed=True
Average        Mean of all value scores
By Key         Group scores by key for analysis
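
A minimal sketch of how these aggregations could be computed over plain score dicts (it mirrors the shapes shown above, not the Twevals internals):
from collections import defaultdict

def aggregate(scores: list[dict]) -> dict:
    passes = [s["passed"] for s in scores if s.get("passed") is not None]
    values = [s["value"] for s in scores if s.get("value") is not None]

    # Group scores by key for per-metric analysis
    by_key = defaultdict(list)
    for s in scores:
        by_key[s["key"]].append(s)

    return {
        "pass_rate": sum(passes) / len(passes) if passes else None,  # % of passed=True
        "average": sum(values) / len(values) if values else None,    # mean of value scores
        "by_key": dict(by_key),
    }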

Default Score Key

Set default_score_key on the decorator to name scores that are added without an explicit key:
@eval(default_score_key="correctness")
async def my_eval(ctx: EvalContext):
    ctx.input = "test"
    ctx.add_output("result")
    ctx.add_score(True, "All checks passed")
    # Score will have key="correctness"

Auto-Scoring

If no score is added, Twevals creates one automatically:
@eval(dataset="demo", default_score_key="success")
async def test_no_explicit_score(ctx: EvalContext):
    ctx.input = "test"
    ctx.add_output("result")
    # Auto-adds: {"key": "success", "passed": True}

Best Practices

Use consistent key names across evaluations:

# Good - consistent naming
ctx.add_score(..., key="accuracy")
ctx.add_score(..., key="format_valid")
ctx.add_score(..., key="response_time")

# Avoid - inconsistent
ctx.add_score(..., key="Accuracy")
ctx.add_score(..., key="is_format_valid")
ctx.add_score(..., key="responseTime")

Write notes that explain the result:

# Good - explains the result
ctx.add_score(
    False,
    f"Expected '{expected}', got '{actual}'",
    key="accuracy"
)

# Less helpful
ctx.add_score(False, "Failed", key="accuracy")

Normalize numeric values to the 0-1 range:

# Good - 0-1 range
ctx.add_score(similarity / 100, key="similarity")

# Harder to interpret
ctx.add_score(similarity, key="similarity")  # 0-100 range

Document thresholds for numeric metrics:

# Useful for thresholded metrics
score = 0.75
ctx.add_score(
    score,
    f"Score: {score} (threshold: 0.7)",
    key="quality"
)
# Also set passed based on the threshold (see the sketch below)
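
One way to record both is a sketch that mirrors the Combined Score shape above (the threshold value here is illustrative):
THRESHOLD = 0.7  # illustrative threshold
score = 0.75
quality_score = {
    "key": "quality",
    "value": score,
    "passed": score >= THRESHOLD,
    "notes": f"Score: {score} (threshold: {THRESHOLD})",
}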