EvalResult is the immutable data structure that represents a completed evaluation.
Schema
class EvalResult:
    input: Any                # Required: test input
    output: Any               # Required: system output
    reference: Any = None     # Optional: expected output
    scores: list[Score] = []  # Optional: list of scores
    error: str = None         # Optional: error message
    latency: float = None     # Optional: execution time in seconds
    metadata: dict = {}       # Optional: custom metadata
    run_data: dict = {}       # Optional: debug/trace data
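Only input and output are required; every other field falls back to the defaults above. A minimal construction:
from twevals import EvalResult

result = EvalResult(input="2 + 2", output="4")
print(result.scores)    # [] (default)
print(result.metadata)  # {} (default)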
Fields
input (required)
The input provided to the system under test.
EvalResult(
    input="What is the capital of France?",
    output="Paris"
)
Can be any type: string, dict, list, etc.
output (required)
The output produced by the system.
EvalResult(
    input="2 + 2",
    output={"answer": 4, "confidence": 0.99}
)
reference (optional)
The expected or reference output for comparison.
EvalResult(
    input="What is 2 + 2?",
    output="4",
    reference="4"
)
scores (optional)
List of Score objects representing evaluation metrics.
EvalResult(
    input="test",
    output="result",
    scores=[
        {"key": "accuracy", "passed": True},
        {"key": "latency", "value": 0.95}
    ]
)
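Note that the schema declares list[Score] while these examples pass plain dicts, which suggests dicts are coerced into Score objects on construction. If Score is importable (an assumption; this page only shows the EvalResult export), constructing it directly should be equivalent:
from twevals import Score  # assumption: export path not shown on this page

EvalResult(
    input="test",
    output="result",
    scores=[
        Score(key="accuracy", passed=True),  # boolean pass/fail metric
        Score(key="latency", value=0.95),    # numeric metric
    ],
)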
error (optional)
Error message if the evaluation failed with an exception.
EvalResult(
    input="test",
    output=None,
    error="TimeoutError: Evaluation exceeded 5.0 seconds"
)
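When constructing results yourself, a common pattern is to capture exceptions into error instead of letting them propagate; a minimal sketch, where run_system is a hypothetical stand-in for the system under test:
def evaluate_one(test_input):
    try:
        return EvalResult(input=test_input, output=run_system(test_input))
    except Exception as exc:
        # Record the failure on the result rather than raising
        return EvalResult(
            input=test_input,
            output=None,
            error=f"{type(exc).__name__}: {exc}",
        )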
latency (optional)
Execution time in seconds. Automatically measured by Twevals.
EvalResult(
    input="test",
    output="result",
    latency=0.234
)
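Automatic measurement applies when Twevals executes the evaluation; if you build results directly, you can time the call yourself. A sketch using the standard library, with run_system again standing in for your system:
import time

start = time.perf_counter()
output = run_system("test")
EvalResult(
    input="test",
    output=output,
    latency=time.perf_counter() - start,  # elapsed seconds
)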
metadata (optional)
Custom metadata for filtering and analysis.
EvalResult(
    input="test",
    output="result",
    metadata={
        "model": "gpt-4",
        "temperature": 0.7,
        "category": "factual"
    }
)
run_data (optional)
Debug and trace information. Not displayed in main UI views.
EvalResult(
    input="test",
    output="result",
    run_data={
        "tokens_used": 150,
        "api_calls": 2,
        "trace_id": "abc123"
    }
)
Creating Results
From EvalContext
The typical way is to let EvalContext build the result for you:
@eval(dataset="demo")
async def my_eval(ctx: EvalContext):
    ctx.input = "test"
    ctx.add_output(await my_agent(ctx.input))
    assert ctx.output, "Got empty response"
    # Returns EvalResult automatically
Direct Construction
For batch operations or custom flows:
from twevals import EvalResult
result = EvalResult(
    input="What is 2 + 2?",
    output="4",
    reference="4",
    scores=[{"key": "accuracy", "passed": True}],
    metadata={"difficulty": "easy"}
)
Multiple Results
Return a list from an evaluation:
@eval(dataset="batch")
def test_batch():
    return [
        EvalResult(input="a", output="1"),
        EvalResult(input="b", output="2"),
        EvalResult(input="c", output="3"),
    ]
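The same shape works for generating results from a small inline dataset; a sketch (the cases list and run_system are hypothetical):
@eval(dataset="arithmetic")
def test_arithmetic():
    cases = [("1 + 1", "2"), ("2 + 3", "5")]  # (input, reference) pairs
    return [
        EvalResult(
            input=question,
            output=run_system(question),
            reference=answer,
        )
        for question, answer in cases
    ]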
Accessing Fields
Results are immutable after creation:
result = EvalResult(input="test", output="result")
print(result.input) # "test"
print(result.output) # "result"
print(result.latency) # None (not set)
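Attempting to assign to a field after construction is rejected. The exact exception type depends on how the model enforces immutability, so this sketch catches broadly:
result = EvalResult(input="test", output="result")
try:
    result.output = "changed"
except Exception as exc:  # exact exception type is model-dependent
    print(f"EvalResult is immutable: {exc}")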
Serialization
Results serialize to JSON automatically:
result = EvalResult(
    input="test",
    output="result",
    scores=[{"key": "accuracy", "passed": True}]
)
# To dict
data = result.model_dump()
# To JSON string
json_str = result.model_dump_json()
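The model_dump/model_dump_json methods suggest EvalResult is a Pydantic v2 model; if so, the usual validators should round-trip a result, though this page does not confirm them:
# Assumption: standard Pydantic v2 validators are available
restored = EvalResult.model_validate(data)
also_restored = EvalResult.model_validate_json(json_str)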
Status Determination
A result is considered “passed” if any score has passed=True:
# Passed - has a passing score
EvalResult(input="x", output="y", scores=[{"key": "a", "passed": True}])
# Passed - mixed scores, but at least one passes
EvalResult(input="x", output="y", scores=[
    {"key": "a", "passed": True},
    {"key": "b", "passed": False}
])
# Failed - all scores fail
EvalResult(input="x", output="y", scores=[{"key": "a", "passed": False}])
# Failed - has error
EvalResult(input="x", output="y", error="Something went wrong")
The pass/fail logic checks whether any score has passed=True, not whether all scores pass. A mixed result with at least one passing score will show as passed.
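Equivalently, the rule behaves like Python's any over the passed flags, with an error forcing failure; a sketch of the logic (not the library's actual implementation), treating scores as dicts as in the examples above:
def is_passed(result) -> bool:
    if result.error:
        return False  # a result with an error counts as failed
    return any(score.get("passed") is True for score in result.scores)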
Complete Example
from twevals import EvalResult
result = EvalResult(
    input={
        "question": "What is the capital of France?",
        "context": "Geography quiz"
    },
    output="The capital of France is Paris.",
    reference="Paris",
    scores=[
        {
            "key": "accuracy",
            "passed": True,
            "notes": "Correct answer"
        },
        {
            "key": "brevity",
            "value": 0.7,
            "notes": "Could be more concise"
        }
    ],
    latency=0.234,
    metadata={
        "model": "gpt-4",
        "temperature": 0.0,
        "category": "geography"
    },
    run_data={
        "tokens": 45,
        "api_latency": 0.189
    }
)