Sessions let you group related evaluation runs together, enabling workflows like model comparison, iterative debugging, and tracking progress over time.

Quick Start

# Start the web UI
twevals serve evals.py

# Run headlessly with session and run names
twevals run evals.py --session model-upgrade --run-name gpt5-baseline

# Continue the session with another run
twevals run evals.py --session model-upgrade --run-name gpt5-tuned

# Auto-generated friendly names
twevals run evals.py

Concepts

Session

A session groups related runs together and is identified by a name string. Passing --session model-upgrade again continues the same session; runs are linked simply by matching session names.

Run

A run is a single execution of your evaluations. Each run creates a new JSON file containing only the evals that ran in that execution.

Auto-Generated Names

When you don’t specify --session or --run-name, friendly adjective-noun names are generated automatically:
  • swift-falcon
  • bright-flame
  • gentle-whisper
This ensures every run has identifiable session and run names without requiring manual input.
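
For illustration, a minimal sketch of how such adjective-noun names can be generated (the word lists here are placeholders, not twevals' actual vocabulary):

import random

# Placeholder word lists; the real vocabulary may differ.
ADJECTIVES = ["swift", "bright", "gentle", "quiet", "bold"]
NOUNS = ["falcon", "flame", "whisper", "river", "comet"]

def friendly_name() -> str:
    """Return an adjective-noun name like 'swift-falcon'."""
    return f"{random.choice(ADJECTIVES)}-{random.choice(NOUNS)}"

print(friendly_name())  # e.g. 'bright-comet'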

File Naming

Run files are named with the pattern {run_name}_{timestamp}.json:
.twevals/runs/
├── gpt5-baseline_2024-01-15T10-30-00Z.json   # Named run
├── gpt5-tuned_2024-01-15T11-00-00Z.json      # Second run in session
├── swift-falcon_2024-01-15T14-45-00Z.json    # Auto-generated name
└── latest.json                                # Copy of most recent
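
As a sketch, a file name matching this pattern can be built from the run name and a UTC timestamp (the timestamp format is inferred from the examples above, with colons replaced by dashes to keep the name filesystem-safe):

from datetime import datetime, timezone

def run_filename(run_name: str) -> str:
    """Build a name like 'gpt5-baseline_2024-01-15T10-30-00Z.json'."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
    return f"{run_name}_{timestamp}.json"

print(run_filename("gpt5-baseline"))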

JSON Schema

Each run file includes session metadata at the top level:
{
  "session_name": "model-upgrade",
  "run_name": "gpt5-baseline",
  "run_id": "2024-01-15T10-30-00Z",
  "total_evaluations": 50,
  "total_passed": 45,
  "total_errors": 2,
  "results": [...]
}
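
These top-level fields make run files easy to post-process. A minimal sketch that loads a run file and reports its pass rate, assuming the default .twevals/runs/latest.json location shown above:

import json
from pathlib import Path

run = json.loads(Path(".twevals/runs/latest.json").read_text())

rate = run["total_passed"] / run["total_evaluations"]
print(f"{run['session_name']} / {run['run_name']}: "
      f"{run['total_passed']}/{run['total_evaluations']} passed ({rate:.0%})")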

UI Display

The stats bar shows the current session and run:
SESSION model-upgrade · RUN gpt5-baseline | TESTS 50 | ACCURACY 45/50 | ...

Use Cases

Compare performance across different models or configurations:
twevals run evals/ --session model-comparison --run-name gpt4-baseline
twevals run evals/ --session model-comparison --run-name claude-3
twevals run evals/ --session model-comparison --run-name gemini-pro
Each run’s JSON file contains the full results for that model, all grouped under the same session.
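
Because every run file carries its session_name, a short script can compare pass rates across the runs in a session. A sketch, assuming the default .twevals/runs/ directory and the schema above:

import json
from pathlib import Path

SESSION = "model-comparison"

for path in sorted(Path(".twevals/runs").glob("*.json")):
    if path.name == "latest.json":  # skip the copy of the most recent run
        continue
    run = json.loads(path.read_text())
    if run.get("session_name") != SESSION:
        continue
    rate = run["total_passed"] / run["total_evaluations"]
    print(f"{run['run_name']:>16}: {rate:.0%} ({run['total_errors']} errors)")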
Track progress while debugging or improving:
twevals run evals/ --session bug-fix-123 --run-name initial
# Make changes...
twevals run evals/ --session bug-fix-123 --run-name attempt-2
# More changes...
twevals run evals/ --session bug-fix-123 --run-name fixed
Use session names to track evaluation runs across releases or CI builds:
twevals run evals/ --session "release-${VERSION}" --run-name "build-${BUILD_ID}" -o results.json
Compare different prompt variations or approaches:
twevals run evals/ --session prompt-experiment --run-name prompt-v1
twevals run evals/ --session prompt-experiment --run-name prompt-v2-concise
twevals run evals/ --session prompt-experiment --run-name prompt-v3-detailed

API Endpoints

When running in serve mode, these endpoints are available for programmatic access:
Endpoint                      Method   Description
/api/sessions                 GET      List all unique session names
/api/sessions/{name}/runs     GET      List runs for a specific session
/api/runs/{run_id}            PATCH    Update run metadata (e.g., rename)

Example: List Sessions

curl http://localhost:8000/api/sessions

Response:
{
  "sessions": ["model-upgrade", "bug-fix-123", "prompt-experiment"]
}

Example: List Runs in Session

curl http://localhost:8000/api/sessions/model-upgrade/runs

Response:
{
  "session_name": "model-upgrade",
  "runs": [
    {
      "run_id": "2024-01-15T11-00-00Z",
      "run_name": "gpt5-tuned",
      "total_evaluations": 50,
      "total_passed": 48,
      "total_errors": 0
    },
    {
      "run_id": "2024-01-15T10-30-00Z", 
      "run_name": "gpt5-baseline",
      "total_evaluations": 50,
      "total_passed": 45,
      "total_errors": 2
    }
  ]
}
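
The same endpoints can be scripted. The sketch below lists the runs in a session and renames one; the requests dependency and the shape of the PATCH body ({"run_name": ...}) are assumptions, so check the API for the exact schema:

import requests

BASE = "http://localhost:8000"

# List runs recorded under the "model-upgrade" session.
runs = requests.get(f"{BASE}/api/sessions/model-upgrade/runs").json()["runs"]
for run in runs:
    print(run["run_name"], run["run_id"])

# Rename the first run returned (assumed request body).
requests.patch(f"{BASE}/api/runs/{runs[0]['run_id']}",
               json={"run_name": "gpt5-tuned-v2"}).raise_for_status()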

Best Practices

Naming Conventions: Use descriptive session names that indicate the purpose (e.g., model-comparison, bug-fix-123, release-v2.0). Run names should describe what’s different about that specific run.
Session Continuity: Sessions are identified purely by name. Using the same --session value automatically continues the session—no explicit “create” or “resume” needed.
Storage: Session metadata is stored in each run’s JSON file. There’s no separate session file—sessions are virtual groupings based on the session_name field.
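
To make that concrete, here is a sketch that rebuilds the session-to-runs mapping purely from the run files (paths and field names follow the examples on this page):

import json
from collections import defaultdict
from pathlib import Path

sessions = defaultdict(list)
for path in Path(".twevals/runs").glob("*.json"):
    if path.name == "latest.json":  # latest.json duplicates the newest run
        continue
    run = json.loads(path.read_text())
    sessions[run["session_name"]].append(run["run_name"])

for name, run_names in sessions.items():
    print(f"{name}: {', '.join(run_names)}")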