Commands

Twevals has two main commands:
  • twevals serve - Start the web UI to browse and run evaluations interactively
  • twevals run - Run evaluations headlessly (for CI/CD pipelines)

twevals serve

Start the web UI to discover and run evaluations interactively.
twevals serve PATH [OPTIONS]
Where PATH can be:
  • A directory: twevals serve evals/
  • A file: twevals serve evals/customer_service.py
  • A specific function: twevals serve evals.py::test_refund
The UI opens automatically in your browser. Evaluations are discovered and displayed but not run until you click the Run button.

Options

-d, --dataset
string (default: "all")
Filter evaluations by dataset.
twevals serve evals/ --dataset customer_service
-l, --label
string (default: "all")
Filter evaluations by label. Can be specified multiple times.
twevals serve evals/ --label production -l critical
--results-dir
string (default: ".twevals/runs")
Directory for JSON results storage.
twevals serve evals/ --results-dir ./my-results
--port
integer (default: 8000)
Port for web UI server.
twevals serve evals/ --port 3000
--session
string
Name for this evaluation session. Groups related runs together.
twevals serve evals/ --session model-comparison

twevals run

Run evaluations headlessly. Outputs minimal text by default (optimized for LLM agents). Use --visual for rich table output.
twevals run PATH [OPTIONS]
Where PATH can be:
  • A directory: twevals run evals/
  • A file: twevals run evals/customer_service.py
  • A specific function: twevals run evals.py::test_refund
  • A parametrized variant: twevals run evals.py::test_math[2-3-5]

Filtering Options

-d, --dataset
string (default: "all")
Filter evaluations by dataset. Can be specified multiple times.
twevals run evals/ --dataset customer_service
twevals run evals/ -d customer_service -d technical_support
-l, --label
string (default: "all")
Filter evaluations by label. Can be specified multiple times.
twevals run evals/ --label production
twevals run evals/ -l production -l critical
--limit
integer
Limit the number of evaluations to run.
twevals run evals/ --limit 10

Execution Options

-c, --concurrency
integer (default: 1)
Number of concurrent evaluations. 1 means sequential execution.
# Run 4 evaluations in parallel
twevals run evals/ --concurrency 4
twevals run evals/ -c 4
--timeout
float
Global timeout in seconds for all evaluations.
twevals run evals/ --timeout 30.0

Output Options

-v, --verbose
flag
Show stdout from eval functions (print statements, logs).
twevals run evals/ --verbose
twevals run evals/ -v
--visual
flag
Show rich progress dots, results table, and summary. Without this flag, output is minimal.
twevals run evals/ --visual
-o, --output
string
Override the default results path. When specified, results are saved only to this path (not to .twevals/runs/).
twevals run evals/ --output results.json
twevals run evals/ -o results.json
--no-save
flag
Skip saving results to a file. Outputs JSON to stdout instead.
# Get results as JSON without writing to disk
twevals run evals/ --no-save | jq '.passed'

Session Options

--session
string
Name for this evaluation session. Groups related runs together.
twevals run evals/ --session model-comparison
--run-name
string
Name for this specific run. Used as a prefix for the results file name.
twevals run evals/ --session model-comparison --run-name gpt4-baseline

Examples

Start the Web UI

# Discover evals in a directory and open the UI
twevals serve evals/

# Start UI on a custom port
twevals serve evals/ --port 8080

# Filter what's shown in the UI
twevals serve evals/ --dataset qa --label production

Run All Evaluations

twevals run evals/

Run Specific File

twevals run evals/customer_service.py

Run Specific Function

twevals run evals/customer_service.py::test_refund

Run Parametrized Variant

twevals run evals/math.py::test_addition[2-3-5]

Filter by Dataset and Label

twevals run evals/ --dataset qa --label production

Run with Concurrency and Timeout

twevals run evals/ -c 8 --timeout 60.0

Export Results

# Results auto-save to .twevals/runs/ by default
twevals run evals/

# Override output path
twevals run evals/ -o results.json

Verbose Debug Run

# Show eval stdout and rich output
twevals run evals/ -v --visual --limit 5

Production CI Pipeline

# Minimal output for LLM agents/CI
twevals run evals/ -c 16 --timeout 120

Session Tracking

# Group runs under a session
twevals run evals/ --session model-comparison --run-name baseline

# Continue the session with another run
twevals run evals/ --session model-comparison --run-name improved

Configuration File

Twevals supports a twevals.json config file for persisting default CLI options. The file is auto-generated in your project root on first run.

Default Config

{
  "concurrency": 1,
  "results_dir": ".twevals/runs"
}

Supported Options

Option        Type      Description                         Used by
concurrency   integer   Number of concurrent evaluations    run
timeout       float     Global timeout in seconds           run
verbose       boolean   Show stdout from eval functions     run
results_dir   string    Directory for results storage       serve
port          integer   Web UI server port                  serve
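
For reference, a twevals.json that sets every supported option might look like the sketch below; the values are illustrative, not recommended defaults:
{
  "concurrency": 4,
  "timeout": 60.0,
  "verbose": false,
  "results_dir": ".twevals/runs",
  "port": 8000
}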

Precedence

CLI flags always override config values:
# Config has concurrency: 1, but this uses 4
twevals run evals/ -c 4

Editing via UI

Click the settings icon in the web UI header to view and edit config values. Changes are saved to twevals.json.

Exit Codes

Code      Meaning
0         Evaluations completed (regardless of pass/fail)
Non-zero  Error during execution (bad path, exceptions, etc.)
The CLI does not currently set non-zero exit codes for failed evaluations—only for execution errors. Check the JSON output or summary for pass/fail status.
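
Since the exit code alone won't tell you whether evaluations passed, a CI job can inspect the JSON itself. A minimal sketch using the documented --no-save flag and the top-level failed count (see Output Format below):
# Fail the pipeline if any evaluation failed
failed=$(twevals run evals/ --no-save | jq '.failed')
if [ "$failed" -gt 0 ]; then
  echo "Failed evaluations: $failed"
  exit 1
fi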

Environment Variables

Variable              Description
TWEVALS_CONCURRENCY   Default concurrency level
TWEVALS_TIMEOUT       Default timeout in seconds
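
These can be exported in a CI environment or set inline for a single invocation:
# Provide default concurrency and timeout via the environment
TWEVALS_CONCURRENCY=8 TWEVALS_TIMEOUT=60 twevals run evals/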

Output Format

Minimal Output (Default)

By default, twevals run outputs minimal text optimized for LLM agents and CI pipelines:
Running...
Results saved to .twevals/runs/swift-falcon_2024-01-15T10-30-00Z.json

Visual Output (--visual)

Use --visual for rich progress dots, results table, and summary:
Running...
customer_service.py ..F

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                     customer_service                           ┃
┣━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┫
┃ Name                ┃ Status   ┃ Score    ┃ Latency           ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ test_refund         │ ✓ passed │ 1.0      │ 0.23s             │
│ test_complaint      │ ✗ failed │ 0.0      │ 0.45s             │
└─────────────────────┴──────────┴──────────┴───────────────────┘

Summary: 1/2 passed (50.0%)

JSON File Output

Unless --no-save is used, results are saved as JSON to .twevals/runs/ (or a custom path via -o):
{
  "run_id": "2024-01-15T10-30-00Z",
  "total": 2,
  "passed": 1,
  "failed": 1,
  "results": [...]
}
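
The top-level fields above can be queried with jq (or any JSON parser); the shape of the individual entries inside results is not shown here:
# Quick summary of saved runs
jq '{run_id, total, passed, failed}' .twevals/runs/*.json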