Benchmark any AI model — any provider, one command. A CLI harness for measuring throughput, quality, and image generation, an agent loop with five sandboxed tasks, and a unified leaderboard explorer for published scores.
One command, any Python-capable machine. uvx and pipx run are Python's equivalents of npx.
Using uv
uvx llmbench
Using pipx
pipx run llmbench
Using pip
pip install llmbench
Requires Python 3.11+. Full options, the suite.yaml schema, and provider setup are documented on GitHub.
Interactive.
§ 02 — Walk the menu
A pixel-faithful clone of the questionary TUI. Use ↑↓ and ↵ to navigate, or click. The colors, banner, and menu structure match what you'll see when you run llmbench in your terminal.
↑↓ navigate · ↵ select · Esc back · Tab exit demo · click to focus
Pure demo. Clicking through doesn't run benchmarks. Install llmbench to drive the real menu.
Capabilities.
§ 03 — What it measures
Two evaluation modes, two outputs. Benchmarks compare models on raw throughput and quality. Agentic tasks run a model through a multi-step scenario in a sealed sandbox; the verdict is computed from post-run state, not text.
Mode A · single completion
Benchmarks
One prompt in, one completion out, repeated.
throughput: TTFT (time to first token), tokens/sec, inter-chunk latency, total latency, token usage.
quality_exact: Deterministic check against the expected value (exact, contains, or regex).
quality_judge: LLM-as-judge scoring, 1–10 with one-line reasoning.
image_gen: Latency plus saved PNGs for visual review (e.g. CatBench).
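The throughput and quality_exact measurements can be sketched in a few lines of Python. This is an illustration only, not llmbench's actual code: the names Throughput, throughput_metrics, and check_exact are hypothetical, tokens/sec is assumed to be output tokens divided by total latency, and the exact mode is assumed to ignore surrounding whitespace.

```python
import re
from dataclasses import dataclass


@dataclass
class Throughput:
    ttft_s: float              # time to first token
    total_s: float             # total latency
    tokens_per_s: float        # assumption: output tokens / total latency
    mean_inter_chunk_s: float  # average gap between consecutive chunks


def throughput_metrics(start: float, chunk_times: list[float], output_tokens: int) -> Throughput:
    """Derive timing metrics from a request start time and chunk arrival times (seconds)."""
    if not chunk_times:
        raise ValueError("need at least one chunk")
    ttft = chunk_times[0] - start
    total = chunk_times[-1] - start
    gaps = [later - earlier for earlier, later in zip(chunk_times, chunk_times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    tokens_per_s = output_tokens / total if total > 0 else 0.0
    return Throughput(ttft, total, tokens_per_s, mean_gap)


def check_exact(output: str, expected: str, mode: str = "exact") -> bool:
    """Deterministic comparison of a completion against the expected value."""
    if mode == "exact":
        # Assumption: surrounding whitespace is not significant.
        return output.strip() == expected.strip()
    if mode == "contains":
        return expected in output
    if mode == "regex":
        return re.search(expected, output) is not None
    raise ValueError(f"unknown mode: {mode}")
```

For example, throughput_metrics(0.0, [0.5, 1.0, 2.0], 40) describes a request whose first chunk arrived after half a second and whose 40 output tokens finished streaming after two seconds.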
Mode B · multi-step
Agentic tasks
Sandboxed scenarios, post-run verdicts.
file-refactor: Rename a function across a 5-file mock project without breaking parsing.
api-orchestration: GET a list, transform each row, POST to an audit endpoint with a field rename.
multi-step-research: Synthesize four canned search results about a fictional company into a markdown brief.
recovery: Retry a transient transactional failure and verify the side effect landed.
long-horizon: Parse a config, fetch sources, and write a multi-section report (~15 steps).
Verdicts read post-run sandbox state directly, so a model can't talk its way to a pass. Behavior flags (hallucinated_tool, excessive_http_calls, recovered_from_transient_failure, unregistered_search_query, unexpected_delete) are reported per run as informational signals.
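To make the "post-run state, not text" idea concrete, here is a minimal sketch of how a verdict for a task like file-refactor could be computed. It is an assumption about the general approach, not llmbench's implementation: refactor_verdict and its return fields are hypothetical, and the mock project is assumed to consist of Python files.

```python
import ast
from pathlib import Path


def refactor_verdict(sandbox: Path, old_name: str, new_name: str) -> dict:
    """Inspect the sandbox after the run; the model's final message is never consulted."""
    parse_errors = []
    old_refs = 0
    new_refs = 0
    for path in sorted(sandbox.rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except SyntaxError:
            # A file that no longer parses fails the task outright.
            parse_errors.append(path.name)
            continue
        for node in ast.walk(tree):
            # Collect definitions, plain references, attribute accesses, and imports.
            names = []
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                names.append(node.name)
            elif isinstance(node, ast.Name):
                names.append(node.id)
            elif isinstance(node, ast.Attribute):
                names.append(node.attr)
            elif isinstance(node, ast.alias):
                names.append(node.name)
            old_refs += names.count(old_name)
            new_refs += names.count(new_name)
    passed = not parse_errors and old_refs == 0 and new_refs > 0
    return {
        "passed": passed,
        "parse_errors": parse_errors,
        "old_refs": old_refs,
        "new_refs": new_refs,
    }
```

Because the check walks the files left in the sandbox, a model that merely claims to have renamed the function still fails: the old name remains in the syntax tree.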
Leaderboards.
§ 04 — Published scores
Published model scores from HuggingFace Open LLM v2, LMArena ELO, Aider Polyglot, and a bundled offline snapshot — searchable in one place. Click a column to sort, toggle sources, or filter by name.
Columns: #, Model, Organization, Source, Score, Metric.
Published numbers aren't directly comparable to each other — different environments, different prompts, different scoring. Treat them as context, not ground truth. Refreshed daily by GitHub Actions.
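One way to respect that caveat when working with the data is to rank models only within a single source and never across sources. The sketch below assumes each row is a dict with model, source, and score keys and that higher scores are better; the function name and row shape are hypothetical, not the explorer's actual data format.

```python
from collections import defaultdict


def rank_within_source(rows: list[dict], name_filter: str = "") -> dict[str, list[dict]]:
    """Filter rows by model name, then sort each source's rows independently."""
    needle = name_filter.lower()
    by_source: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        if needle in row["model"].lower():
            by_source[row["source"]].append(row)
    # Scores are only compared against other scores from the same source.
    return {
        source: sorted(items, key=lambda r: r["score"], reverse=True)
        for source, items in by_source.items()
    }


# Placeholder rows for illustration only; these are not real published scores.
rows = [
    {"model": "model-a", "source": "source-1", "score": 10.0},
    {"model": "model-b", "source": "source-1", "score": 20.0},
    {"model": "model-a", "source": "source-2", "score": 1200.0},
]
ranked = rank_within_source(rows)
```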
Providers.
§ 05 — Reach
Chat, image, and local. One adapter per provider; new ones are typically ~40 lines.
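As a rough picture of what "one adapter per provider" can look like, here is a sketch of a small adapter interface with a stand-in provider. Everything in it is hypothetical (ChatAdapter, EchoAdapter, ADAPTERS, run); the real adapter interface is whatever the project defines, and a real adapter would call the provider's API instead of echoing the prompt.

```python
from typing import Iterator, Protocol


class ChatAdapter(Protocol):
    """Hypothetical interface: stream a completion as text chunks."""

    name: str

    def stream(self, model: str, prompt: str) -> Iterator[str]: ...


class EchoAdapter:
    """Stand-in provider that streams the prompt back word by word (no network)."""

    name = "echo"

    def stream(self, model: str, prompt: str) -> Iterator[str]:
        for word in prompt.split():
            yield word + " "


# Hypothetical registry: the harness looks adapters up by provider name.
ADAPTERS: dict[str, ChatAdapter] = {EchoAdapter.name: EchoAdapter()}


def run(provider: str, model: str, prompt: str) -> str:
    """Collect a streamed completion into a single string."""
    return "".join(ADAPTERS[provider].stream(model, prompt)).strip()
```

Keeping the interface this narrow is what would let a new provider stay small: the harness only needs a name and a way to stream chunks, so timing and scoring logic can live outside the adapter.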