llmbench — benchmark any AI model · any provider

2026 — Open Source · Python · MIT

██╗     ██╗     ███╗   ███╗██████╗ ███████╗███╗   ██╗ ██████╗██╗  ██╗
██║     ██║     ████╗ ████║██╔══██╗██╔════╝████╗  ██║██╔════╝██║  ██║
██║     ██║     ██╔████╔██║██████╔╝█████╗  ██╔██╗ ██║██║     ███████║
██║     ██║     ██║╚██╔╝██║██╔══██╗██╔══╝  ██║╚██╗██║██║     ██╔══██║
███████╗███████╗██║ ╚═╝ ██║██████╔╝███████╗██║ ╚████║╚██████╗██║  ██║
╚══════╝╚══════╝╚═╝     ╚═╝╚═════╝ ╚══════╝╚═╝  ╚═══╝ ╚═════╝╚═╝  ╚═╝

Benchmark any AI model — any provider, one command. A CLI harness for measuring throughput, quality, and image generation, an agent loop with five sandboxed tasks, and a unified leaderboard explorer for published scores.

5Agentic tasks
9Provider adapters
4Leaderboard sources
110Passing tests

Install the CLI → Try the TUI Browse leaderboards Source on GitHub ↗

Install.

One command, any Python-capable machine. uvx and pipx run are Python's equivalents of npx.

Using uv

uvx llmbench

Using pipx

pipx run llmbench

From source

pip install llmbench

Python 3.11+. Full options, suite.yaml schema, and provider setup on GitHub.

Interactive.

A pixel-faithful clone of the questionary TUI. Use ↑ ↓ and ↵ to navigate, or click. The colors, banner, and menu structure match what you'll see when you run llmbench in your terminal.

$ llmbench

██╗     ██╗     ███╗   ███╗██████╗ ███████╗███╗   ██╗ ██████╗██╗  ██╗
██║     ██║     ████╗ ████║██╔══██╗██╔════╝████╗  ██║██╔════╝██║  ██║
██║     ██║     ██╔████╔██║██████╔╝█████╗  ██╔██╗ ██║██║     ███████║
██║     ██║     ██║╚██╔╝██║██╔══██╗██╔══╝  ██║╚██╗██║██║     ██╔══██║
███████╗███████╗██║ ╚═╝ ██║██████╔╝███████╗██║ ╚████║╚██████╗██║  ██║
╚══════╝╚══════╝╚═╝     ╚═╝╚═════╝ ╚══════╝╚═╝  ╚═══╝ ╚═════╝╚═╝  ╚═╝

benchmark any AI model · any provider · one command

API keys: Anthropic OpenAI Gemini Moonshot

Pure demo. Clicking through doesn't run benchmarks. Install llmbench to drive the real menu.

Capabilities.

Two evaluation modes, two outputs. Benchmarks compare models on raw throughput and quality. Agentic tasks run a model through a multi-step scenario in a sealed sandbox; the verdict is computed from post-run state, not text.

Mode A · single completion

Benchmarks

One prompt in, one completion out, repeated.

throughput TTFT, tokens/sec, inter-chunk latency, total latency, token usage.
quality_exact Deterministic check vs expected: exact, contains, regex.
quality_judge LLM-as-judge: 1–10 score with one-line reasoning.
image_gen Latency plus saved PNGs for visual review (e.g. CatBench).

Output results/<run_id>/gallery.html + SQLite (payload_json per row, queryable with json_extract).

Mode B · multi-step

Agentic tasks

Sandboxed scenarios, post-run verdicts.

file-refactor Rename a function across a 5-file mock project without breaking parsing.
api-orchestration GET a list, transform each row, POST to an audit endpoint with a field rename.
multi-step-research Synthesize four canned search results about a fictional company into a markdown brief.
recovery Retry a transient transactional failure and verify the side effect landed.
long-horizon Parse a config, fetch sources, and write a multi-section report (~15 steps).

Output runs/<run_id>.json · full TraceDocument (turns, tool calls, tokens, timing, cost, verdict).

Sandbox primitives fake_fs fake_http fake_sql fake_search fake_shell failure_injector

Verdicts read post-run sandbox state directly so a model can't text-its-way-through. Behavior flags (hallucinated_tool, excessive_http_calls, recovered_from_transient_failure, unregistered_search_query, unexpected_delete) are surfaced informationally per run.

Leaderboards.

Published model scores from HuggingFace Open LLM v2, LMArena ELO, Aider Polyglot, and a bundled offline snapshot — searchable in one place. Click a column to sort, toggle sources, or filter by name.

Loading…

#	Model	Organization	Source	Score	Metric

Published numbers aren't directly comparable to each other — different environments, different prompts, different scoring. Treat them as context, not ground truth. Refreshed daily by GitHub Actions.

Providers.

Chat, image, and local. One adapter per provider; new ones are typically ~40 lines.

Anthropicchat
OpenAIchat · image
Google Geminichat · image
Moonshotchat
Ollamalocal
vLLMlocal
LM Studiolocal
Black Forest Labsimage
OpenAI-compatibleany base_url