b. Bryan Zane Smith / llmbench / Vol. 04

Issue 04 · 2026 — Open Source · Python · MIT

Benchmark any AI model — any provider, one command. A CLI harness for measuring throughput, quality, and image generation, an agent loop with five sandboxed tasks, and a unified leaderboard explorer for published scores.

Install.

§ 01 — Zero setup

One command, any Python-capable machine. uvx and pipx run are Python's equivalents of npx.

Using uv
$ uvx llmbench

Using pipx
$ pipx run llmbench

Using pip
$ pip install llmbench

Requires Python 3.11+. Full options, the suite.yaml schema, and provider setup are documented on GitHub.

Interactive.

§ 02 — Walk the menu

A pixel-faithful clone of the questionary TUI. Use the arrow keys to navigate, or click. The colors, banner, and menu structure match what you'll see when you run llmbench in your terminal.

$ llmbench
benchmark any AI model · any provider · one command
API keys: Anthropic · OpenAI · Gemini · Moonshot

Pure demo. Clicking through doesn't run benchmarks. Install llmbench to drive the real menu.

Capabilities.

§ 03 — What it measures

Two evaluation modes, two outputs. Benchmarks compare models on raw throughput and quality. Agentic tasks run a model through a multi-step scenario in a sealed sandbox; the verdict is computed from post-run state, not text.

Mode A · single completion

Benchmarks

One prompt in, one completion out, repeated.

  • throughput: time to first token (TTFT), tokens/sec, inter-chunk latency, total latency, token usage.
  • quality_exact: deterministic check against the expected answer (exact, contains, or regex).
  • quality_judge: LLM-as-judge, a 1–10 score with one-line reasoning.
  • image_gen: latency plus saved PNGs for visual review (e.g. CatBench).
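
The throughput metrics above can all be derived from a request start time and the arrival time of each streamed chunk. The following is a minimal sketch, not llmbench's own code: `ThroughputStats` and `summarize_stream` are hypothetical names, and measuring tokens/sec over the window after the first chunk is an assumption (some tools divide by total latency instead).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ThroughputStats:
    ttft_s: float              # time to first token: start to first chunk
    total_latency_s: float     # start to last chunk
    mean_inter_chunk_s: float  # average gap between consecutive chunks
    tokens_per_sec: float      # output tokens over the generation window


def summarize_stream(start: float, chunk_times: list[float],
                     output_tokens: int) -> ThroughputStats:
    """Derive throughput metrics from chunk arrival timestamps (seconds)."""
    if not chunk_times:
        raise ValueError("need at least one chunk")
    ttft = chunk_times[0] - start
    total = chunk_times[-1] - start
    gaps = [later - earlier
            for earlier, later in zip(chunk_times, chunk_times[1:])]
    inter_chunk = mean(gaps) if gaps else 0.0
    # Assumption: rate is measured after the first chunk arrives.
    # With a single chunk there is no such window, so use total latency.
    window = total - ttft
    if window > 0:
        rate = output_tokens / window
    elif total > 0:
        rate = output_tokens / total
    else:
        rate = 0.0
    return ThroughputStats(ttft, total, inter_chunk, rate)


stats = summarize_stream(0.0, [0.5, 0.75, 1.0, 1.5], output_tokens=40)
print(stats)
```

In a real run the timestamps would come from `time.perf_counter()` sampled as each chunk arrives.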
Output: results/<run_id>/gallery.html plus a SQLite database (payload_json per row, queryable with json_extract).
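
Because each row carries its metrics as JSON, they can be pulled out with SQLite's `json_extract`. This sketch shows one way such a query might look. Only the `payload_json` column comes from the description above; the table name `results`, the other columns, the payload keys, and the rows are placeholders invented for illustration, not llmbench's actual schema or measurements.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: only payload_json is named in the description above.
conn.execute(
    "CREATE TABLE results (run_id TEXT, model TEXT, payload_json TEXT)"
)
placeholder_rows = [
    ("demo-run", "model-a", json.dumps({"ttft_ms": 300, "tokens_per_sec": 50.0})),
    ("demo-run", "model-b", json.dumps({"ttft_ms": 200, "tokens_per_sec": 80.0})),
]
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", placeholder_rows)

# json_extract takes the JSON text and a path such as '$.key'.
query = """
    SELECT model, json_extract(payload_json, '$.tokens_per_sec') AS tps
    FROM results
    WHERE run_id = ?
    ORDER BY tps DESC
"""
fastest_first = conn.execute(query, ("demo-run",)).fetchall()
print(fastest_first)
```

Keeping the payload as one JSON column means new metrics need no schema migration; the trade-off is that queries must know the key names.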

Mode B · multi-step

Agentic tasks

Sandboxed scenarios, post-run verdicts.

  • file-refactor: rename a function across a 5-file mock project without breaking parsing.
  • api-orchestration: GET a list, transform each row, POST to an audit endpoint with a field rename.
  • multi-step-research: synthesize four canned search results about a fictional company into a markdown brief.
  • recovery: retry a transient transactional failure and verify the side effect landed.
  • long-horizon: parse a config, fetch sources, and write a multi-section report (~15 steps).
Output: runs/<run_id>.json, a full TraceDocument (turns, tool calls, tokens, timing, cost, verdict).
Sandbox primitives: fake_fs, fake_http, fake_sql, fake_search, fake_shell, failure_injector.
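
To make the recovery task concrete, here is a rough sketch of how a failure injector and a retry loop could interact. Every class and function name below is hypothetical; llmbench's real `failure_injector` and `fake_sql` interfaces may look quite different.

```python
class TransientError(Exception):
    """Simulated retryable failure."""


class FailureInjector:
    """Hypothetical injector: fails the first `fail_times` calls, then passes."""

    def __init__(self, fail_times: int) -> None:
        self.remaining = fail_times
        self.raised = 0

    def maybe_fail(self) -> None:
        if self.remaining > 0:
            self.remaining -= 1
            self.raised += 1
            raise TransientError("simulated transient failure")


class FakeLedger:
    """Stand-in for a transactional store; a failed commit writes nothing."""

    def __init__(self, injector: FailureInjector) -> None:
        self.injector = injector
        self.rows: list[dict] = []

    def commit(self, row: dict) -> None:
        self.injector.maybe_fail()  # raises before the write: no partial state
        self.rows.append(row)


def commit_with_retry(ledger: FakeLedger, row: dict, max_attempts: int = 3) -> int:
    """Retry on TransientError; return the attempt number that succeeded."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            ledger.commit(row)
            return attempt
        except TransientError:
            if attempt == max_attempts:
                raise
    raise RuntimeError("unreachable")


def recovery_verdict(ledger: FakeLedger, expected: dict) -> dict:
    """Judge from post-run state: the row must exist exactly once."""
    landed = ledger.rows.count(expected) == 1
    return {
        "passed": landed,
        "recovered_from_transient_failure": landed and ledger.injector.raised > 0,
    }


ledger = FakeLedger(FailureInjector(fail_times=2))
row = {"id": 1, "amount": 10}
attempts = commit_with_retry(ledger, row)
print(attempts, recovery_verdict(ledger, row))
```

Checking that the row exists exactly once also catches an agent that retries blindly and writes the side effect twice.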

Verdicts read post-run sandbox state directly, so a model can't talk its way to a pass. Behavior flags (hallucinated_tool, excessive_http_calls, recovered_from_transient_failure, unregistered_search_query, unexpected_delete) are reported per run as information.
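
As an illustration of a state-based verdict, the sketch below judges a file-refactor run purely from the files left in the sandbox. It assumes the fake filesystem can be read as a mapping of path to source text; `identifiers` and `refactor_verdict` are hypothetical names, and the actual checks llmbench performs may differ.

```python
import ast


def identifiers(source: str) -> set[str]:
    """Names defined, imported, or referenced in one module.

    Raises SyntaxError if the source does not parse.
    """
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found.add(node.name)
        elif isinstance(node, ast.Name):
            found.add(node.id)
        elif isinstance(node, ast.Attribute):
            found.add(node.attr)
        elif isinstance(node, ast.alias):
            found.add(node.name)
    return found


def refactor_verdict(fake_fs: dict[str, str], old: str, new: str) -> dict:
    """Pass only if every file parses, `old` is gone, and `new` exists.

    The model's final message is never consulted.
    """
    seen: set[str] = set()
    for path, source in fake_fs.items():
        try:
            seen |= identifiers(source)
        except SyntaxError:
            return {"passed": False, "reason": f"{path} no longer parses"}
    if old in seen:
        return {"passed": False, "reason": f"{old} is still referenced"}
    if new not in seen:
        return {"passed": False, "reason": f"{new} was never introduced"}
    return {"passed": True, "reason": "renamed everywhere, all files parse"}


after_run = {
    "util.py": "def load_config():\n    return {}\n",
    "main.py": "from util import load_config\nprint(load_config())\n",
}
print(refactor_verdict(after_run, old="load_cfg", new="load_config"))
```

A model that replies "done" without editing `main.py` would fail here, because the stale import is still visible in the post-run files.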

Leaderboards.

§ 04 — Published scores

Published model scores from HuggingFace Open LLM v2, LMArena ELO, Aider Polyglot, and a bundled offline snapshot — searchable in one place. Click a column to sort, toggle sources, or filter by name.

Columns: #, Model, Organization, Source, Score, Metric

Published numbers aren't directly comparable to each other — different environments, different prompts, different scoring. Treat them as context, not ground truth. Refreshed daily by GitHub Actions.
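
The explorer's controls (sort by column, toggle sources, filter by name) amount to a small query over unified rows. The sketch below shows that idea under stated assumptions: `Row` and `explore` are hypothetical names, the field names mirror the table columns, and every model, organization, and score in it is a made-up placeholder rather than a published result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    organization: str
    source: str
    score: float
    metric: str


def explore(rows, sources=None, name_filter="", sort_key="score", descending=True):
    """Filter by source and model-name substring, then sort by one column."""
    needle = name_filter.lower()
    kept = [
        r for r in rows
        if (sources is None or r.source in sources) and needle in r.model.lower()
    ]
    return sorted(kept, key=lambda r: getattr(r, sort_key), reverse=descending)


# Placeholder rows only: not real models or published scores.
rows = [
    Row("alpha-7b", "org-x", "LMArena ELO", 1200.0, "elo"),
    Row("beta-13b", "org-y", "LMArena ELO", 1250.0, "elo"),
    Row("alpha-7b", "org-x", "Aider Polyglot", 40.0, "percent"),
]

for row in explore(rows, sources={"LMArena ELO"}):
    print(row.model, row.score, row.metric)
```

The sketch deliberately does no cross-source normalization. Sorting mixed sources by score puts an ELO rating next to a percentage, which is why restricting to one source before sorting is usually the more meaningful view.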

Providers.

§ 05 — Reach

Chat, image, and local. One adapter per provider; new ones are typically ~40 lines.
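
To give a feel for why an adapter can stay small, here is a sketch of what a one-adapter-per-provider interface might look like. The `ChatAdapter` protocol, `Chunk`, `REGISTRY`, and the offline `EchoAdapter` are all assumptions made for illustration; llmbench's real adapter interface is documented in its repository and may differ.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass
class Chunk:
    text: str


class ChatAdapter(Protocol):
    """Hypothetical contract: a provider name plus one streaming method."""

    name: str

    def stream(self, model: str, prompt: str) -> Iterator[Chunk]: ...


class EchoAdapter:
    """Offline stand-in provider: streams the prompt back word by word."""

    name = "echo"

    def stream(self, model: str, prompt: str) -> Iterator[Chunk]:
        for word in prompt.split():
            yield Chunk(text=word + " ")


REGISTRY: dict[str, ChatAdapter] = {}


def register(adapter: ChatAdapter) -> None:
    """Make an adapter reachable by its provider name."""
    REGISTRY[adapter.name] = adapter


def complete(provider: str, model: str, prompt: str) -> str:
    """Drain a provider's stream into one string."""
    chunks = REGISTRY[provider].stream(model, prompt)
    return "".join(chunk.text for chunk in chunks).strip()


register(EchoAdapter())
print(complete("echo", "any-model", "hello there world"))
```

With a contract this narrow, a new provider only has to translate its SDK's streaming call into `Chunk` objects, and the harness can time those chunks the same way for every provider.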