AI Coding Benchmarks

One-line summary: The benchmark ecosystem has specialized by what it measures — no single leaderboard ranks "the best AI coding tool" because (a) benchmarks measure models, not tools, (b) multiple benchmarks are contamination-compromised, and (c) the agent scaffolding around a model matters as much as the model itself.

The insight

Reading benchmark leaderboards as tool rankings is a category error. The major benchmarks measure models, but in practice the agent framework around a model — how it handles planning, context, tool use, feedback loops — matters as much as or more than the underlying LLM. MorphLLM's practitioner test makes this point directly: "Same model can score differently in different agents."

The benchmark landscape

From 2026-04-21-autoresearch-best-ai-coding-tools (synthesized primarily from MorphLLM's benchmark taxonomy):

Benchmark	What it measures	Known weaknesses
SWE-bench Verified	500 human-validated real GitHub issues in Python; must pass real test cases	Training-data leakage confirmed by OpenAI on every frontier model; 59.4% of hardest unsolved tasks had flawed tests; OpenAI stopped reporting Verified scores
SWE-bench Pro (SEAL)	1,865 multi-file, multi-language tasks averaging 107 lines across 4.1 files; 250-turn limit; consistent tooling across agents	Still relatively new (MorphLLM nonetheless rates it the most robust single benchmark)
Terminal-Bench	CLI workflows in Docker containers — file editing, git, test-running, multi-step debugging	Limited task coverage; biased toward DevOps-style work
Aider Polyglot	133 problems in 8 languages; tests raw model coding ability (agent-framework-neutral)	Smaller task count; less representative of real codebases
LiveCodeBench	Problems published after model training cutoffs (LeetCode / Codeforces / AtCoder)	Competitive-programming tasks don't reflect day-to-day coding (its strength is being the most contamination-resistant signal)
HumanEval / MBPP	Isolated function generation from docstrings (164 / 974 problems)	Saturated — frontier models score 90%+ without differentiation; "useful only as a baseline sanity check"

April 2026 leaders

From 2026-04-21-autoresearch-best-ai-coding-tools (via MorphLLM):

Benchmark	Top model (or agent)	Score
SWE-bench Verified	Claude Opus 4.5	~80.9%
SWE-bench Verified	Gemini 3.1 Pro	~80.6%
SWE-bench Pro (SEAL)	Claude Opus 4.5	~45.9%
SWE-bench Pro (SEAL)	Claude Sonnet 4.5	~43.6%
Terminal-Bench	GPT-5.3 Codex / Gemini 3.1 Pro	~77.3%
Terminal-Bench	Claude Code (claude-opus-4-5 era / Opus 4.6)	~72%
Aider Polyglot	Claude Opus 4.6	~85%

Tool-level figures from MorphLLM's 15-agent practitioner test:

SWE-bench Verified (agent-level, not model-level): claude-code 80.9%, google-antigravity 76.2%, codex-cli 75.2%.
Terminal-Bench 2.0: codex-cli 77.3%, claude-code 65.4%.
devin PR merge rate: 67% on defined tasks — a rare output-oriented metric (measures what the tool ships, not what it solves in isolation).
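
As a rough illustration of what an output-oriented metric computes, the sketch below derives a merge rate from a list of pull-request records. The `PullRequest` record shape and the sample data are hypothetical and are not Devin's actual data or methodology; the point is only that the denominator is everything the tool opened, not just the tasks it passed in isolation.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    """One agent-authored pull request (hypothetical record shape)."""
    task_id: str
    merged: bool


def merge_rate(prs: list[PullRequest]) -> float:
    """Fraction of opened pull requests that ended up merged."""
    if not prs:
        raise ValueError("merge rate is undefined with no pull requests")
    return sum(pr.merged for pr in prs) / len(prs)


# Illustrative data only: 3 of 4 pull requests were merged.
sample = [
    PullRequest("task-1", True),
    PullRequest("task-2", False),
    PullRequest("task-3", True),
    PullRequest("task-4", True),
]
print(f"{merge_rate(sample):.0%}")  # prints 75%
```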

Why benchmark rankings don't translate to tool rankings

MorphLLM's core empirical finding: "Same model can score differently in different agents, revealing that scaffolding architecture matters as much as underlying models."

This decouples three things that leaderboards conflate:

Raw model coding ability. Measured cleanest by Aider Polyglot (agent-neutral).
Agent scaffolding quality. Measured indirectly by comparing the same model across different agent frameworks.
Real-world productivity. Not measured by any of the above; closer to Devin's PR merge rate or the field data in ai-coding-productivity-paradox.

A tool that pairs a middling model with excellent scaffolding can beat a tool with a better model and weaker scaffolding. This is a plausible reason claude-code benchmarks well even though Opus isn't the top model on every benchmark: its scaffolding is load-bearing.
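
One way to make the scaffolding effect visible is to hold the model fixed and look at how far its score moves across agents on the same benchmark. The sketch below does that; the model names, agent names, and scores are placeholders invented for illustration, not leaderboard results, and it assumes all rows come from one benchmark run under comparable conditions.

```python
from collections import defaultdict

# Hypothetical (model, agent, score) rows. Names and numbers are placeholders,
# not results from any real leaderboard.
RUNS = [
    ("model-a", "agent-x", 62.0),
    ("model-a", "agent-y", 55.5),
    ("model-b", "agent-x", 58.0),
    ("model-b", "agent-y", 57.0),
]


def scaffolding_spread(runs):
    """Return {model: max score - min score across the agents that ran it}."""
    by_model = defaultdict(list)
    for model, _agent, score in runs:
        by_model[model].append(score)
    return {model: max(scores) - min(scores) for model, scores in by_model.items()}


spread = scaffolding_spread(RUNS)
for model, delta in sorted(spread.items()):
    print(f"{model}: {delta:.1f} points between best and worst agent")
```

A large spread for a single model suggests the agent layer accounts for much of the difference; a small spread suggests the model dominates. It is an indirect signal only, since it cannot separate scaffolding quality from run-to-run noise.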

Contamination and saturation

SWE-bench Verified contamination: OpenAI confirmed training-data leakage on every frontier model, and 59.4% of the hardest unsolved tasks had flawed tests. OpenAI has stopped reporting Verified scores; SWE-bench Pro's SEAL leaderboard is the preferred replacement.
HumanEval / MBPP saturation: Frontier models score 90%+ without meaningful differentiation. No longer useful for comparing top models.
LiveCodeBench remains contamination-resistant by sourcing problems after model training cutoffs.
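
The date-based idea behind contamination resistance can be sketched as a simple filter: keep only problems published after the model's training cutoff. This is a minimal illustration of the principle, not LiveCodeBench's actual pipeline; the `Problem` type, the problem IDs, and the cutoff date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    """A benchmark problem with its publication date (hypothetical shape)."""
    problem_id: str
    published: date


def contamination_clean(problems, training_cutoff):
    """Keep only problems published strictly after the training cutoff."""
    return [p for p in problems if p.published > training_cutoff]


# Illustrative problems and cutoff, not real benchmark entries.
problems = [
    Problem("p1", date(2024, 1, 15)),
    Problem("p2", date(2024, 9, 1)),
    Problem("p3", date(2025, 2, 10)),
]
cutoff = date(2024, 6, 30)

clean = contamination_clean(problems, cutoff)
print([p.problem_id for p in clean])  # prints ['p2', 'p3']
```

Because the eligible set depends on each model's cutoff, two models with different cutoffs are scored on different problem subsets unless the comparison is restricted to problems newer than the later cutoff.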

What benchmarks still don't measure

MorphLLM flags this explicitly: "No benchmark measures cost-per-task, latency, or real-world workflow integration including code reviews and team communication." The most important properties of a tool — what it costs to ship a feature, how long the loop takes with human review, how it plays with a team's existing process — aren't captured anywhere in the leaderboard ecosystem.
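
To make the gap concrete, here is one possible shape for a cost-per-task and latency measurement. No existing benchmark defines these this way; the `TaskRun` fields, the choice to charge failed attempts against solved tasks, and the sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One attempt at one task (hypothetical record shape)."""
    task_id: str
    solved: bool
    cost_usd: float
    wall_seconds: float


def cost_per_solved_task(runs):
    """Total spend, including failed attempts, divided by tasks solved."""
    solved = sum(r.solved for r in runs)
    if solved == 0:
        return float("inf")  # money was spent but nothing shipped
    return sum(r.cost_usd for r in runs) / solved


def mean_latency_seconds(runs):
    """Average wall-clock time per attempt."""
    if not runs:
        raise ValueError("no runs to average")
    return sum(r.wall_seconds for r in runs) / len(runs)


# Illustrative runs only.
runs = [
    TaskRun("t1", True, 0.5, 60.0),
    TaskRun("t2", False, 1.0, 120.0),
    TaskRun("t3", True, 0.5, 90.0),
]
print(cost_per_solved_task(runs))   # prints 1.0
print(mean_latency_seconds(runs))   # prints 90.0
```

Charging failed attempts to the solved tasks is a deliberate choice in this sketch: a tool that solves slightly more tasks by burning far more tokens on failures would otherwise look cheaper than it is.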

Design implications

Prefer SWE-bench Pro (SEAL) + Aider Polyglot as a pair — one tests agent-scaffolding-plus-model, the other tests pure model ability. Together they triangulate.
Prefer Terminal-Bench for tier-2 / terminal-agent selection specifically.
Prefer LiveCodeBench when you want contamination-clean reasoning signal.
Ignore HumanEval at the frontier. It's saturated.
Don't let model benchmarks pick your tool. The scaffolding-vs-model point means the best model doesn't always live in the best tool.
Watch for output-oriented metrics (like Devin's PR merge rate). These are rare but more informative than pass@k on isolated tasks.
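
For reference, pass@k on isolated tasks is commonly estimated per problem from n generated samples of which c pass the tests, using the unbiased estimator 1 - C(n - c, k) / C(n, k). The sketch below implements that formula; the function name and the example numbers are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: sampling budget.
    Computes 1 - C(n - c, k) / C(n, k), the probability that at least
    one of k samples drawn without replacement is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, giving pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 samples, 3 correct.
print(round(pass_at_k(10, 3, 1), 4))  # prints 0.3
print(round(pass_at_k(10, 3, 5), 4))  # prints 0.9167
```

The metric says how often an isolated problem can be solved within k tries. It says nothing about whether the result would survive review and be merged, which is why output-oriented metrics carry information that pass@k does not.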

Open questions

What would a rigorous "cost-per-task" benchmark look like? MorphLLM flags the absence; nobody has filled the gap.
Can scaffolding quality be benchmarked independently of the model? Decoupling would be hugely valuable for evaluating new agent frameworks.
Do benchmark scores predict field productivity at all? ai-coding-productivity-paradox suggests they do not, or at least not clearly.

Sources

2026-04-21-autoresearch-best-ai-coding-tools
