AI Coding Benchmarks
One-line summary: The benchmark ecosystem has specialized by what it measures. No single leaderboard ranks "the best AI coding tool" because (a) most benchmarks measure models, not tools, (b) several benchmarks are compromised by contamination or saturation, and (c) the agent scaffolding around a model matters as much as the model itself.
The insight
Reading benchmark leaderboards as tool rankings is a category error. The major benchmarks measure models, but in practice the agent framework around a model — how it handles planning, context, tool use, feedback loops — matters as much as or more than the underlying LLM. MorphLLM's practitioner test makes this point directly: "Same model can score differently in different agents."
The benchmark landscape
From 2026-04-21-autoresearch-best-ai-coding-tools (synthesized primarily from MorphLLM's benchmark taxonomy):
| Benchmark | What it measures | Known weaknesses |
|---|---|---|
| SWE-bench Verified | 500 human-validated real GitHub issues in Python; must pass real test cases | Training-data leakage confirmed by OpenAI on every frontier model; 59.4% of hardest unsolved tasks had flawed tests; OpenAI stopped reporting Verified scores |
| SWE-bench Pro (SEAL) | 1,865 multi-file, multi-language tasks averaging 107 lines across 4.1 files; 250-turn limit; consistent tooling across agents | Still relatively new (though MorphLLM rates it the most robust single benchmark) |
| Terminal-Bench | CLI workflows in Docker containers — file editing, git, test-running, multi-step debugging | Limited task coverage; biased toward DevOps-style work |
| Aider Polyglot | 133 problems in 8 languages; tests raw model coding ability (agent-framework-neutral) | Smaller task count; less representative of real codebases |
| LiveCodeBench | Problems published after model training cutoffs (LeetCode / Codeforces / AtCoder), which makes it the most contamination-resistant signal | Competitive-programming tasks don't reflect day-to-day coding |
| HumanEval / MBPP | Isolated function generation from docstrings (164 / 974 problems) | Saturated — frontier models score 90%+ without differentiation; "useful only as a baseline sanity check" |
April 2026 leaders
From 2026-04-21-autoresearch-best-ai-coding-tools (via MorphLLM):
| Benchmark | Leading model | Score |
|---|---|---|
| SWE-bench Verified | Claude Opus 4.5 | ~80.9% |
| SWE-bench Verified | Gemini 3.1 Pro | ~80.6% |
| SWE-bench Pro (SEAL) | Claude Opus 4.5 | ~45.9% |
| SWE-bench Pro (SEAL) | Claude Sonnet 4.5 | ~43.6% |
| Terminal-Bench | GPT-5.3 Codex / Gemini 3.1 Pro | ~77.3% |
| Terminal-Bench | Claude Code (claude-opus-4-5 era / Opus 4.6) | ~72% |
| Aider Polyglot | Claude Opus 4.6 | ~85% |
Tool-level figures from MorphLLM's 15-agent practitioner test:
- SWE-bench Verified (agent-level, not model-level): claude-code 80.9%, google-antigravity 76.2%, codex-cli 75.2%.
- Terminal-Bench 2.0: codex-cli 77.3%, claude-code 65.4%.
- devin PR merge rate: 67% on defined tasks — a rare output-oriented metric (measures what the tool ships, not what it solves in isolation).
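
The tool-level figures above already show why no single leaderboard settles the question: the leading tool changes with the benchmark. A minimal sketch, using only the scores quoted in this note (the `TOOL_SCORES`, `rank`, and `leaders` names are illustrative, not part of any benchmark's tooling):

```python
# Tool-level scores (percent) as quoted in this note from MorphLLM's
# practitioner test. No other figures are assumed.
TOOL_SCORES = {
    "SWE-bench Verified": {
        "claude-code": 80.9,
        "google-antigravity": 76.2,
        "codex-cli": 75.2,
    },
    "Terminal-Bench 2.0": {
        "codex-cli": 77.3,
        "claude-code": 65.4,
    },
}


def rank(scores):
    """Return tool names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def leaders(table):
    """Map each benchmark to its top-scoring tool."""
    return {bench: rank(scores)[0] for bench, scores in table.items()}


if __name__ == "__main__":
    for bench, scores in TOOL_SCORES.items():
        print(bench, "->", rank(scores))
```

On these figures claude-code leads SWE-bench Verified while codex-cli leads Terminal-Bench 2.0, so the ordering of the same two tools flips between benchmarks.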
Why benchmark rankings don't translate to tool rankings
MorphLLM's core empirical finding: "Same model can score differently in different agents, revealing that scaffolding architecture matters as much as underlying models."
This decouples three things that leaderboards conflate:
- Raw model coding ability. Measured most cleanly by Aider Polyglot (agent-neutral).
- Agent scaffolding quality. Measured indirectly by comparing the same model across different agent frameworks.
- Real-world productivity. Not measured by any of the above; closer to Devin's PR merge rate or the field data in ai-coding-productivity-paradox.
A tool that pairs a middling model with excellent scaffolding can beat a tool with a better model and weaker scaffolding. This helps explain why claude-code benchmarks well even though Opus isn't the top model on every benchmark: its scaffolding is load-bearing.
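
One way to make the scaffolding effect concrete is to hold the model fixed and look at how far its score moves across agents. The sketch below is an assumption about how such a comparison could be set up, not MorphLLM's method; `scaffolding_spread`, the model and agent names, and every number in `PLACEHOLDER_RESULTS` are hypothetical and exist only to exercise the function.

```python
from collections import defaultdict


def scaffolding_spread(results):
    """Given (model, agent, score) records, return {model: max - min score}
    for every model that was run inside at least two agents.

    A large spread suggests the agent framework, not the model,
    is driving the difference.
    """
    by_model = defaultdict(dict)
    for model, agent, score in results:
        by_model[model][agent] = score
    return {
        model: max(scores.values()) - min(scores.values())
        for model, scores in by_model.items()
        if len(scores) >= 2
    }


# Hypothetical placeholder data, NOT real benchmark results.
PLACEHOLDER_RESULTS = [
    ("model-x", "agent-a", 60.0),
    ("model-x", "agent-b", 52.0),
    ("model-y", "agent-a", 55.0),  # only one agent, so no spread is reported
]

if __name__ == "__main__":
    print(scaffolding_spread(PLACEHOLDER_RESULTS))
```

Models that appear in only one agent are dropped, because a single data point says nothing about scaffolding.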
Contamination and saturation
- SWE-bench Verified contamination: OpenAI confirmed training-data leakage on every frontier model, and 59.4% of the hardest unsolved tasks had flawed tests. OpenAI stopped reporting Verified scores; SWE-bench Pro's SEAL leaderboard is the preferred variant.
- HumanEval / MBPP saturation: Frontier models score 90%+ without meaningful differentiation. No longer useful for comparing top models.
- LiveCodeBench remains contamination-resistant by sourcing problems after model training cutoffs.
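
The cutoff idea behind LiveCodeBench can be sketched as a date filter: only problems published after a model's training cutoff count toward its score. This is an illustration of the principle under assumed names, not LiveCodeBench's actual code; the `Problem` type, its fields, and the dates used in the example are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    published: date  # date the problem first appeared publicly


def contamination_clean(problems, training_cutoff):
    """Keep only problems published strictly after the training cutoff.

    A problem published on the cutoff date itself is excluded, on the
    conservative assumption that it could already be in the training data.
    """
    return [p for p in problems if p.published > training_cutoff]


if __name__ == "__main__":
    # Hypothetical problems and cutoff, for illustration only.
    pool = [
        Problem("early", date(2025, 1, 1)),
        Problem("late", date(2025, 6, 1)),
    ]
    print(contamination_clean(pool, date(2025, 3, 1)))
```

Because the cutoff differs per model, the eligible problem set differs per model too, which is worth remembering when comparing scores across models.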
What benchmarks still don't measure
MorphLLM flags this explicitly: "No benchmark measures cost-per-task, latency, or real-world workflow integration including code reviews and team communication." The most important properties of a tool — what it costs to ship a feature, how long the loop takes with human review, how it plays with a team's existing process — aren't captured anywhere in the leaderboard ecosystem.
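
Until such a benchmark exists, teams can approximate these numbers from their own logs. The following is a rough sketch of one possible definition, assuming each run records whether the task was solved, what it cost, and how long it took; the `TaskRun` type, its fields, and both function names are hypothetical, and no existing benchmark is known to report these figures.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class TaskRun:
    task_id: str
    solved: bool
    cost_usd: float      # total spend for this attempt
    wall_seconds: float  # end-to-end time for this attempt


def cost_per_solved_task(runs):
    """Total spend divided by the number of solved tasks.

    Failed attempts still count toward spend, so a cheap but unreliable
    tool is not flattered. Returns None when nothing was solved.
    """
    solved = sum(1 for r in runs if r.solved)
    if solved == 0:
        return None
    return sum(r.cost_usd for r in runs) / solved


def median_latency_seconds(runs):
    """Median wall-clock time per attempt, or None for an empty log."""
    if not runs:
        return None
    return median(r.wall_seconds for r in runs)
```

Dividing by solved tasks rather than attempted tasks is a deliberate choice here: it prices in the retries a team would actually pay for.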
Design implications
- Prefer SWE-bench Pro (SEAL) + Aider Polyglot as a pair: one tests agent scaffolding plus model, the other tests pure model ability. Together they help separate model ability from scaffolding effects.
- Prefer Terminal-Bench for tier-2 / terminal-agent selection specifically.
- Prefer LiveCodeBench when you want contamination-clean reasoning signal.
- Ignore HumanEval at the frontier. It's saturated.
- Don't let model benchmarks pick your tool. The scaffolding-vs-model point means the best model doesn't always live in the best tool.
- Watch for output-oriented metrics (like Devin's PR merge rate). These are rare but more informative than pass@k on isolated tasks.
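
The recommendations above amount to a lookup from evaluation goal to benchmark. A small sketch of that mapping, where the goal keys and function names are made up for illustration and the benchmark names come from this note:

```python
# Goal keys are hypothetical labels; benchmark names are those used in this note.
BENCHMARK_FOR_GOAL = {
    "agent_plus_model": "SWE-bench Pro (SEAL)",
    "pure_model_ability": "Aider Polyglot",
    "terminal_agent": "Terminal-Bench",
    "contamination_clean_reasoning": "LiveCodeBench",
}

# Described in this note as saturated for frontier models.
SATURATED = frozenset({"HumanEval", "MBPP"})


def pick_benchmarks(goals):
    """Return the benchmarks to consult for the given goals, in order,
    without duplicates. Raises KeyError for an unknown goal."""
    picked = []
    for goal in goals:
        name = BENCHMARK_FOR_GOAL[goal]
        if name not in picked:
            picked.append(name)
    return picked


def is_useful_at_frontier(benchmark):
    """False for benchmarks this note treats as saturated."""
    return benchmark not in SATURATED


if __name__ == "__main__":
    print(pick_benchmarks(["agent_plus_model", "pure_model_ability"]))
```

None of these benchmarks picks a tool on its own; the output is a list of signals to read side by side.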
Open questions
- What would a rigorous "cost-per-task" benchmark look like? MorphLLM flags the absence; nobody has filled the gap.
- Can scaffolding quality be benchmarked independently of the model? Decoupling would be hugely valuable for evaluating new agent frameworks.
- Do benchmark scores predict field productivity at all? ai-coding-productivity-paradox suggests the link is not clear.