AI Coding Benchmarks

One-line summary: The benchmark ecosystem has specialized by what it measures. No single leaderboard ranks "the best AI coding tool" because (a) benchmarks measure models, not tools, (b) several benchmarks are compromised by contamination or saturation, and (c) the agent scaffolding around a model matters as much as the model itself.

The insight

Reading benchmark leaderboards as tool rankings is a category error. The major benchmarks measure models, but in practice the agent framework around a model — how it handles planning, context, tool use, feedback loops — matters as much as or more than the underlying LLM. MorphLLM's practitioner test makes this point directly: "Same model can score differently in different agents."

The benchmark landscape

From 2026-04-21-autoresearch-best-ai-coding-tools (synthesized primarily from MorphLLM's benchmark taxonomy):

| Benchmark | What it measures | Known weaknesses |
|---|---|---|
| SWE-bench Verified | 500 human-validated real GitHub issues in Python; must pass real test cases | Training-data leakage confirmed by OpenAI on every frontier model; 59.4% of hardest unsolved tasks had flawed tests; OpenAI stopped reporting Verified scores |
| SWE-bench Pro (SEAL) | 1,865 multi-file, multi-language tasks averaging 107 lines across 4.1 files; 250-turn limit; consistent tooling across agents | The most robust single benchmark per MorphLLM; still relatively new |
| Terminal-Bench | CLI workflows in Docker containers: file editing, git, test-running, multi-step debugging | Limited task coverage; biased toward DevOps-style work |
| Aider Polyglot | 133 problems in 8 languages; tests raw model coding ability (agent-framework-neutral) | Smaller task count; less representative of real codebases |
| LiveCodeBench | Problems published after model training cutoffs (LeetCode / Codeforces / AtCoder) | Most contamination-resistant signal; competitive-programming tasks don't reflect day-to-day coding |
| HumanEval / MBPP | Isolated function generation from docstrings (164 / 974 problems) | Saturated: frontier models score 90%+ without differentiation; "useful only as a baseline sanity check" |

April 2026 leaders

From 2026-04-21-autoresearch-best-ai-coding-tools (via MorphLLM):

| Benchmark | Top model | Score |
|---|---|---|
| SWE-bench Verified | Claude Opus 4.5 | ~80.9% |
| SWE-bench Verified | Gemini 3.1 Pro | ~80.6% |
| SWE-bench Pro (SEAL) | Claude Opus 4.5 | ~45.9% |
| SWE-bench Pro (SEAL) | Claude Sonnet 4.5 | ~43.6% |
| Terminal-Bench | GPT-5.3 Codex / Gemini 3.1 Pro | ~77.3% |
| Terminal-Bench | Claude Code (claude-opus-4-5 era / Opus 4.6) | ~72% |
| Aider Polyglot | Claude Opus 4.6 | ~85% |
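
When reading the leaders table it helps to convert percentage gaps into task counts. The sketch below does that for the two SWE-bench Verified entries, using the 500-task size from the landscape table. The standard-error helper treats tasks as independent pass/fail trials, which is a simplifying assumption of this sketch and not something MorphLLM or the benchmark authors state. The scores themselves are approximate.

```python
import math


def tasks_solved(score_pct: float, n_tasks: int) -> float:
    """Convert a percentage score into an approximate count of solved tasks."""
    return score_pct / 100.0 * n_tasks


def binomial_std_error_pct(score_pct: float, n_tasks: int) -> float:
    """Standard error of a pass rate, in percentage points.

    Assumption: each task is an independent pass/fail trial.
    """
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_tasks)


N_VERIFIED = 500  # SWE-bench Verified task count, from the landscape table

top, runner_up = 80.9, 80.6  # approximate scores from the leaders table
gap_points = top - runner_up
gap_tasks = tasks_solved(top, N_VERIFIED) - tasks_solved(runner_up, N_VERIFIED)
noise = binomial_std_error_pct(top, N_VERIFIED)

print(f"gap: {gap_points:.1f} points = {gap_tasks:.1f} tasks")
print(f"one standard error at {top}%: about {noise:.1f} points")
```

Under these assumptions the 0.3-point gap corresponds to about 1.5 tasks, well inside one standard error (about 1.8 points), so the top two Verified entries are effectively tied.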

Tool-level figures from MorphLLM's 15-agent practitioner test:

Why benchmark rankings don't translate to tool rankings

MorphLLM's core empirical finding: "Same model can score differently in different agents, revealing that scaffolding architecture matters as much as underlying models."

This decouples three things that leaderboards conflate:

  1. Raw model coding ability. Measured cleanest by Aider Polyglot (agent-neutral).
  2. Agent scaffolding quality. Measured indirectly by comparing the same model across different agent frameworks.
  3. Real-world productivity. Not measured by any of the above; closer to Devin's PR merge rate or the field data in ai-coding-productivity-paradox.

A tool that pairs a middling model with excellent scaffolding can beat a tool that pairs a better model with weaker scaffolding. This is why claude-code benchmarks well even though Opus is not the top model on every benchmark: its scaffolding is load-bearing.
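
One way to make the model-versus-scaffolding split concrete is to hold one factor fixed and look at the score spread across the other. The sketch below does this over a small grid of results. All model names, agent names, and scores are hypothetical placeholders chosen for illustration; they are not figures from MorphLLM's test.

```python
from collections import defaultdict

# Hypothetical (model, agent, score) rows; placeholders, not measured results.
RESULTS = [
    ("model-a", "agent-x", 52.0),
    ("model-a", "agent-y", 44.0),
    ("model-b", "agent-x", 47.0),
    ("model-b", "agent-y", 41.0),
]


def spread_by(rows, key_index):
    """Group scores by one column and return {key: max score - min score}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {key: max(scores) - min(scores) for key, scores in groups.items()}


# Same model, different agents: spread attributable to scaffolding.
scaffold_effect = spread_by(RESULTS, 0)
# Same agent, different models: spread attributable to the model.
model_effect = spread_by(RESULTS, 1)

print("scaffolding spread per model:", scaffold_effect)
print("model spread per agent:", model_effect)
```

In this made-up grid the weaker model on the stronger agent (model-b on agent-x, 47.0) beats the stronger model on the weaker agent (model-a on agent-y, 44.0), which is the pattern the paragraph above describes.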

Contamination and saturation

  • SWE-bench Verified contamination: OpenAI confirmed training-data leakage on every frontier model and 59.4% of the hardest unsolved tasks had flawed tests. OpenAI stopped reporting Verified scores. SWE-bench Pro's SEAL leaderboard is the preferred variant.
  • HumanEval / MBPP saturation: Frontier models score 90%+ without meaningful differentiation. No longer useful for comparing top models.
  • LiveCodeBench remains contamination-resistant by sourcing problems after model training cutoffs.
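
The date-based idea behind contamination resistance can be sketched as a simple filter: only score a model on problems published after its training cutoff. This is an illustration of the principle, not LiveCodeBench's actual tooling or API. The `Problem` type, the problem IDs, the dates, and the cutoff are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    published: date


def contamination_clean(problems, training_cutoff):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.published > training_cutoff]


# Hypothetical problem set and cutoff, for illustration only.
PROBLEMS = [
    Problem("p1", date(2025, 3, 1)),
    Problem("p2", date(2025, 9, 15)),
    Problem("p3", date(2026, 1, 10)),
]
cutoff = date(2025, 6, 30)

clean = contamination_clean(PROBLEMS, cutoff)
print([p.problem_id for p in clean])  # ['p2', 'p3']
```

The filter is only as good as the stated cutoff: a model whose training data extends past its published cutoff would still see leaked problems.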

What benchmarks still don't measure

MorphLLM flags this explicitly: "No benchmark measures cost-per-task, latency, or real-world workflow integration including code reviews and team communication." The most important properties of a tool — what it costs to ship a feature, how long the loop takes with human review, how it plays with a team's existing process — aren't captured anywhere in the leaderboard ecosystem.
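
For a sense of what a cost-per-task figure might look like, here is a minimal sketch that divides total token spend by the number of solved tasks, so failed attempts are charged to the successes. The `TaskRun` fields, the token counts, and the per-million-token prices are all assumptions made for illustration. No existing benchmark is known to report this metric in this form.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    solved: bool
    input_tokens: int
    output_tokens: int
    wall_seconds: float


def cost_per_solved_task(runs, usd_per_m_input, usd_per_m_output):
    """Total spend across all runs divided by the number of solved tasks."""
    total_usd = sum(
        r.input_tokens * usd_per_m_input + r.output_tokens * usd_per_m_output
        for r in runs
    ) / 1_000_000
    solved = sum(1 for r in runs if r.solved)
    return total_usd / solved if solved else float("inf")


def mean_latency_seconds(runs):
    """Average wall-clock time per attempted task."""
    return sum(r.wall_seconds for r in runs) / len(runs)


# Hypothetical runs and prices (USD per million tokens).
RUNS = [
    TaskRun(True, 200_000, 10_000, 120.0),
    TaskRun(False, 400_000, 20_000, 300.0),
    TaskRun(True, 100_000, 5_000, 60.0),
]
cost = cost_per_solved_task(RUNS, usd_per_m_input=3.0, usd_per_m_output=15.0)
latency = mean_latency_seconds(RUNS)
print(f"cost per solved task: ${cost:.4f}, mean latency: {latency:.0f}s")
```

Even this toy version leaves out human review time and team-process costs, which is the harder part of the gap MorphLLM points to.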

Design implications

  • Prefer SWE-bench Pro (SEAL) + Aider Polyglot as a pair — one tests agent-scaffolding-plus-model, the other tests pure model ability. Together they triangulate.
  • Prefer Terminal-Bench for tier-2 / terminal-agent selection specifically.
  • Prefer LiveCodeBench when you want contamination-clean reasoning signal.
  • Ignore HumanEval at the frontier. It's saturated.
  • Don't let model benchmarks pick your tool. The scaffolding-vs-model point means the best model doesn't always live in the best tool.
  • Watch for output-oriented metrics (like Devin's PR merge rate). These are rare but more informative than pass@k on isolated tasks.
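
For readers unfamiliar with the two metric styles contrasted in the last bullet, the sketch below puts them side by side. `pass_at_k` is the standard unbiased estimator introduced with HumanEval: given n sampled solutions of which c pass, it estimates the chance that at least one of k samples passes. `merge_rate` is a plain merged-over-opened ratio written for this note. It is an assumed definition, not a description of how Devin's figure is actually computed.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that k samples chosen from n contain no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


def merge_rate(prs_merged: int, prs_opened: int) -> float:
    """Share of opened pull requests that were merged (assumed definition)."""
    return prs_merged / prs_opened if prs_opened else 0.0


print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
print(merge_rate(prs_merged=2, prs_opened=3))
```

pass@k rewards a model for succeeding once in several tries on an isolated task, while a merge rate only counts work that a human reviewer accepted into a real codebase.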

Open questions

  • What would a rigorous "cost-per-task" benchmark look like? MorphLLM flags the absence; nobody has filled the gap.
  • Can scaffolding quality be benchmarked independently of the model? Decoupling would be hugely valuable for evaluating new agent frameworks.
  • Do benchmark scores predict field productivity at all? ai-coding-productivity-paradox suggests the link is unclear at best.
