Claude Opus 4.5

One-line summary: Anthropic's flagship model — per Boris Cherny, the best coding model he's used and, despite higher per-token cost, often cheaper end-to-end than Sonnet for agentic coding.

What it is

A Claude model (Opus tier, version 4.5). Supports a "thinking" mode that Boris recommends leaving on for all coding work.

Why it matters to this thread

It's the model Boris tells other Claude Code users to default to, and the linchpin of his claim that picking the smartest model is an economic win. That result is counterintuitive: it contradicts the naive "use cheaper tokens for cheaper work" reasoning.

Key claims from 2026-04-21-boris-claude-techniques

Best coding model Boris has used. Explicitly his default for everything, with thinking enabled.
Better tool use than smaller models; needs less steering.
Planning step-change. "Once the plan is good, the code is good. This is definitely not the case with previous models." Boris attributes much of the recent Claude Code excitement to Opus 4.5's planning strength.
Cheaper in practice than smaller models. Because it uses fewer tokens to reach a result, total cost often comes out lower even though per-token cost is higher. Evidence is experiential (not quantified in the source).
Almost always faster end-to-end than a smaller model, even though Opus is "bigger and slower" per call.
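
The cost claim above comes down to simple arithmetic: total cost is tokens used times price per token, so a model that costs more per token can still be cheaper overall if it needs proportionally fewer tokens. A minimal sketch follows. The prices and token counts are hypothetical placeholders chosen for illustration; the source gives no numbers.

```python
def run_cost(tokens_used: int, price_per_million: float) -> float:
    """Dollar cost of one task: tokens consumed times the per-million-token price."""
    return tokens_used * price_per_million / 1_000_000


# Hypothetical figures, NOT from the source: the larger model is pricier per
# token but is assumed to finish the task in a third of the tokens.
large_model_cost = run_cost(tokens_used=400_000, price_per_million=25.0)
small_model_cost = run_cost(tokens_used=1_200_000, price_per_million=15.0)

print(f"large model: ${large_model_cost:.2f}")
print(f"small model: ${small_model_cost:.2f}")
```

Under these assumed inputs the larger model comes out cheaper. Whether that holds in practice depends on the real token counts for a given task, which the source does not report.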

Open questions

What does the "smarter model = cheaper" math look like at scale? Boris asserts it as a rule of thumb; the source contains no token-count or cost numbers.
Where does the crossover point live? The claim may depend heavily on task type (plan-heavy vs. trivial edits) and workflow (parallel vs. serial).
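
One way to frame the crossover question: the larger model is cheaper whenever the smaller model needs more than (large price / small price) times as many tokens to finish the same task. The sketch below applies that threshold to two task types. The prices and per-task token multipliers are hypothetical, not measurements from the source.

```python
def break_even_token_ratio(price_large: float, price_small: float) -> float:
    """Token multiple the small model must exceed before the large model is cheaper."""
    return price_large / price_small


def cheaper_model(small_to_large_token_ratio: float,
                  price_large: float, price_small: float) -> str:
    """Return which model costs less for a task, given how many times more
    tokens the small model needs than the large one."""
    threshold = break_even_token_ratio(price_large, price_small)
    return "large" if small_to_large_token_ratio > threshold else "small"


# Hypothetical multipliers: how many times more tokens the small model is
# assumed to burn per task type. Real values would have to be measured.
assumed_ratios = {"plan-heavy refactor": 3.0, "trivial edit": 1.1}

choices = {
    task: cheaper_model(ratio, price_large=25.0, price_small=15.0)
    for task, ratio in assumed_ratios.items()
}
print(choices)
```

With these assumed inputs the plan-heavy task favors the larger model and the trivial edit favors the smaller one, which is the task-type dependence the question above anticipates.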

Benchmark position (from 2026-04-21-autoresearch-best-ai-coding-tools)

As of April 2026 (per MorphLLM's benchmark summary):

SWE-bench Verified: ~80.9% — tops the leaderboard; narrowly ahead of Gemini 3.1 Pro at ~80.6%.
SWE-bench Pro (SEAL leaderboard): ~45.9% — tops this variant too; Claude Sonnet 4.5 second at ~43.6%.
Aider Polyglot: only successor-generation figures are listed here, with Claude Opus 4.6 at ~85% and Claude Sonnet 4.6 at ~82%.

Caveat: OpenAI confirmed training-data leakage on SWE-bench Verified across every frontier model, and 59.4% of the hardest unsolved tasks had flawed tests; OpenAI stopped reporting Verified scores for this reason. See ai-coding-benchmarks. Read Opus 4.5's Verified score with this in mind; its SWE-bench Pro (SEAL leaderboard) score is more defensible.

Newer models in the same family now outscore 4.5 on some benchmarks: Claude Opus 4.6, Claude Opus 4.7, Claude Opus 4.8 (released ~late May 2026, six weeks after 4.7), and the provisional-leaderboard leader Claude Mythos Preview. Opus 4.5 remains notable because it is the specific model Boris builds his entire workflow argument around; the benchmark claims on this page shouldn't be read as "the current frontier."

Per 2026-06-01-podcast-moonshots-opus-4-8-beats-gpt-5-5-the-220b-openai-foundation (May 2026), Opus 4.8 reclaimed the coding lead from GPT-5.5: top of the Artificial Analysis Intelligence Index at 61.4 (+1.2 vs GPT-5.5), SWE-bench Pro 69.2 (vs 58.6), Humanity's Last Exam 57.9% with tools, "4× less likely to overlook bugs of its own code," and the only model to complete every case end-to-end on Anthropic's Super Agent Benchmark. The cadence is now monthly dot-releases — itself a benchmark-saturation signal. Anthropic also flagged forthcoming models to "rival Mythos."
