Claude Opus 4.5
One-line summary: Anthropic's flagship model — per Boris Cherny, the best coding model he's used and, despite higher per-token cost, often cheaper end-to-end than Sonnet for agentic coding.
What it is
A Claude model (Opus tier, version 4.5). Supports a "thinking" mode that Boris recommends leaving on for all coding work.
Why it matters to this thread
It's the model Boris tells other Claude Code users to default to, and the linchpin of his claim that picking the smartest model is an economic win. That result is counterintuitive: it contradicts the naive "use cheaper tokens for cheaper work" reasoning.
Key claims from 2026-04-21-boris-claude-techniques
- Best coding model Boris has used. Explicitly his default for everything, with thinking enabled.
- Better tool use than smaller models; needs less steering.
- Planning step-change. "Once the plan is good, the code is good. This is definitely not the case with previous models." Boris attributes much of the recent Claude Code excitement to Opus 4.5's planning strength.
- Cheaper in practice than smaller models. Because it uses fewer tokens to reach a result, total cost often comes out lower even though per-token cost is higher. Evidence is experiential (not quantified in the source).
- Almost always faster end-to-end than a smaller model, even though Opus is "bigger and slower" per call.
Open questions
- What does the "smarter model = cheaper" math look like at scale? Boris asserts it as a rule of thumb; the source contains no token-count or cost numbers.
- Where does the crossover point live? The claim may depend heavily on task type (plan-heavy vs. trivial edits) and workflow (parallel vs. serial).
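One way to frame the crossover question is as simple arithmetic: if end-to-end cost is tokens used times price per token, the larger model wins whenever its token usage, as a fraction of the smaller model's, falls below the inverse of the price ratio. The sketch below works through that condition. Every number in it is hypothetical (the source gives no prices or token counts), the function names are made up for illustration, and it assumes a single blended price per token, ignoring input/output price differences, retries, and human steering time.

```python
def total_cost(tokens_used: int, price_per_million: float) -> float:
    """End-to-end dollar cost of one task under a single blended token price."""
    return tokens_used / 1_000_000 * price_per_million


def breakeven_token_ratio(price_large: float, price_small: float) -> float:
    """Largest tokens_large / tokens_small ratio at which the large model costs no more.

    Derived from: tokens_large * price_large <= tokens_small * price_small.
    """
    return price_small / price_large


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
PRICE_LARGE = 4.0   # $ per million tokens (assumed, not a real price)
PRICE_SMALL = 1.0   # $ per million tokens (assumed, not a real price)
TOKENS_SMALL = 1_000_000  # tokens the smaller model burns to finish the task (assumed)
TOKENS_LARGE = 125_000    # tokens the larger model needs for the same task (assumed)

cost_small = total_cost(TOKENS_SMALL, PRICE_SMALL)
cost_large = total_cost(TOKENS_LARGE, PRICE_LARGE)
threshold = breakeven_token_ratio(PRICE_LARGE, PRICE_SMALL)
actual_ratio = TOKENS_LARGE / TOKENS_SMALL

print(f"small model: ${cost_small:.2f}, large model: ${cost_large:.2f}")
print(f"large model is cheaper: {actual_ratio < threshold}")
```

Under these made-up inputs the larger model is 4x the price per token but uses one eighth of the tokens, so it comes out cheaper. Whether real workloads land on that side of the threshold is exactly what the source leaves unquantified.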
Benchmark position (from 2026-04-21-autoresearch-best-ai-coding-tools)
As of April 2026 (per MorphLLM's benchmark summary):
- SWE-bench Verified: ~80.9% — tops the leaderboard; narrowly ahead of Gemini 3.1 Pro at ~80.6%.
- SWE-bench Pro (SEAL leaderboard): ~45.9% — tops this variant too; Claude Sonnet 4.5 second at ~43.6%.
- Aider Polyglot: Claude Opus 4.6 (successor) at ~85%, Claude Sonnet 4.6 at ~82%.
Caveat: OpenAI confirmed training-data leakage on SWE-bench Verified across every frontier model, and 59.4% of the hardest unsolved tasks had flawed tests; OpenAI stopped reporting Verified scores for this reason. See ai-coding-benchmarks. Read Opus 4.5's Verified score with that in mind; the SEAL leaderboard variant is more defensible.
Newer models in the same family (Claude Opus 4.6, Claude Opus 4.7, and the provisional-leaderboard leader Claude Mythos Preview) now outscore 4.5 on some benchmarks. Opus 4.5 remains notable because it's the specific model Boris builds his entire workflow argument around; the benchmark claims on this page shouldn't be read as "the current frontier."