Demo-to-product gap: the "march of nines"

Vintage notes. The framework itself is durable: "each additional nine is constant work" is a structural claim about engineering deployment, not a claim about AI capability. The historical data points (Tesla 2017–2022, the near-perfect 2014 Waymo demo) are fixed retrospective facts that don't decay. What can decay is Karpathy's forward-looking framing of 2025-10-17 that Waymo deployments are "still pretty minimal" and that Tesla is "not even near done". That snapshot is from October 2025, and self-driving capability moves on roughly the same fast clock as AI more broadly, so re-validate these forward-looking judgments against newer evidence before relying on them. Read this page as three layers: the framework, the historical data, and forward-looking judgments as of October 2025.

One-line summary: a first-person framework Karpathy drew from leading Tesla Autopilot in 2017–2022. Getting an autonomous system from a working demo (90%, the "first nine") to a deployed product (99.999%, the "fifth nine") takes a constant amount of work per additional nine. The framework explains why self-driving has taken decades despite impressive early demos, and Karpathy applies the same model to forecast coding-agent deployment timelines.

The framework

Each additional "9" of reliability — moving from 90% to 99%, from 99% to 99.9%, and so on — takes roughly the same engineering effort as the previous one. Demos hit the first nine quickly because they're handpicked best-case scenarios. Product deployment requires every subsequent nine, and each one is a long fixed-cost slog.
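
As a sketch in our own notation (not Karpathy's): write r for a system's success rate, n(r) for its number of nines, and c for the assumed constant effort per nine. The framework then reads:

```latex
n(r) = -\log_{10}(1 - r), \qquad E(r) = c \, n(r), \qquad
E(0.99999) - E(0.9) = c\,(5 - 1) = 4c
```

Each nine divides the failure rate 1 - r by ten, which is why the percentages bunch up near 100% while the effort keeps climbing.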

Karpathy's concrete data point: in roughly five years leading Autopilot at Tesla (2017–2022), the team advanced through "maybe three nines or two nines" of reliability, and by his account the work is still not done. The near-perfect Waymo demo drive Karpathy took in 2014, set against Waymo's still-limited deployment in 2025, is a second decade-long case of the same pattern.

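This is not a claim that the work doesn't pay off. It is a claim about the shape of the work: effort is linear in the number of nines, so each constant unit of effort buys only a tenfold cut in failure rate. Demos systematically mislead on remaining effort: 90% and 99.999% look close as percentages but are four nines apart.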
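
A minimal Python sketch of the constant-cost-per-nine model, assuming effort is measured in arbitrary "units per nine". The names `nines`, `remaining_effort`, and `effort_per_nine` are hypothetical illustrations, not anything from Karpathy. It shows why a 90% demo is far from a 99.999% product even though the percentages look close:

```python
import math


def nines(reliability: float) -> float:
    """Number of nines in a success rate: 0.9 -> 1, 0.999 -> 3 (illustrative definition)."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return -math.log10(1.0 - reliability)


def remaining_effort(current: float, target: float, effort_per_nine: float = 1.0) -> float:
    """Effort left to move from `current` to `target`, assuming every nine costs the same."""
    return max(0.0, nines(target) - nines(current)) * effort_per_nine


if __name__ == "__main__":
    demo, product = 0.90, 0.99999
    gap_points = (product - demo) * 100  # about 10 percentage points apart
    print(f"percentage-point gap: {gap_points:.1f}")
    print(f"nines: {nines(demo):.1f} -> {nines(product):.1f}")
    print(f"remaining effort: {remaining_effort(demo, product):.1f} units")
```

Measuring progress in nines rather than percentage points is the point of the sketch: the 90%-to-99.999% gap shows up as four equal units of work instead of ten small-looking percentage points.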

Why it matters

For autonomous-driving forecasting. The framework gives a concrete reason, separate from sensor stack, regulatory regime, or any particular operator's strategy, why deployment timelines stretch. It can be used to discount aggressive operator timelines (and corporate marketing demos) generally.
For software-agent deployment forecasting. Karpathy explicitly extends the framework to AI coding agents: software shares the safety-critical property because bugs compound into security vulnerabilities and data leaks (see cross-thread link below). This is the central mechanism behind his "Decade of Agents, not Year of Agents" timeline framing in agi-timeline-decade-of-agents.
For evaluating demos. Karpathy: "I'm very unimpressed by demos. So whenever I see demos of anything, I'm extremely unimpressed by that." This is a cheap heuristic for reviewing operator presentations and lab releases.

Evidence

Origin: Karpathy's Tesla 2017–2022 experience

andrej-karpathy in 2025-10-17-dwarkesh-patel-andrej-karpathy-summoning-ghosts: "it's a march of nines. And every single nine is a constant amount of work. So every single nine is the same amount of work. So when you get a demo and something works 90% of the time, that's just the first nine and then you need the second nine and third nine, fourth nine, fifth nine. And while I was at Tesla for was it five years or so, I think we went through maybe three nines or two nines."
andrej-karpathy in 2025-10-17-dwarkesh-patel-andrej-karpathy-summoning-ghosts (post-tenure assessment that the work isn't done): "you've talked about how you were at Tesla leading self driving from 2017 to 2022... I will almost instantly also push back on is this is not even near done."

Generalization to other capabilities-deployment domains

andrej-karpathy in 2025-10-17-dwarkesh-patel-andrej-karpathy-summoning-ghosts (general framework): "for some kinds of tasks and jobs and so on, there's a very large demo to product gap where the demo is very easy, but the product is very hard. And it's especially the case in cases like self driving where the cost of failure is too high... I'm very unimpressed by demos. So whenever I see demos of anything, I'm extremely unimpressed by that."

The 2014 Waymo demo: a decade-long data point

andrej-karpathy in 2025-10-17-dwarkesh-patel-andrej-karpathy-summoning-ghosts: "when I was joining Tesla I had a very early demo of a Waymo and it basically gave me a perfect drive in 2014 or something like that. So perfect. Waymo Drive a decade ago took us around Palo Alto and so on, because I had a friend who worked [there] and I thought it was like very close and then still took a long time." See waymo for the current state of Waymo deployment after that decade.

Implications for this thread

Discount operator timelines that lean on demos. Tesla's optimus, FSD reveal videos, Waymo's no-driver rollouts — all impressive, all first-or-second-nine evidence. The deployment-grade question is which operator is making nines-progress, not which makes the prettiest demos.
The 5-year-for-2-3-nines data point is a useful baseline. Subsequent operators or stack-changes should be assessed against this — anyone claiming to skip multiple nines in less time should be heavily discounted unless the underlying capability change is structural.
Karpathy's "Tesla more scalable" assessment is held alongside this framework, not as a contradiction to it. See tesla-fsd for the full caveats — including his own admission that he's not fully independent.
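
The baseline can be turned into a rough screening rule. A minimal sketch, assuming the baseline is read as 5 years for 2 to 3 nines (so 5/3 to 2.5 years per nine) and that a claim counts as "skipping multiple nines in less time" when it covers at least two nines faster than the fastest end of that range. `TimelineClaim`, `implied_speedup`, and `needs_heavy_discount` are hypothetical names, and whether a change is "structural" is a judgment the caller passes in, not something the code decides:

```python
from __future__ import annotations

from dataclasses import dataclass

# Baseline read from Karpathy's Tesla tenure: about 5 years for "maybe three nines or two nines".
BASELINE_YEARS = 5.0
BASELINE_NINES_RANGE = (2.0, 3.0)


@dataclass
class TimelineClaim:
    """Hypothetical operator claim: gain `nines` of reliability within `years`."""
    operator: str
    nines: float
    years: float


def baseline_years_per_nine() -> tuple[float, float]:
    """(fastest, slowest) years per nine implied by the baseline: (5/3, 5/2)."""
    low, high = BASELINE_NINES_RANGE
    return BASELINE_YEARS / high, BASELINE_YEARS / low


def implied_speedup(claim: TimelineClaim) -> float:
    """How many times faster than the fastest baseline reading the claim is (>1 means faster)."""
    if claim.nines <= 0 or claim.years <= 0:
        raise ValueError("nines and years must be positive")
    fastest, _ = baseline_years_per_nine()
    return fastest / (claim.years / claim.nines)


def needs_heavy_discount(claim: TimelineClaim, structural_change: bool = False) -> bool:
    """Flag claims of multiple nines faster than baseline, unless a structural change is argued."""
    return claim.nines >= 2 and implied_speedup(claim) > 1.0 and not structural_change


if __name__ == "__main__":
    claim = TimelineClaim("hypothetical-operator", nines=3, years=2)
    print(f"{claim.operator}: {implied_speedup(claim):.1f}x baseline, "
          f"discount heavily: {needs_heavy_discount(claim)}")
```

Comparing against the fastest end of the baseline range is the lenient choice; a stricter screen would compare against 2.5 years per nine.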

Open questions

Does the march-of-nines framework hold uniformly across all safety-critical AI deployments? Karpathy assumes yes. The AI-2027 / fast-takeoff camp implicitly assumes the rate is much faster in software because deployment surfaces are more fault-tolerant. Direct test: tracking nines-progression on deployed coding agents over 2026–2028.
How many nines did Tesla actually clear during 2017–2022? Karpathy says "maybe three nines or two nines", but the underlying telematics and intervention-rate data are closely held. Independent data is needed to validate or challenge the framework's calibration.
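
Tracking nines-progression needs a way to turn field counts (failures or interventions out of trials) into nines without overclaiming from small samples. A sketch, assuming failures are independent Bernoulli trials (a strong assumption for real driving or agent logs) and taking the upper end of the Wilson score interval as the failure rate; with zero failures this reduces to z²/(n + z²), close to the familiar "rule of three" (3/n). `wilson_upper` and `nines_at_least` are hypothetical names:

```python
import math


def wilson_upper(failures: int, trials: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for the failure rate (z=1.96: two-sided 95%)."""
    if trials <= 0 or not 0 <= failures <= trials:
        raise ValueError("need trials > 0 and 0 <= failures <= trials")
    p = failures / trials
    z2 = z * z
    centre = p + z2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials))
    return (centre + margin) / (1 + z2 / trials)


def nines_at_least(failures: int, trials: int) -> float:
    """Conservative nines: the nines implied by the upper bound on the failure rate."""
    return -math.log10(wilson_upper(failures, trials))


if __name__ == "__main__":
    for failures, trials in [(0, 1_000), (10, 1_000), (0, 400_000)]:
        print(f"{failures} failures / {trials:,} trials -> "
              f"at least {nines_at_least(failures, trials):.2f} nines")
```

The design choice is to report the nines the data can defend rather than the point estimate: zero failures in 1,000 trials supports only about 2.4 nines, and under these assumptions a fifth nine needs on the order of 400,000 clean trials.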

Related

andrej-karpathy — primary source for the framework
tesla-fsd — primary operator-domain Karpathy uses to anchor the framework
waymo — adjacent operator whose 2014–2026 trajectory provides a second data point
fsd-safety-data-validity — adjacent: separate critique of Tesla's reported safety nines
agi-timeline-decade-of-agents (artificial-intelligence thread) — Karpathy's broader timeline thesis that uses the march-of-nines as a load-bearing argument
ai-coding-agent-asymmetry-on-novel-code (artificial-intelligence thread) — adjacent: the specific failure mode that makes coding-agent nines hard to close
