RL is more information-inefficient than you think (bits-per-flop)

Vintage: May 2026. From 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch. This is Dwarkesh Patel's own framing (he wrote it up in a blog post "a few months ago"), corroborated and extended by Eric Jang in the episode. An algorithmic-economics claim about why RL is sample-hungry relative to supervised learning — durable as long as the policy-gradient paradigm holds.

One-line summary: Reinforcement learning's information yield can be decomposed as bits-per-flop = samples-per-flop × bits-per-sample, and both factors degrade. As tasks get longer-horizon, samples-per-flop falls (you must unroll a whole trajectory before getting any signal). And bits-per-sample is intrinsically low because a binary win/lose reward on a rarely-sampled correct answer teaches far less than supervised cross-entropy against the true label — most of training is spent at near-zero pass rate, learning almost nothing.
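
The decomposition in the summary can be written out as a few lines of arithmetic. This is a minimal sketch, not anything from the episode: the function names, the FLOPS_PER_TOKEN figure, and the horizon lengths are all illustrative assumptions, and it assumes one reward-bearing sample per full rollout.

```python
def samples_per_flop(flops_per_token: float, horizon_tokens: int) -> float:
    """Samples obtained per flop, assuming one sample per full rollout."""
    return 1.0 / (flops_per_token * horizon_tokens)


def bits_per_flop(flops_per_token: float, horizon_tokens: int,
                  bits_per_sample: float) -> float:
    """bits-per-flop = samples-per-flop x bits-per-sample."""
    return samples_per_flop(flops_per_token, horizon_tokens) * bits_per_sample


# Hypothetical cost of generating one token; chosen only for illustration.
FLOPS_PER_TOKEN = 2e11

# Hold bits-per-sample fixed at its 1-bit ceiling and vary only the horizon.
for horizon in (1, 1_000, 1_000_000):
    yield_per_flop = bits_per_flop(FLOPS_PER_TOKEN, horizon, bits_per_sample=1.0)
    print(f"horizon={horizon:>9,} tokens  bits/flop={yield_per_flop:.3e}")
```

Even with bits-per-sample pinned at its best case, the yield per flop in this sketch falls in direct proportion to the rollout length.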

The insight

Two compounding inefficiencies, both relative to supervised learning:

1. Samples-per-flop falls with horizon. Naive RL needs a full trajectory rolled out before any learning signal exists. As agents move from "complete the next word" to "do two days of work, then check if the project is correct," the flops spent per usable sample balloon. The information extracted per flop drops as the horizon lengthens.

2. Bits-per-sample is low, and worst exactly where you spend most of training. Dwarkesh's worked example: an untrained LLM with a 100k-token vocabulary, prompt "the sky is ___". Supervised learning hands it the label "blue", and cross-entropy teaches it −log2(p) bits, where p is the probability the model assigned to "blue" (the further off its guess, the more it learns). RL must randomly stumble onto "blue", on the order of 100,000 tries for an untrained model, before getting any signal. The bits learned from an RL sample are at most the entropy of a binary win/lose variable, which peaks at 1 bit at a 50% pass rate (a coin flip). But you spend almost all of training in the low-pass-rate regime (pass rate 1/100,000 → 1/10,000 → ...), where you learn "incredibly little" per sample and may "never even get a single success." Jang's sharpening: you sample under your own policy's distribution, so "if your policy has no chance of sampling blue, then you will never get a signal."
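
The "the sky is ___" example can be checked numerically. The sketch below assumes the untrained model is uniform over the 100k-token vocabulary (an idealization, not something stated in the episode) and follows the framing above in treating the binary entropy of the win/lose outcome as the ceiling on bits per RL sample. The function names are illustrative.

```python
import math


def binary_entropy_bits(pass_rate: float) -> float:
    """Entropy in bits of a win/lose outcome that succeeds with `pass_rate`."""
    if pass_rate <= 0.0 or pass_rate >= 1.0:
        return 0.0  # a certain outcome carries no information
    fail_rate = 1.0 - pass_rate
    return -(pass_rate * math.log2(pass_rate) + fail_rate * math.log2(fail_rate))


def supervised_bits(p_correct: float) -> float:
    """Surprisal in bits of the true label under the model: -log2(p)."""
    return -math.log2(p_correct)


VOCAB_SIZE = 100_000  # vocabulary size from the worked example

# Assumption: an untrained model spreads probability uniformly over the vocabulary.
p_blue = 1.0 / VOCAB_SIZE
print(f"supervised label 'blue': {supervised_bits(p_blue):.2f} bits")

# Ceiling on bits per RL sample as the pass rate climbs toward a coin flip.
for pass_rate in (1e-5, 1e-4, 1e-2, 0.5):
    bits = binary_entropy_bits(pass_rate)
    print(f"RL sample at pass rate {pass_rate:g}: at most {bits:.6f} bits")
```

Under these assumptions the supervised label is worth roughly 16.6 bits, while an RL sample at the same starting point is capped at a small fraction of a bit and reaches the full 1 bit only at a 50% pass rate.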

Why distillation helps. A soft label (the full logit distribution / "dark knowledge") has far higher entropy than a one-hot token, so it carries far more bits-per-sample. This is why distillation is so sample-efficient — and why AlphaGo trains the policy to imitate the MCTS visit-count distribution, not just the single MCTS-selected action.
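
A small sketch of the soft-versus-one-hot comparison. The visit counts are made up for illustration and the function names are hypothetical. The logit-gradient line relies on the standard result that, for a softmax policy trained with cross-entropy, the gradient with respect to the logits is the predicted distribution minus the target.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def logit_gradient(predicted, target):
    """Gradient of cross-entropy w.r.t. softmax logits: predicted - target."""
    return [q - t for q, t in zip(predicted, target)]


# Hypothetical MCTS visit counts over four legal moves (illustrative only).
visit_counts = [60, 25, 10, 5]
total_visits = sum(visit_counts)

# Soft target: the normalized visit-count distribution.
soft_target = [count / total_visits for count in visit_counts]
# One-hot target: only the most-visited (selected) move.
best_count = max(visit_counts)
one_hot_target = [1.0 if count == best_count else 0.0 for count in visit_counts]

print("soft target entropy   :", round(entropy_bits(soft_target), 3), "bits")
print("one-hot target entropy:", entropy_bits(one_hot_target), "bits")

# Assumption: the policy starts out uniform over the four moves.
uniform_policy = [0.25] * 4
print("gradient, soft target   :", logit_gradient(uniform_policy, soft_target))
print("gradient, one-hot target:", logit_gradient(uniform_policy, one_hot_target))
```

In this toy case the one-hot target pushes all three unselected moves down by the same amount, while the soft target also tells the policy how the unselected moves rank against each other.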

Evidence

dwarkesh-patel in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (the decomposition): "you're trying to maximize as you're learning bits per flop ... you can think of bits per flop as samples per flop times bits per sample. And what I just mentioned a second ago is that the samples per flop go down as RL becomes more and more long horizon. But at least this kind of naive RL is also terrible from a bits per sample perspective."
dwarkesh-patel in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (the supervised-vs-RL contrast): "with supervised learning ... there's a label that says, actually the term here is blue ... Now, if you were doing this through rl, you would say the model would try. The sky is, nope, that's wrong ... you would have to do this on the order of 100,000 times in order to just stumble on blue, then get some learning signal off of that."
eric-jang in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (the policy-distribution sharpening): "what's also tough here is that actually the distribution that you're sampling under is your policy's distribution. So it's like if your policy has no chance of sampling blue, then you will never get a signal."
eric-jang in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (the depressing-plot framing): "you spend all your time here, potentially never even getting a single success ... once you're here, it's not at all obvious how you get to here ... there's a sort of question of how do you initialize so you're at least not at zero, but at a non zero pass rate."
eric-jang in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (distillation / soft targets): "if you have access to the soft targets, the entropy of this distribution is far, far higher than the one hot. So there's actually way more information and bits per sample in a soft label. So that's why distillation is so effective per sample ... in AlphaGo you don't train the policy network to imitate the MCTS action, you train it to imitate the MCTS distribution."

Design implications

Initialization matters enormously. The whole game is getting to a non-zero pass rate so RL has a gradient to climb. AlphaGo's trick — initialize the value/policy from expert human games or self-play against an open-source bot — is the same move as warm-starting an LLM with supervised fine-tuning before RL. Jang's repeated practitioner advice: "always pick something that works and then get it to do something better."
Soft-label distillation is a bits-per-sample multiplier, not just a compute-saving convenience.
This is the quantitative underside of mcts-vs-llm-rl-credit-assignment: MCTS dodges the low-bits-per-sample regime by manufacturing a clean per-move target instead of relying on rare end-of-trajectory wins.
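
Why the starting pass rate dominates can be seen with a back-of-the-envelope calculation. This sketch assumes rollouts are independent with a fixed pass rate (a simplification), and the 5% warm-started pass rate is a hypothetical number chosen for illustration, not a figure from the episode.

```python
import math


def prob_any_success(pass_rate: float, num_rollouts: int) -> float:
    """Chance that at least one of `num_rollouts` independent rollouts succeeds."""
    if pass_rate >= 1.0:
        return 1.0
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(num_rollouts * math.log1p(-pass_rate))


def expected_rollouts_to_first_success(pass_rate: float) -> float:
    """Mean of a geometric distribution: 1 / p rollouts until the first win."""
    return math.inf if pass_rate <= 0.0 else 1.0 / pass_rate


COLD_START = 1e-5   # roughly the untrained 100k-vocabulary case
WARM_START = 0.05   # hypothetical pass rate after supervised warm-starting

for label, pass_rate in (("cold start", COLD_START), ("warm start", WARM_START)):
    print(
        f"{label}: P(any success in 1,000 rollouts)="
        f"{prob_any_success(pass_rate, 1_000):.4f}, "
        f"expected rollouts to first success="
        f"{expected_rollouts_to_first_success(pass_rate):,.0f}"
    )
```

Under these assumptions a batch of 1,000 cold-start rollouts has about a 1% chance of containing any success at all, whereas the warm-started policy is all but guaranteed to see one.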

Contradictions / tensions

The framing says RL is information-inefficient, yet RLVR (RL from verifiable rewards) demonstrably produces strong coding/reasoning models. Jang notes that increasing the number of games "to millions of samples" does yield "meaningful supervision," so the claim is about efficiency, not impossibility. There is no hard contradiction: RL works, it is just expensive in bits.

Open questions

Does the bits-per-flop accounting predict a hard ceiling for very-long-horizon agentic RL, or just a cost curve that scales with compute?
can-llms-choose-the-right-research-question — the research-automation question downstream of how efficiently models can be trained at all.

Related

mcts-vs-llm-rl-credit-assignment — the credit-assignment side of the same inefficiency
mcts-per-move-target-to-llm-rl-inefficiency — canonical mechanism
eric-jang — primary source
dwarkesh-patel — origin of the bits-per-flop framing
