
MCTS vs LLM RL — why the credit-assignment problem makes LLM RL inefficient

Vintage: May 2026. Sourced from 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch, a technical "build AlphaGo from scratch" walkthrough. The contrast it draws (AlphaGo's per-move supervision target vs. LLM policy-gradient over whole trajectories) is an algorithmic claim about current RL practice — re-validate against newer sources if frontier labs ship a forward-search reasoning method that closes the gap (Jang explicitly says "the jury is still out").

One-line summary: AlphaGo's Monte Carlo Tree Search supplies a strictly-better action for every single move (a low-variance supervised target), which sidesteps the credit-assignment problem. Naive LLM policy-gradient RL instead has to figure out which of 100k+ tokens in a trajectory earned the reward, so its learning signal is high-variance and sample-inefficient. Human learning is plausibly closer to the MCTS form than the LLM form.

The insight

The episode contrasts two ways of turning game/task outcomes into a training signal:

  1. MCTS (AlphaGo). On every move, the search runs many simulations from the current board, guided by the policy network (which moves to try) and truncated by the value network (a learned shortcut for "will I win from here?" that avoids playing to the end). The resulting visit-count distribution is more sharply peaked, and stronger, than the raw policy's own output. You then train the policy network to imitate that improved distribution. Crucially, MCTS is not doing credit assignment on wins: it relabels every action you took with a strictly better action. So every move in a game (even a lost game) yields a clean supervised target, and the variance of the learning signal is therefore very low.

  2. Naive LLM policy-gradient RL. You roll out a whole trajectory (a chain of 100k+ tokens, or two days of agent work), get a single scalar reward at the end (did the unit test pass?), and then up-weight all the tokens in winning trajectories. Most of those tokens were neutral or irrelevant to the win — only one critical move out of hundreds may have mattered. The reward term must be ascribed across all tokens, producing high gradient variance that "grows quadratically with T" (trajectory length). Karpathy's phrase, quoted by Dwarkesh: "sucking supervision through a straw."
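
A minimal sketch of the two training signals described above, not the episode's code. The function names and all numbers are hypothetical, chosen only for illustration. The MCTS side turns visit counts into one target distribution per move; the naive policy-gradient side hands every token the same scalar.

```python
import math

def visit_counts_to_target(visit_counts, temperature=1.0):
    """Normalize MCTS visit counts into a policy target for ONE move."""
    scaled = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(scaled)
    return [s / total for s in scaled]

def cross_entropy(target, policy_probs, eps=1e-12):
    """Per-move supervised loss: -sum_a target[a] * log policy[a]."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, policy_probs))

def reinforce_token_weights(num_tokens, final_reward, baseline=0.0):
    """Naive policy gradient: every token's log-prob gets the same weight."""
    return [final_reward - baseline] * num_tokens

# Hypothetical numbers: a 4-move position and a 6-token trajectory.
raw_policy = [0.25, 0.25, 0.25, 0.25]      # network's guess before search
visit_counts = [10, 70, 15, 5]             # where the search spent its simulations
target = visit_counts_to_target(visit_counts)
move_loss = cross_entropy(target, raw_policy)

# One reward for the whole trajectory, copied onto all 6 tokens.
token_weights = reinforce_token_weights(num_tokens=6, final_reward=1.0)

print("per-move target:", target)
print("per-move loss:", round(move_loss, 4))
print("per-token weights:", token_weights)
```

The first signal says which move should have been preferred at this position, whatever the game's outcome. The second says only that the trajectory as a whole succeeded, and cannot tell the critical token from the neutral ones.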

The reason MCTS works for Go and not (cleanly) for LLMs: Go's value function is concretely groundable (play a few moves to a Tromp-Taylor resolution and you know who won), so the search yields a target that is almost certainly better than the current policy without ever unrolling a full trajectory. LLM reasoning has no comparable per-step "this move is strictly better" oracle, and language's action space is so large that the PUCT exploration heuristic (a bonus for under-visited children that scales as √N/(1+N_a), where N is the parent's visit count and N_a the child's) breaks down: you almost never sample the same "child" token sequence twice.
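
A hedged sketch of the PUCT rule, using the AlphaGo Zero form Q + c·P·√N/(1+N_a). The constant c_puct=1.5, the dictionary layout, and both example trees are assumptions for illustration. The second example is a deliberately crude stand-in for the LLM case: a flat prior over many children with no value information.

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Q plus an exploration bonus that shrinks as the child is visited."""
    bonus = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + bonus

def select_child(children, c_puct=1.5):
    """children: list of dicts with keys 'q', 'prior', 'visits'."""
    parent_visits = sum(c["visits"] for c in children)
    scores = [
        puct_score(c["q"], c["prior"], parent_visits, c["visits"], c_puct)
        for c in children
    ]
    return max(range(len(children)), key=scores.__getitem__)

# Small branching factor: the unvisited child wins on its exploration bonus.
go_like = [
    {"q": 0.6, "prior": 0.5, "visits": 30},
    {"q": 0.1, "prior": 0.3, "visits": 2},
    {"q": 0.0, "prior": 0.2, "visits": 0},
]
go_choice = select_child(go_like)

def simulate_flat_prior(num_children, num_simulations):
    """Toy LLM-like node: flat prior, no value signal (q stays 0)."""
    children = [
        {"q": 0.0, "prior": 1.0 / num_children, "visits": 0}
        for _ in range(num_children)
    ]
    for _ in range(num_simulations):
        children[select_child(children)]["visits"] += 1
    return [c["visits"] for c in children]

visits = simulate_flat_prior(num_children=1000, num_simulations=200)
print("go-like choice:", go_choice)
print("max visits to any child:", max(visits))
```

In the toy node every simulation lands on a fresh child, so no child accumulates the repeated visits that PUCT relies on to refine its estimates. Real LLM priors are far from flat, so this illustrates only the direction of the problem, not its size.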

Evidence

  • eric-jang in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (MCTS relabels rather than credit-assigns): "what MCTS is doing is basically saying like, you play this game where you eventually lost, but on every single action I'm going to give you a strictly better action that you should take instead. It does not guarantee that you are going to win, but it does guarantee that if you take these tuples as training data ... you're going to do better."
  • eric-jang in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (the low-variance property): "this is a very, very nice idea because you have one supervision target for every single action. So the variance of your learning signal is very low compared to the alternative naive RL thing."
  • dwarkesh-patel in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (the LLM contrast, citing Karpathy): "this thing you're saying, which would be intractable and prevents you from actually getting beyond a certain level in Go, is just by default how LLMs are trained ... Karpathy, when he was on the podcast, called it like sucking supervision through a straw."
  • eric-jang in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (the fundamental difference): "in this case, this model free RL setting is trying to solve a credit assignment problem where you don't know which actions were actually good and which ones were bad. Monte Carlo tree search is doing something very fundamentally different, which is it's not trying to do credit assignment on wins, it's trying to improve the label for any given action you took."
  • eric-jang in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (why MCTS doesn't transfer to LLMs): "for something like LLM reasoning, PUCT might actually not be a good enough heuristic. It might be too greedy with local tokens ... In an LLM you're most likely never going to sample the same child more than once ... such a large number that this type of exploration heuristic is probably not the right thing to do to guide how to search down a tree."
  • eric-jang in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (why AlphaGo's RL is elegant): "the major reason is that you never have to initialize at a 0% success rate and solve the exploration problem of how to get a non zero success rate ... It's just supervised learning on a value classification as well as a policy KL minimization. So it's just a supervised learning problem on improved labels."
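
The last quote ("supervised learning on a value classification as well as a policy KL minimization") can be written out as a loss. This is one plausible reading, not a formula taken from the episode, and all symbols are assumed: s is a board position, z the one-hot game outcome, v_θ the value head's predicted outcome distribution, π the MCTS visit-count distribution, and p_θ the policy head.

```latex
\mathcal{L}(\theta)
  = \underbrace{-\sum_{k} z_k \,\log v_\theta(k \mid s)}_{\text{value classification}}
  \;+\;
  \underbrace{\sum_{a} \pi(a \mid s)\,\log \frac{\pi(a \mid s)}{p_\theta(a \mid s)}}_{\text{policy KL to the search distribution}}
```

For comparison, the published AlphaGo Zero loss uses a squared-error value term, a cross-entropy policy term, and L2 regularization. Since π does not depend on θ, minimizing the KL term and minimizing that cross-entropy term give the same gradients.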

The chain

MCTS supplies a per-move, near-optimal supervision target (groundable value function → strictly-better action each move) → this sidesteps credit assignment and gives low-variance supervised learning → whereas LLM policy-gradient RL gets one scalar reward per long trajectory and must credit-assign across all tokens → high variance, sample-inefficient learning. Canonical: mcts-per-move-target-to-llm-rl-inefficiency.
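
The last link of the chain (one scalar reward spread over many tokens gives a noisy gradient) can be checked exactly on a toy problem. Everything in this sketch is an assumption made for illustration: T binary tokens, each with its own logit and sampled with probability 0.5, and a reward of 1 only when token 0 equals 1. In this toy the noise grows linearly with T; it does not attempt to reproduce the quadratic scaling quoted above.

```python
from itertools import product

def gradient_stats(T):
    """Exact mean and total variance of the REINFORCE gradient estimate.

    With a sigmoid policy at logit 0, d log pi / d logit_t = a_t - 0.5,
    so the single-sample estimate for logit t is reward * (a_t - 0.5).
    All 2**T trajectories are enumerated, so no sampling noise is involved.
    """
    n = 2 ** T
    mean = [0.0] * T
    second_moment = [0.0] * T
    for actions in product((0, 1), repeat=T):
        reward = float(actions[0])          # only token 0 matters
        for t, a in enumerate(actions):
            g = reward * (a - 0.5)
            mean[t] += g / n
            second_moment[t] += g * g / n
    variance = [s - m * m for s, m in zip(second_moment, mean)]
    return mean, sum(variance)

results = {T: gradient_stats(T) for T in (1, 2, 4, 8)}
for T, (mean, total_var) in results.items():
    print(f"T={T}: signal on token 0 = {mean[0]}, total variance = {total_var}")
```

The expected gradient is non-zero only for token 0 and does not change with T, while every extra irrelevant token adds its own variance. That is the sense in which most of the update is spent on tokens that did not earn the reward.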

Design implications

  • The reason LLM RL does single-step RL (treating the whole token sequence as one action, T=1) rather than per-token RL is precisely to avoid the cross-multiplication of per-token reward terms that magnifies variance.
  • Forward-search-for-reasoning isn't dead: Jang expects "revisiting of this idea of forward search in the future," just not in AlphaGo's exact PUCT instantiation. Domains with rigid logical structure (mathematics) are more amenable than open-ended ones (business negotiation).
  • A groundable, hard-to-cheat verifier (Go win-rate) is the enabling ingredient — it's what makes the per-move improvement target trustworthy. See automated-ai-research-llm-capability-boundary for the same "if you can't evaluate it, you can't search it" principle applied to research automation.

Contradictions / tensions

  • LLMs arguably already do something MCTS-like implicitly — backtracking in chain-of-thought ("oh, that doesn't work, let's back up"). Jang agrees they "manage to do something that looks like real human reasoning without having to do an explicit tree structure," but holds that explicit forward search "might make a comeback." No hard contradiction; the open question is whether explicit search adds anything over learned implicit search.
  • Google reportedly tried tree structures for reasoning in 2023-2024; Jang says "the jury's still out as to whether this can ever work." Thin evidence either way.

Open questions

  • can-llms-choose-the-right-research-question — adjacent: even if the per-move-target advantage doesn't transfer, where does LLM reasoning's structure leave automated research?
  • Will explicit forward-search reasoning (MuZero-style successors) return for LLMs, and in which domains first (math before open-ended)?
