med-high conviction · active · updated 2026-05-21T00:00:00.000Z

MCTS per-move target → sidesteps credit assignment → LLM RL sample-inefficiency

AlphaGo's MCTS manufactures a strictly better action for every move (a low-variance supervised target grounded by a learnable value function), which sidesteps credit assignment. LLM policy-gradient RL gets one scalar reward per long trajectory and must assign credit across 100k+ tokens, so its learning signal is high-variance and sample-inefficient.

The chain
1
Go has a concretely groundable value function: you can resolve who wins cheaply (Tromp-Taylor scoring), and a learned value head shortcuts deep search by predicting win probability from a board state.
eric-jang in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch: "you can train a value function to look at a board and quickly resolve the game without playing out all of these trees into a very deep search depth."
2
Because the value function is groundable, MCTS on each move returns a visit-count distribution that is strictly better than the raw policy's guess. That distribution is a clean per-move supervised target, obtained without ever unrolling a full trajectory.
eric-jang in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch: "on every single action I'm going to give you a strictly better action that you should take instead. It does not guarantee that you are going to win, but it does guarantee that if you take these tuples as training data ... you're going to do better."
3
Training the policy to imitate that per-move target is just supervised learning on improved labels, so the variance of the learning signal is very low. The approach sidesteps credit assignment entirely: MCTS relabels actions rather than crediting wins.
eric-jang in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch: "this is a very, very nice idea because you have one supervision target for every single action. So the variance of your learning signal is very low compared to the alternative naive RL thing."
eric-jang in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch: "Monte Carlo tree search ... is not trying to do credit assignment on wins, it's trying to improve the label for any given action you took."
4
LLM policy-gradient RL has no comparable per-step oracle: it gets one scalar reward at the end of a 100k+ token trajectory and must ascribe credit across all tokens, so gradient variance grows with trajectory length and learning is sample-inefficient.
dwarkesh-patel in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch: "this thing you're saying, which would be intractable and prevents you from actually getting beyond a certain level in Go, is just by default how LLMs are trained ... Karpathy ... called it like sucking supervision through a straw."
eric-jang in 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch: "in this case, this model free RL setting is trying to solve a credit assignment problem where you don't know which actions were actually good and which ones were bad."
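As a rough illustration of steps 2 to 4, the toy sketch below contrasts the two learning signals. Everything in it is an assumption made for illustration and does not come from the podcast: a trajectory of T independent two-way moves, a terminal reward equal to the count of "correct" moves, the hypothetical visit counts, and the helper names `mcts_target`, `supervised_grad`, and `reinforce_grads`. It is not AlphaGo's training code, only a minimal instance of "per-move supervised target" versus "one scalar reward per trajectory".

```python
import random
import statistics


def mcts_target(visit_counts):
    """Normalize per-move visit counts into a target distribution."""
    total = sum(visit_counts)
    return [n / total for n in visit_counts]


def supervised_grad(p_move1, target_move1):
    """Log-likelihood gradient w.r.t. the logit of move 1 (two-move policy).

    Deterministic given the state: no sampling noise enters the update.
    """
    return target_move1 - p_move1


def reinforce_grads(T, n_samples, p=0.5, seed=0):
    """Score-function gradient samples for the first move's logit.

    Toy assumption: move 1 is always correct, and the only reward is the
    terminal count of correct moves over the whole trajectory.
    """
    rng = random.Random(seed)
    baseline = T * p  # expected return, subtracted to reduce variance
    grads = []
    for _ in range(n_samples):
        moves = [1 if rng.random() < p else 0 for _ in range(T)]
        ret = sum(moves)        # one scalar reward per trajectory
        score = moves[0] - p    # d/d(logit) of log pi(first move)
        grads.append((ret - baseline) * score)
    return grads


if __name__ == "__main__":
    target = mcts_target([20, 80])  # hypothetical visit counts
    print("supervised grad:", supervised_grad(0.5, target[1]))
    for T in (10, 100, 1000):
        var = statistics.pvariance(reinforce_grads(T, 2000))
        print(f"T={T:5d}  REINFORCE grad variance ~ {var:.2f}")
```

In this toy the supervised gradient is a fixed number for a given state, while the spread of the trajectory-level estimates grows roughly linearly with T even after subtracting a baseline. Real LLM training differs in many ways, so the sketch only shows the direction of the effect.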
What would falsify this
  • Step 4: A forward-search reasoning method (a MuZero-style successor) demonstrably gives LLMs a per-step strictly better target that closes the sample-efficiency gap with supervised learning.
  • Step 2: A per-token / per-step credit-assignment scheme for LLM RL matches MCTS's low-variance supervision without an explicit search tree.
Contradictions / tensions
  • LLMs arguably already do implicit MCTS-like backtracking in chain-of-thought; Jang agrees they 'do something that looks like real human reasoning without ... an explicit tree structure' but expects explicit forward search may 'make a comeback' — so the gap may be narrower than the chain implies.
  • Google reportedly tried tree structures for LLM reasoning in 2023-2024; Jang says 'the jury's still out as to whether this can ever work.'
Implications
  • Explains why current LLM RL treats the whole token sequence as a single action (T=1) rather than per-token: doing per-token RL cross-multiplies reward terms and magnifies variance.
  • Initialization to a non-zero pass rate (expert games for Go; SFT for LLMs) is essential because RL learns almost nothing in the low-pass-rate regime — see rl-information-inefficiency.
  • A groundable, hard-to-cheat verifier is the enabling ingredient; where one can't be constructed (open-ended research direction), the human stays in the loop — see automated-ai-research-llm-capability-boundary.
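The low-pass-rate point can be made concrete with a small calculation. The sketch below assumes a setup the note does not specify: a binary pass/fail reward and a group-mean baseline, so a group of rollouts whose outcomes are all identical contributes zero advantage. The function name `prob_group_has_signal` and the example numbers are hypothetical.

```python
def prob_group_has_signal(pass_rate, group_size):
    """Chance that a group of rollouts mixes passes and fails.

    Assumes a binary reward and a group-mean baseline, so a group that is
    all-fail or all-pass yields zero advantage for every rollout.
    """
    all_fail = (1 - pass_rate) ** group_size
    all_pass = pass_rate ** group_size
    return 1 - all_fail - all_pass


if __name__ == "__main__":
    for pass_rate in (1e-4, 1e-2, 0.3):
        share = prob_group_has_signal(pass_rate, group_size=16)
        print(f"pass rate {pass_rate:g}: {share:.4f} of groups carry signal")
```

Under these assumptions, nearly every group is all-fail when the pass rate is tiny, which is one way to read why an expert-game or SFT initialization matters before RL begins.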
Companies
Concepts
Open questions