entity · person · artificial-intelligence

Eric Jang

AI researcher (on sabbatical, 2026) · Former VP of AI at 1X Technologies · Former Senior Research Scientist, Google DeepMind Robotics

aka Eric Zhang

Quotes

what MCTS is doing is basically saying like, you play this game where you eventually lost, but on every single action I'm going to give you a strictly better action that you should take instead. It does not guarantee that you are going to win, but it does guarantee that if you take these tuples as training data ... you're going to do better.

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch· 2026-05-15#mcts-vs-llm-rl-credit-assignment#mcts-per-move-target-to-llm-rl-inefficiency

this is a very, very nice idea because you have one supervision target for every single action. So the variance of your learning signal is very low compared to the alternative naive RL thing.

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch· 2026-05-15#mcts-vs-llm-rl-credit-assignment#mcts-per-move-target-to-llm-rl-inefficiency

in this case, this model free RL setting is trying to solve a credit assignment problem where you don't know which actions were actually good and which ones were bad. Monte Carlo tree search is doing something very fundamentally different, which is it's not trying to do credit assignment on wins, it's trying to improve the label for any given action you took.

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch· 2026-05-15#mcts-vs-llm-rl-credit-assignment#mcts-per-move-target-to-llm-rl-inefficiency

for something like LLM reasoning, PUCT might actually not be a good enough heuristic. It might be too greedy with local tokens ... In an LLM you're most likely never going to sample the same child more than once.

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch· 2026-05-15#mcts-vs-llm-rl-credit-assignment
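
For context on the quote above, PUCT is the child-selection rule used in AlphaGo-style MCTS. The sketch below assumes the commonly cited AlphaGo Zero form, score = Q + c_puct * P * sqrt(total visits) / (1 + N); the function names and the `c_puct` value are illustrative and do not come from the source.

```python
import math

def puct_scores(priors, visit_counts, q_values, c_puct=1.5):
    """Score each child as Q + c_puct * P * sqrt(sum of visits) / (1 + N).

    Assumed AlphaGo Zero-style rule; c_puct=1.5 is an arbitrary illustrative value.
    """
    total_visits = sum(visit_counts)
    return [
        q + c_puct * p * math.sqrt(total_visits) / (1 + n)
        for p, n, q in zip(priors, visit_counts, q_values)
    ]

def select_child(priors, visit_counts, q_values, c_puct=1.5):
    """Return the index of the child with the highest PUCT score."""
    scores = puct_scores(priors, visit_counts, q_values, c_puct)
    return max(range(len(scores)), key=scores.__getitem__)

priors = [0.7, 0.2, 0.1]
no_value_estimates = [0.0, 0.0, 0.0]

# After a single visit, the high-prior child still wins the selection.
early_choice = select_child(priors, [1, 0, 0], no_value_estimates)

# Only after repeated visits does the exploration bonus shift to a sibling.
later_choice = select_child(priors, [10, 0, 0], no_value_estimates)
```

In this toy setting the exploration term only redirects search once a child has been visited several times, which is one way to read the concern that a node that is almost never revisited leaves selection dominated by the prior.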

the major reason is that you never have to initialize at a 0% success rate and solve the exploration problem of how to get a non zero success rate ... It's just supervised learning on a value classification as well as a policy KL minimization. So it's just a supervised learning problem on improved labels.

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch· 2026-05-15#mcts-vs-llm-rl-credit-assignment#rl-information-inefficiency
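
The "supervised learning on improved labels" framing above can be sketched as a two-term loss: a KL term pulling the policy toward the MCTS visit distribution, plus a cross-entropy term on the game outcome. This is a minimal sketch under assumptions: the function names, the equal weighting of the two terms, and treating the outcome as a class index are illustrative choices, not details taken from the source.

```python
import math

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

def value_cross_entropy(outcome_index, predicted_probs, eps=1e-12):
    """Cross-entropy of the predicted outcome distribution against the observed outcome class."""
    return -math.log(predicted_probs[outcome_index] + eps)

def alphago_style_loss(mcts_visit_counts, policy_probs, outcome_index, value_probs):
    """Policy KL against normalized MCTS visit counts plus value classification loss.

    Equal weighting of the two terms is an assumption made for illustration.
    """
    total = sum(mcts_visit_counts)
    mcts_target = [n / total for n in mcts_visit_counts]
    return kl_divergence(mcts_target, policy_probs) + value_cross_entropy(
        outcome_index, value_probs
    )

# Hypothetical numbers: three legal moves, outcome classes (loss, win).
loss = alphago_style_loss(
    mcts_visit_counts=[60, 30, 10],
    policy_probs=[0.4, 0.4, 0.2],
    outcome_index=1,
    value_probs=[0.5, 0.5],
)
```

Every position contributes a dense target for both heads, so no exploration is needed to obtain a first nonzero learning signal.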

if you have access to the soft targets, the entropy of this distribution is far, far higher than the one hot ... that's why distillation is so effective per sample ... in AlphaGo you don't train the policy network to imitate the MCTS action, you train it to imitate the MCTS distribution.
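
As a small worked illustration of the entropy comparison above, the sketch below computes the Shannon entropy of a one-hot target and of a soft target over the same four actions. The distributions are made-up numbers chosen for easy arithmetic, and entropy of the target is used here only as a rough proxy for how much a single label can convey.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# One-hot label: only the chosen action is marked (e.g. imitating the MCTS action).
one_hot_target = [1.0, 0.0, 0.0, 0.0]

# Soft label: a full distribution over actions (e.g. imitating MCTS visit frequencies).
soft_target = [0.5, 0.25, 0.125, 0.125]

one_hot_entropy = entropy_bits(one_hot_target)  # 0 bits
soft_entropy = entropy_bits(soft_target)        # 1.75 bits
```

The one-hot label says nothing about how the remaining actions rank against each other, while the soft label carries that ranking in every sample.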

what's also tough here is that actually the distribution that you're sampling under is your policy's distribution. So it's like if your policy has no chance of sampling blue, then you will never get a signal.
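
The on-policy sampling problem described above can be shown with a toy sampler: if the policy assigns zero probability to the rewarded action, no rollout ever contains it, so a reward-weighted update has nothing to learn from. The action names, probabilities, and function name below are hypothetical.

```python
import random

def count_rewarded_samples(policy, rewarded_action, n_samples=10_000, seed=0):
    """Sample actions from the policy and count how often the rewarded one appears."""
    rng = random.Random(seed)
    actions = list(policy)
    weights = [policy[action] for action in actions]
    draws = rng.choices(actions, weights=weights, k=n_samples)
    return sum(1 for action in draws if action == rewarded_action)

# Policy that can never sample "blue": the rewarded action is never observed.
blind_policy = {"red": 0.6, "green": 0.4, "blue": 0.0}
blind_hits = count_rewarded_samples(blind_policy, "blue")

# Policy with a small but nonzero chance of "blue": some rollouts carry signal.
open_policy = {"red": 0.55, "green": 0.4, "blue": 0.05}
open_hits = count_rewarded_samples(open_policy, "blue")
```

With `blind_policy` the count stays at zero regardless of how many samples are drawn, whereas a supervised target over all actions would still provide a gradient toward "blue".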

the models can do a very good job of doing hyperparameter optimization ... it can search a much more open ended set of problems ... an almost like grad student like ability to just grind a performance metric ... it is also fantastic now at basically executing any experiment.

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch· 2026-05-15#automated-ai-research-llm-capability-boundary

current closed models that we can access ... don't seem to be that great at selecting what the next experiment should be in a given track. And they don't seem to be able to kind of step back and do the lateral thinking ... I had to catch infra bugs myself.

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch· 2026-05-15#automated-ai-research-llm-capability-boundary#can-llms-choose-the-right-research-question

a 10 layer neural network ... 10 steps of neural network parallelized distributed representation thinking is able to amortize and approximate to a very, very high fidelity, a nearly intractable search problem ... it actually makes me wonder if our understanding of problems like P equals NP ... are incomplete.

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch· 2026-05-15#mcts-vs-llm-rl-credit-assignment
Notes

Eric Jang

One-line summary: AI researcher (robotics/RL); built a from-scratch AlphaGo reimplementation on sabbatical. Tracked here for the technical contrast between AlphaGo's MCTS supervision and LLM policy-gradient RL, and for a first-person capability-boundary read on automated AI research.

What they're known for

Brief factual context — fill in.

Why they matter to artificial-intelligence

Why this person's claims are tracked here — fill in.

Said

Speaker-attributed claims extracted from diarized sources. Each bullet mirrors one entry in the quotes: frontmatter field; keep them in sync.

Sources

Related

Cross-links — fill in.

Referenced by