entity · person · artificial-intelligence

Eric Jang

AI researcher (on sabbatical, 2026) · Former VP of AI at 1X Technologies · Former Senior Research Scientist, Google DeepMind Robotics

aka Eric Zhang

Quotes

“what MCTS is doing is basically saying like, you play this game where you eventually lost, but on every single action I'm going to give you a strictly better action that you should take instead. It does not guarantee that you are going to win, but it does guarantee that if you take these tuples as training data ... you're going to do better.”

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch · 2026-05-15 · #mcts-vs-llm-rl-credit-assignment #mcts-per-move-target-to-llm-rl-inefficiency

“this is a very, very nice idea because you have one supervision target for every single action. So the variance of your learning signal is very low compared to the alternative naive RL thing.”

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch · 2026-05-15 · #mcts-vs-llm-rl-credit-assignment #mcts-per-move-target-to-llm-rl-inefficiency

“in this case, this model free RL setting is trying to solve a credit assignment problem where you don't know which actions were actually good and which ones were bad. Monte Carlo tree search is doing something very fundamentally different, which is it's not trying to do credit assignment on wins, it's trying to improve the label for any given action you took.”

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch · 2026-05-15 · #mcts-vs-llm-rl-credit-assignment #mcts-per-move-target-to-llm-rl-inefficiency

“for something like LLM reasoning, PUCT might actually not be a good enough heuristic. It might be too greedy with local tokens ... In an LLM you're most likely never going to sample the same child more than once.”

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch · 2026-05-15 · #mcts-vs-llm-rl-credit-assignment

“the major reason is that you never have to initialize at a 0% success rate and solve the exploration problem of how to get a non zero success rate ... It's just supervised learning on a value classification as well as a policy KL minimization. So it's just a supervised learning problem on improved labels.”

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch · 2026-05-15 · #mcts-vs-llm-rl-credit-assignment #rl-information-inefficiency

“if you have access to the soft targets, the entropy of this distribution is far, far higher than the one hot ... that's why distillation is so effective per sample ... in AlphaGo you don't train the policy network to imitate the MCTS action, you train it to imitate the MCTS distribution.”

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch · 2026-05-15 · #rl-information-inefficiency

“what's also tough here is that actually the distribution that you're sampling under is your policy's distribution. So it's like if your policy has no chance of sampling blue, then you will never get a signal.”

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch · 2026-05-15 · #rl-information-inefficiency

“the models can do a very good job of doing hyperparameter optimization ... it can search a much more open ended set of problems ... an almost like grad student like ability to just grind a performance metric ... it is also fantastic now at basically executing any experiment.”

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch · 2026-05-15 · #automated-ai-research-llm-capability-boundary

“current closed models that we can access ... don't seem to be that great at selecting what the next experiment should be in a given track. And they don't seem to be able to kind of step back and do the lateral thinking ... I had to catch infra bugs myself.”

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch · 2026-05-15 · #automated-ai-research-llm-capability-boundary #can-llms-choose-the-right-research-question

“a 10 layer neural network ... 10 steps of neural network parallelized distributed representation thinking is able to amortize and approximate to a very, very high fidelity, a nearly intractable search problem ... it actually makes me wonder if our understanding of problems like P equals NP ... are incomplete.”

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch · 2026-05-15 · #mcts-vs-llm-rl-credit-assignment
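
To make the soft-target quote above concrete, here is a minimal Python sketch that compares the entropy of a normalized MCTS visit distribution with the entropy of its one-hot argmax, and computes the KL term a policy would minimize against the soft target. The visit counts and the uniform starting policy are invented for illustration, and target entropy is used only as a rough proxy for how much one training sample can convey. Nothing here is taken from the episode or from any AlphaGo codebase.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def visit_distribution(visit_counts):
    """Normalize search visit counts into a probability distribution."""
    total = sum(visit_counts)
    return [n / total for n in visit_counts]

def kl_bits(target, predicted):
    """KL(target || predicted) in bits, the quantity a policy fit would minimize."""
    return sum(t * math.log2(t / q) for t, q in zip(target, predicted) if t > 0)

# Hypothetical visit counts for four candidate moves after one search.
visits = [50, 25, 15, 10]
soft_target = visit_distribution(visits)

# One-hot target: keep only the most-visited move.
best = max(range(len(visits)), key=visits.__getitem__)
one_hot_target = [1.0 if i == best else 0.0 for i in range(len(visits))]

soft_entropy = entropy_bits(soft_target)
hard_entropy = entropy_bits(one_hot_target)

# Assumed starting point: a policy that is uniform over the four moves.
uniform_policy = [0.25] * 4
kl_to_soft = kl_bits(soft_target, uniform_policy)

print(f"soft target entropy: {soft_entropy:.3f} bits")
print(f"one-hot target entropy: {hard_entropy:.3f} bits")
print(f"KL(soft || uniform): {kl_to_soft:.3f} bits")
```

With these made-up counts the one-hot target has zero entropy, while the soft target sits between one and two bits: it also tells the policy how the remaining moves rank against each other, which is the gap the quote points at.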

Notes

Eric Jang

One-line summary: AI researcher (robotics/RL); built a from-scratch AlphaGo reimplementation on sabbatical. Tracked here for the technical contrast between AlphaGo's MCTS supervision and LLM policy-gradient RL, and for a first-person capability-boundary read on automated AI research.

What they're known for

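Robotics and reinforcement-learning researcher. Former VP of AI at 1X Technologies and former Senior Research Scientist at Google DeepMind Robotics. On sabbatical in 2026, during which he built a from-scratch AlphaGo reimplementation.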

Why they matter to artificial-intelligence

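His claims are tracked here for two reasons: the technical contrast he draws between AlphaGo's per-move MCTS supervision and policy-gradient RL for LLMs, and his first-person account of which parts of AI research current LLMs can and cannot automate.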

Said

Speaker-attributed claims extracted from diarized sources. Each bullet mirrors one entry in quotes: frontmatter — keep them in sync.

On mcts-vs-llm-rl-credit-assignment, mcts-per-move-target-to-llm-rl-inefficiency:

"what MCTS is doing is basically saying like, you play this game where you eventually lost, but on every single action I'm going to give you a strictly better action that you should take instead. It does not guarantee that you are going to win, but it does guarantee that if you take these tuples as training data ... you're going to do better." — 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (2026-05-15)
On mcts-vs-llm-rl-credit-assignment, mcts-per-move-target-to-llm-rl-inefficiency:

"this is a very, very nice idea because you have one supervision target for every single action. So the variance of your learning signal is very low compared to the alternative naive RL thing." — 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (2026-05-15)
On mcts-vs-llm-rl-credit-assignment, mcts-per-move-target-to-llm-rl-inefficiency:

"in this case, this model free RL setting is trying to solve a credit assignment problem where you don't know which actions were actually good and which ones were bad. Monte Carlo tree search is doing something very fundamentally different, which is it's not trying to do credit assignment on wins, it's trying to improve the label for any given action you took." — 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (2026-05-15)
On mcts-vs-llm-rl-credit-assignment:

"for something like LLM reasoning, PUCT might actually not be a good enough heuristic. It might be too greedy with local tokens ... In an LLM you're most likely never going to sample the same child more than once." — 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (2026-05-15)
On mcts-vs-llm-rl-credit-assignment, rl-information-inefficiency:

"the major reason is that you never have to initialize at a 0% success rate and solve the exploration problem of how to get a non zero success rate ... It's just supervised learning on a value classification as well as a policy KL minimization. So it's just a supervised learning problem on improved labels." — 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (2026-05-15)
On rl-information-inefficiency:

"if you have access to the soft targets, the entropy of this distribution is far, far higher than the one hot ... that's why distillation is so effective per sample ... in AlphaGo you don't train the policy network to imitate the MCTS action, you train it to imitate the MCTS distribution." — 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (2026-05-15)
On rl-information-inefficiency:

"what's also tough here is that actually the distribution that you're sampling under is your policy's distribution. So it's like if your policy has no chance of sampling blue, then you will never get a signal." — 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (2026-05-15)
On automated-ai-research-llm-capability-boundary:

"the models can do a very good job of doing hyperparameter optimization ... it can search a much more open ended set of problems ... an almost like grad student like ability to just grind a performance metric ... it is also fantastic now at basically executing any experiment." — 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (2026-05-15)
On automated-ai-research-llm-capability-boundary, can-llms-choose-the-right-research-question:

"current closed models that we can access ... don't seem to be that great at selecting what the next experiment should be in a given track. And they don't seem to be able to kind of step back and do the lateral thinking ... I had to catch infra bugs myself." — 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (2026-05-15)
On mcts-vs-llm-rl-credit-assignment:

"a 10 layer neural network ... 10 steps of neural network parallelized distributed representation thinking is able to amortize and approximate to a very, very high fidelity, a nearly intractable search problem ... it actually makes me wonder if our understanding of problems like P equals NP ... are incomplete." — 2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch (2026-05-15)
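
The selection heuristic named in the LLM-reasoning quote above is read here as PUCT, the rule AlphaGo-style search uses to decide which child to descend into. The sketch below follows the commonly cited form, Q(s, a) + c_puct · P(s, a) · sqrt(Σ_b N(s, b)) / (1 + N(s, a)). The `Child` class, the value of `c_puct`, and the example numbers are assumptions made for illustration, not details from the episode.

```python
import math
from dataclasses import dataclass

@dataclass
class Child:
    prior: float            # P(s, a): policy-network prior for this move
    visits: int = 0         # N(s, a): how often search has tried this move
    value_sum: float = 0.0  # sum of values backed up through this move

    @property
    def q(self) -> float:
        """Mean backed-up value Q(s, a); zero for an unvisited child."""
        return self.value_sum / self.visits if self.visits else 0.0

def puct_score(child: Child, parent_visits: int, c_puct: float = 1.5) -> float:
    """Exploitation term Q plus a prior-weighted exploration bonus."""
    bonus = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return child.q + bonus

def select(children: list, c_puct: float = 1.5) -> int:
    """Return the index of the child with the highest PUCT score."""
    parent_visits = sum(c.visits for c in children)
    return max(
        range(len(children)),
        key=lambda i: puct_score(children[i], parent_visits, c_puct),
    )

# Hypothetical node with three candidate moves.
children = [
    Child(prior=0.6, visits=10, value_sum=4.0),
    Child(prior=0.3, visits=2, value_sum=1.2),
    Child(prior=0.1, visits=0, value_sum=0.0),
]
chosen = select(children)
print("selected child:", chosen)
```

The bonus shrinks as a child's visit count grows, which only matters when the same child is revisited many times. That is the regime the quote doubts applies to token-level LLM search, where a child is unlikely to be sampled twice.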

Sources

2026-05-15-dwarkesh-podcast-eric-jang-building-alphago-from-scratch

Cross-links — fill in.

Referenced by

Mechanisms

MCTS per-move target → sidesteps credit assignment → LLM RL sample-inefficiency
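
A minimal Python sketch of the mechanism above, under loud assumptions: `toy_search` is a hypothetical lookup table standing in for a real MCTS, and the states, actions, and numbers are invented. It only shows the shape of the training data in each setting: outcome-based RL stamps one scalar on every move of a game, while search-based supervision attaches a separate improved target to each visited state.

```python
from typing import Callable, Dict, List, Tuple

State = str
Action = str
Trajectory = List[Tuple[State, Action]]

def outcome_labels(trajectory: Trajectory, outcome: float):
    """Model-free view: every (state, action) pair inherits the one game result."""
    return [(state, action, outcome) for state, action in trajectory]

def search_labels(
    trajectory: Trajectory,
    search: Callable[[State], Dict[Action, float]],
):
    """Search view: each visited state gets its own improved move distribution."""
    return [(state, search(state)) for state, _ in trajectory]

def toy_search(state: State) -> Dict[Action, float]:
    """Hypothetical stand-in for MCTS: a fixed improved distribution per state."""
    table = {
        "s0": {"a": 0.7, "b": 0.3},
        "s1": {"a": 0.2, "b": 0.8},
        "s2": {"a": 0.5, "b": 0.5},
    }
    return table[state]

# A three-move game that ended in a loss.
trajectory = [("s0", "b"), ("s1", "a"), ("s2", "a")]
lost_game = outcome_labels(trajectory, outcome=-1.0)
improved = search_labels(trajectory, toy_search)

for row in lost_game:
    print("outcome label:", row)
for row in improved:
    print("search label:", row)
```

In the first list all three moves carry the same -1.0, so the learner still has to work out which move was at fault. In the second list each state has a distinct target, so fitting the policy becomes ordinary supervised learning.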

Concepts

MCTS vs LLM RL: why the credit-assignment problem makes LLM RL inefficient

RL is more information-inefficient than you think (bits-per-flop)

What AI research LLMs can and can't automate (the capability boundary)

Questions

Can LLMs choose the right research question to investigate (not just run the experiment)?
