In this video, I break down Proximal Policy Optimization (PPO) from first principles, without assuming prior knowledge of Reinforcement Learning. By the end, you’ll understand the core RL building blocks that led to PPO, including:
🔵 Policy Gradient
🔵 Actor-Critic Models
🔵 The Value Function
🔵 The Generalized Advantage Estimate
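To give a quick taste of that last building block, here's a minimal sketch of how GAE is typically computed over a single trajectory (assumes no episode-termination masking; the function name is just illustrative, not from the video):

import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # rewards: (T,) per-step rewards r_t
    # values:  (T+1,) critic estimates V(s_t), including the final state
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Work backwards: A_t = delta_t + gamma * lambda * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages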
In the LLM world, PPO was used to train reasoning models like OpenAI's o1/o3, and presumably Claude 3.7, Grok 3, etc. It’s the backbone of Reinforcement Learning from Human Feedback (RLHF), which helps align AI models with human preferences, and of Reinforcement Learning with Verifiable Rewards (RLVR), which gives LLMs reasoning abilities.
Papers:
- PPO paper: https://arxiv.org/pdf/1707.06347
- GAE paper: https://arxiv.org/pdf/1506.02438
- TRPO paper: https://arxiv.org/pdf/1502.05477
Well-written blogposts:
- https://danieltakeshi.github.io/2017/04/02/notes-on-the-generalized-advantage-estimation-paper/
- https://huggingface.co/blog/NormalUhr/rlhf-pipeline
- https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
Implementations:
- (Original) OpenAI Baselines: https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2
- Hugging Face: https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py
- Hugging Face docs: https://huggingface.co/docs/trl/main/en/ppo_trainer
Mother of all RL books (Sutton & Barto):
http://incompleteideas.net/book/RLbook2020.pdf
00:00 Intro
01:21 RL for LLMs
05:53 Policy Gradient
09:23 The Value Function
12:14 Generalized Advantage Estimate
17:17 End-to-end Training Algorithm
18:23 Importance Sampling
20:02 PPO Clipping
21:36 Outro
Special thanks to Anish Tondwalkar for discussing some of these concepts with me.
Note: At 21:10, A_t should have been inside the min. Thanks @t.w.7065 for catching this.
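For reference, the clipped surrogate objective from the PPO paper, with A_t inside the min:
L^CLIP(θ) = E_t[ min( r_t(θ) A_t, clip(r_t(θ), 1−ε, 1+ε) A_t ) ], where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the importance-sampling ratio.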