In this video, I break down DeepSeek's Group Relative Policy Optimization (GRPO) from first principles, without assuming prior knowledge of Reinforcement Learning. By the end, you’ll understand the core RL building blocks that led to GRPO, including:
🔵 Policy Gradient Methods
🔵 The REINFORCE Algorithm
🔵 Actor-Critic Models
🔵 PPO (Proximal Policy Optimization)
🔵 GRPO (Group Relative Policy Optimization)
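To make the core GRPO idea concrete before you watch: instead of training a separate value network (the "critic") to estimate a baseline, GRPO samples a group of completions per prompt and normalizes each completion's reward against the group's mean and standard deviation. Below is a minimal, illustrative sketch of that group-relative advantage computation; the function name, shapes, and epsilon term are my own choices for clarity, not taken from the DeepSeekMath paper.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize each reward within its group.

    `rewards` has shape (num_prompts, group_size): for every prompt we sample
    `group_size` completions and score each one (e.g. with a verifier or
    reward model). No learned critic is needed for the baseline.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)  # epsilon avoids division by zero

# Example: 8 sampled answers to one math prompt, reward 1.0 if correct.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]])
print(grpo_advantages(rewards))  # correct answers get a positive advantage
```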
Papers:
GRPO paper (DeepSeekMath): https://arxiv.org/pdf/2402.03300
DeepSeek-R1 paper: https://arxiv.org/pdf/2501.12948
PPO paper: https://arxiv.org/pdf/1707.06347
GAE paper: https://arxiv.org/pdf/1506.02438
TRPO paper: https://arxiv.org/pdf/1502.05477
Mother of all RL books (Sutton & Barto):
http://incompleteideas.net/book/RLboo...
00:00 Intro
00:53 Where GRPO fits within the LLM training pipeline
04:17 RL fundamentals for LLMs
08:25 Policy Gradient Methods & REINFORCE
11:58 Reward baselines & Actor-Critic Methods
14:10 GRPO
21:42 Wrap-up: PPO vs GRPO
22:32 Research papers are like Instagram