In this video, I break down DeepSeek's Group Relative Policy Optimization (GRPO) from first principles, without assuming prior knowledge of Reinforcement Learning. By the end, you’ll understand the core RL building blocks that led to GRPO, including:
🔵 Policy Gradient Methods
🔵 The REINFORCE Algorithm
🔵 Actor-Critic Models
🔵 PPO (Proximal Policy Optimization)
🔵 GRPO (Group Relative Policy Optimization)
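To make the core GRPO idea concrete before you watch: instead of training a separate value network (the "critic") to estimate a baseline, GRPO samples a group of completions per prompt and normalizes each completion's reward against the group's mean and standard deviation. Below is a minimal, illustrative sketch of that group-relative advantage computation; the function name, shapes, and epsilon term are my own choices for clarity, not taken from the DeepSeekMath paper.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize each reward within its group.

    `rewards` has shape (num_prompts, group_size): for every prompt we sample
    `group_size` completions and score each one (e.g. with a verifier or
    reward model). No learned critic is needed for the baseline.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)  # epsilon avoids division by zero

# Example: 8 sampled answers to one math prompt, reward 1.0 if correct.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]])
print(grpo_advantages(rewards))  # correct answers get a positive advantage
```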
Papers:
GRPO paper (DeepSeekMath): https://arxiv.org/pdf/2402.03300
DeepSeek-R1 paper: https://arxiv.org/pdf/2501.12948
PPO paper: https://arxiv.org/pdf/1707.06347
GAE paper: https://arxiv.org/pdf/1506.02438
TRPO paper: https://arxiv.org/pdf/1502.05477
Mother of all RL books (Sutton & Barto):
http://incompleteideas.net/book/RLboo...
00:00 Intro
00:53 Where GRPO fits within the LLM training pipeline
04:17 RL fundamentals for LLMs
08:25 Policy Gradient Methods & REINFORCE
11:58 Reward baselines & Actor-Critic Methods
14:10 GRPO
21:42 Wrap-up: PPO vs GRPO
22:32 Research papers are like Instagram