Okay okay, I spent my weekend gooning around learning GRPO / RL for LLM math. Here are the goods for you.
Essentially, this is me yapping through a recap of the smaller details: how GRPO is implemented, what Dr. GRPO changes and why, DAPO, the connections to PPO, how batches are aggregated, etc.
I know this format won't be for everyone, but I hope some of you love it!
RLHF Book: https://rlhfbook.com/c/11-policy-gradients.html#reinforce-leave-one-out-rloo
DeepSeekMath paper: https://arxiv.org/pdf/2402.03300
Where does the ratio come from in PPO? https://ai.stackexchange.com/questions/37958/where-does-the-proximal-policy-optimization-objectives-ratio-term-come-from
DAPO: https://arxiv.org/pdf/2503.14476
DAPO announcement: https://x.com/qiying_yu/status/1902405115082104875
My DAPO recap: https://x.com/natolambert/status/1901758392043221072
Dr. GRPO: https://github.com/sail-sg/understand-r1-zero/blob/main/understand-r1-zero.pdf
Dr. GRPO announcement: https://x.com/zzlccc/status/1903162768083259703
TRL GRPO implementation: https://github.com/huggingface/trl/blob/07cfe1677e552b7d5c92b7740e5b2f0b057661d8/trl/trainer/grpo_trainer.py#L965
Unbiased GRPO implementation: https://github.com/sail-sg/oat/blob/7619b79a8804e813419faeda22bdd35cc4d9b9bd/oat/algorithms/ppo.py#L560
Thread on GRPO implementation on X: https://x.com/natolambert/status/1900639281791615387
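To ground the discussion, here's a minimal PyTorch sketch of the two bits the links above argue over: the group-relative advantage (Dr. GRPO drops GRPO's division by the group reward std) and the loss aggregation (GRPO's per-sequence 1/|o_i| average vs. DAPO's token-level average; Dr. GRPO likewise drops the per-sequence length term, normalizing by a constant instead). This is my own illustrative rewrite, not the TRL or oat code, and the function names are made up.

```python
# A sketch of the core GRPO / Dr. GRPO / DAPO formulas, for illustration only.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """GRPO (DeepSeekMath): whiten rewards within each group of completions.

    rewards: (num_prompts, group_size), one scalar reward per completion.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def dr_grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Dr. GRPO: keep the group-mean baseline, drop the division by std.

    Dividing by the std up-weights questions whose rewards barely vary
    (nearly all-right or all-wrong groups); removing it is one of the two
    biases Dr. GRPO fixes.
    """
    return rewards - rewards.mean(dim=-1, keepdim=True)

def grpo_loss_agg(per_token_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Original GRPO aggregation: mean over tokens per sequence (the 1/|o_i|
    term), then mean over the batch, so long responses are down-weighted
    per token."""
    return ((per_token_loss * mask).sum(-1) / mask.sum(-1)).mean()

def token_level_loss_agg(per_token_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """DAPO-style token-level aggregation: mean over all response tokens in
    the batch, so every token counts equally regardless of sequence length."""
    return (per_token_loss * mask).sum() / mask.sum()
```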