GRPO 2.0? DAPO LLM Reinforcement Learning Explained

AI Papers Academy · 3,706 · 1 month ago


In this video, we break down "DAPO: An Open-Source LLM Reinforcement Learning System at Scale", a new research paper from ByteDance. It introduces DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a reinforcement learning (RL) algorithm built on GRPO (Group Relative Policy Optimization). DAPO tackles key challenges in training large language models (LLMs) with RL, especially the issues encountered when trying to reproduce DeepSeek-R1's results. The researchers trained Qwen2.5-32B with DAPO and reached 50 points on the challenging AIME 2024 benchmark, outperforming the 47 points of DeepSeek-R1-Zero-Qwen-32B while using only 50% of its training steps.

Written Review - https://aipapersacademy.com/dapo/
Paper - https://arxiv.org/abs/2503.14476
Code & Dataset - https://github.com/BytedTsinghua-SIA/DAPO

#ai #reinforcementlearning #llm #deepseek #grpo #dapo #rl #airesearch

_
🔔 Subscribe for more AI paper reviews!
📩 Join the newsletter → https://aipapersacademy.com/newsletter/
Patreon - https://www.patreon.com/aipapersacademy
The video was edited using VideoScribe - https://tidd.ly/44TZEiX
_

Chapters:
0:00 Introduction
2:30 Introducing DAPO
5:05 Clip-Higher
7:45 Dynamic Sampling
9:35 Token-Level Loss
11:13 Overlong Responses
12:23 Ablation Study
12:57 KL Divergence Removal
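To show how the techniques named in the chapter list fit on top of GRPO, here is a minimal Python sketch of a group-relative advantage, a Dynamic Sampling filter, a soft overlong-length penalty, and a token-level loss with decoupled (Clip-Higher) clipping and no KL term. It is not the code from the DAPO repository. All function names are hypothetical, per-token log-probabilities are assumed to arrive as plain lists, using the population standard deviation for advantage normalization is an assumed choice, and the default length values are illustrative. The clip bounds 0.2 and 0.28 follow the values reported in the paper.

```python
# A sketch, not the DAPO repository's implementation; all names are hypothetical.
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """GRPO-style advantage: (R_i - mean) / std over one prompt's group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std: an assumed choice
    if sigma == 0.0:
        # Identical rewards carry no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def keep_group(is_correct):
    """Dynamic Sampling filter: keep a prompt only if its group is neither
    all correct nor all wrong, since such groups get zero advantage."""
    n_correct = sum(is_correct)
    return 0 < n_correct < len(is_correct)


def soft_overlong_penalty(length, max_len=20480, cache_len=4096):
    """Length penalty added to the task reward: zero below max_len - cache_len,
    a linear ramp down to -1 inside the cache window, -1 beyond max_len."""
    safe_len = max_len - cache_len
    if length <= safe_len:
        return 0.0
    if length <= max_len:
        return (safe_len - length) / cache_len
    return -1.0


def dapo_token_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate with decoupled bounds, averaged over ALL tokens in the group.

    logp_new, logp_old: one list of per-token log-probs per response.
    advantages: one advantage per response, shared by all of its tokens.
    No KL penalty term is included.
    """
    total, n_tokens = 0.0, 0
    for new_seq, old_seq, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_seq, old_seq):
            ratio = math.exp(lp_new - lp_old)
            # Clip-Higher: the upper bound (1 + eps_high) is looser than the lower one.
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Token-level average; negated so that minimizing the loss maximizes the surrogate.
    return -total / n_tokens
```

For example, `group_advantages([1.0, -1.0, 1.0, -1.0])` returns `[1.0, -1.0, 1.0, -1.0]`, and `keep_group([True, True, True])` returns `False`, so that prompt would be dropped and resampled. Dividing by the total token count of the group, rather than averaging within each response first as GRPO does, gives every token equal weight, so tokens in long responses are not under-weighted. Raising only the upper clip bound leaves more room for low-probability tokens to grow, which is how Clip-Higher counters entropy collapse.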
