DeepSeek R1 Theory Tutorial – Architecture, GRPO, KL Divergence

freeCodeCamp.org 26,262 views 1 month ago

Learn about DeepSeek R1's innovative AI architecture from @deeplearningexplained. The course explores how R1 achieves exceptional reasoning through reinforcement learning, focusing on Group Relative Policy Optimization (GRPO) and how it improves upon traditional PPO methods. You'll also understand KL divergence's role in model stability, with practical code examples and clear mathematical explanations.

❤️ Try interactive AI courses we love, right in your browser: https://scrimba.com/freeCodeCamp-AI (Made possible by a grant from our friends at Scrimba)

Contents
⌨️ (0:00:00) Introduction
⌨️ (0:01:49) R1 Overview - Overview
⌨️ (0:03:52) R1 Overview - DeepSeek R1-zero path
⌨️ (0:05:32) R1 Overview - Reinforcement learning setup
⌨️ (0:08:36) R1 Overview - Group Relative Policy Optimization (GRPO)
⌨️ (0:13:04) R1 Overview - DeepSeek R1-zero result
⌨️ (0:16:53) R1 Overview - Cold start supervised fine-tuning
⌨️ (0:17:44) R1 Overview - Consistency reward for CoT
⌨️ (0:18:35) R1 Overview - Supervised Fine tuning data generation
⌨️ (0:21:06) R1 Overview - Reinforcement learning with neural reward model
⌨️ (0:22:53) R1 Overview - Distillation
⌨️ (0:26:16) GRPO - Overview
⌨️ (0:26:55) GRPO - PPO vs GRPO
⌨️ (0:30:25) GRPO - PPO formula overview
⌨️ (0:33:25) GRPO - GRPO formula overview
⌨️ (0:36:48) GRPO - GRPO pseudo code
⌨️ (0:38:56) GRPO - GRPO Trainer code
⌨️ (0:49:24) KL Divergence - Overview
⌨️ (0:49:55) KL Divergence - KL Divergence in GRPO vs PPO
⌨️ (0:51:20) KL Divergence - KL Divergence refresher
⌨️ (0:55:32) KL Divergence - Monte Carlo estimation of KL divergence
⌨️ (0:56:43) KL Divergence - Schulman blog
⌨️ (0:57:38) KL Divergence - k1 = log(q/p)
⌨️ (1:00:01) KL Divergence - k2 = 0.5*(log(p/q))^2
⌨️ (1:02:19) KL Divergence - k3 = (p/q - 1) - log(p/q)
⌨️ (1:04:44) KL Divergence - benchmarking
⌨️ (1:07:28) Conclusion
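
The three estimators named in the KL Divergence chapters can be compared with a small Monte Carlo experiment. This is a minimal sketch, not the code shown in the video: it assumes samples x are drawn from q, with ratio r = p(x)/q(x), so that all three quantities estimate KL(q || p). The choice of two unit-variance Gaussians and the names `log_normal_pdf` and `kl_estimates` are illustrative assumptions.

```python
import math
import random


def log_normal_pdf(x, mean, std=1.0):
    """Log density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def kl_estimates(num_samples=100_000, p_mean=0.1, seed=0):
    """Monte Carlo estimates of KL(q || p) for q = N(0, 1), p = N(p_mean, 1).

    Samples come from q. With r = p(x) / q(x):
      k1 = -log r               (unbiased, can be negative per sample)
      k2 = 0.5 * (log r)^2      (biased, never negative)
      k3 = (r - 1) - log r      (unbiased, never negative)
    """
    rng = random.Random(seed)
    k1_sum = k2_sum = k3_sum = 0.0
    k3_min = float("inf")
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)  # draw x from q
        log_r = log_normal_pdf(x, p_mean) - log_normal_pdf(x, 0.0)  # log(p/q)
        k1 = -log_r
        k2 = 0.5 * log_r ** 2
        k3 = math.expm1(log_r) - log_r  # expm1(log r) equals r - 1
        k1_sum += k1
        k2_sum += k2
        k3_sum += k3
        k3_min = min(k3_min, k3)
    return {
        # Closed form for two unit-variance Gaussians: (mean difference)^2 / 2
        "true_kl": 0.5 * p_mean ** 2,
        "k1": k1_sum / num_samples,
        "k2": k2_sum / num_samples,
        "k3": k3_sum / num_samples,
        "k3_min": k3_min,
    }
```

Calling `kl_estimates()` returns the closed-form KL alongside the three sample averages, so the estimates can be compared directly; `k3_min` records the smallest per-sample value of k3, which should never fall below zero.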

🎉 Thanks to our Champion and Sponsor supporters:
👾 Drake Milly
👾 Ulises Moralez
👾 Goddard Tan
👾 David MG
👾 Matthew Springman
👾 Claudio
👾 Oscar R.
👾 jedi-or-sith
👾 Nattira Maneerat
👾 Justin Hual

--

Learn to code for free and get a developer job: https://www.freecodecamp.org

Read hundreds of articles on programming: https://freecodecamp.org/news
