Here's an overview of the DeepSeek R1 paper. I read the paper this week and was fascinated by the methods; however, it was a bit difficult to follow what was going on with all the models being used.
I found a neat map of the methodology, which I'll be using in this tutorial to walk you through the paper.
I still strongly recommend reading the paper itself here:
📌 PAPER: https://arxiv.org/pdf/2501.12948
and also checking out these two videos for the GRPO part:
📌 https://www.youtube.com/watch?v=XMnxKGVnEUc&ab_channel=UmarJamil
📌 https://www.youtube.com/watch?v=bAWV_yrqx4w&ab_channel=YannicKilcher
btw, the map I'm using is over here:
https://www.reddit.com/r/LocalLLaMA/comments/1i66j4f/deepseekr1_training_pipeline_visualized/
Table of Contents
- Introduction: 0:00
- DeepSeek-R1-Zero path: 2:23
- Reinforcement learning setup: 3:59
- Group Relative Policy Optimization (GRPO): 7:03
- DeepSeek-R1-Zero results: 11:40
- Cold-start supervised fine-tuning: 15:30
- Consistency reward for CoT: 16:19
- Supervised fine-tuning data generation: 17:17
- Reinforcement learning with neural reward model: 19:47
- Distillation: 21:26
- Conclusion: 24:34
----
Join the newsletter for weekly AI content: https://yacinemahdid.com
Join the Discord for general discussion: https://discord.gg/QpkxRbQBpf
----
Follow Me Online Here:
GitHub: https://github.com/yacineMahdid
LinkedIn: https://www.linkedin.com/in/yacinemahdid/
___
Have a great week! 👋