Why are DeepSeek V3 and R1 so Fast and Cheap?

Semi Learned Lectures 1,376 lượt xem 2 months ago

Video Not Working? Fix It Now

The Speed (and cost) advantage of DeepSeek-R1 is only partially due their new the training innovations (DRPO). Most of it is due to very fast DeepSeek-V3 model that it is based on.

We describe the 5 ways in ways Deepseek-V3 improves speed in simple language that is accessible to many. We also give an assessment on how quickly other LLM vendors will be able to emulate the exploits of DeepSeekV3.

Best heard at 1.25 or 1.5x Speed

@semilearned on 10x Engineers: “Pareto's 80-20 Rule, 10x Software Engineers, Stalin & Amdahl”:
https://www.youtube.com/watch?v=Am0-R_nBFDw

@semilearned introduction to training: “How to train Language Models like DeepSeek-R1”:
https://www.youtube.com/watch?v=T8fpusY8BB8

Chapters:
00:00 Introduction
01:20 Advantage of DRPO in Training
06:22 DeepSeekV3 is the Reason for Speed
07:04 Multi-Token Generation
10:15 The 3 Bottlenecks of LLMs
12:20 DeepSeek-V3 uses Low Precision numbers
16:22 DeepSeek-V3 uses Mixture of Experts
21:03 DeepSeek-V3’s uses MLA (Latent Attention)
24:50 DeepSeek-V3 Clever System & Software Engineering
26:32 Reflections

Comment