In this video, we discuss Mixture of Experts (MoE) Transformers - the architecture behind popular LLMs like DeepSeek V3, Mixtral 8x22B, and more. You will learn about concepts like Dense MoEs, Sparse MoEs, Top-K Routing, Noisy Routing, Expert Capacity, Switch Transformers, Auxiliary load balancing losses, and much more. Everything is presented visually to help you conceptualize what is going on, and code snippets are provided to make it more concrete!
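To give a flavor of the code snippets, here is a minimal, illustrative sketch of a sparse MoE layer with top-k routing in PyTorch. This is not the exact code from the video: the names and dimensions are assumptions chosen for brevity, and for simplicity every expert processes all tokens while only the selected outputs are weighted (a real sparse MoE dispatches only the routed tokens to each expert).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy sparse Mixture-of-Experts layer with top-k routing (illustrative only)."""
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating network: one logit per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                      # x: (batch, seq, d_model)
        logits = self.router(x)                                 # (batch, seq, num_experts)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        weights = F.softmax(topk_vals, dim=-1)                  # renormalize over the selected experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (topk_idx == i)                              # tokens that routed to expert i
            if mask.any():
                w = (weights * mask).sum(dim=-1, keepdim=True)  # per-token weight for expert i (0 if not selected)
                out = out + w * expert(x)
        return out

# quick shape check
moe = SparseMoE()
print(moe(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])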
Join the channel on Patreon to receive updates and get access to the bonus content used in all my videos. You will get the slides, notebooks, code snippets, word docs, and animations that went into producing this video. Here is the link:
https://www.patreon.com/NeuralBreakdownwithAVB
Visit AI Agent Store Page: https://aiagentstore.ai/?ref=avishek
#pytorch #transformers #deepseek
Videos and playlists you may like:
Attention to Transformers playlist: https://www.youtube.com/playlist?list=PLGXWtN1HUjPfq0MSqD5dX8V7Gx5ow4QYW
Guide to fine-tuning open source LLMs: https://youtu.be/bZcKYiwtw1I
Generative Language Modeling from scratch: https://youtu.be/s3OUzmUDdg8
References and additional links:
Sparse Mixture of Experts paper: https://arxiv.org/abs/1701.06538
Mixtral of Experts: https://arxiv.org/abs/2401.04088
DeepSeek V2: https://arxiv.org/abs/2405.04434
DeepSeek V3: https://arxiv.org/abs/2412.19437
Switch Transformers / Expert Capacity: https://arxiv.org/abs/2101.03961
A blog post: https://brunomaga.github.io/Mixture-of-Experts
A visual guide: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts
Survey paper: https://arxiv.org/pdf/2407.06204
Timestamps:
0:00 - Intro
1:52 - Mixture of Experts Intuition
4:53 - Transformers 101
9:20 - Dense MoEs
14:50 - Sparse MoEs
16:34 - Router Collapse and Top-K Routing
19:20 - Noisy Top-K, Load Balancing
20:56 - Routing Analysis by Mixtral
22:30 - Auxiliary Losses & DeepSeek
24:05 - Expert Capacity
26:07 - 6 Points to Remember