In this video, we discuss Mixture of Experts (MoE) Transformers - the architecture behind popular LLMs like DeepSeek V3, Mixtral 8x22B, and more. You will learn about concepts like Dense MoEs, Sparse MoEs, Top-K Routing, Noisy Routing, Expert Capacity, Switch Transformers, Auxiliary load balancing losses, and much more. Everything is presented visually to help you conceptualize what is going on, and code snippets are provided to make it more concrete!
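To give a flavor of the code snippets, here is a minimal, illustrative sketch of a sparse MoE layer with top-k routing in PyTorch. This is not the exact code from the video: the names and dimensions are assumptions chosen for brevity, and for simplicity every expert processes all tokens while only the selected outputs are weighted (a real sparse MoE dispatches only the routed tokens to each expert).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy sparse Mixture-of-Experts layer with top-k routing (illustrative only)."""
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating network: one logit per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                      # x: (batch, seq, d_model)
        logits = self.router(x)                                 # (batch, seq, num_experts)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        weights = F.softmax(topk_vals, dim=-1)                  # renormalize over the selected experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (topk_idx == i)                              # tokens that routed to expert i
            if mask.any():
                w = (weights * mask).sum(dim=-1, keepdim=True)  # per-token weight for expert i (0 if not selected)
                out = out + w * expert(x)
        return out

# quick shape check
moe = SparseMoE()
print(moe(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])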
Join the channel on Patreon to receive updates and get access to the bonus content used in all my videos. You will get the slides, notebooks, code snippets, word docs, and animations that went into producing this video. Here is the link:
https://www.patreon.com/NeuralBreakdownwithAVB
Visit AI Agent Store Page: https://aiagentstore.ai/?ref=avishek
#pytorch #transformers #deepseek
Videos and playlists you may like:
Attention to Transformers playlist: https://www.youtube.com/playlist?list=PLGXWtN1HUjPfq0MSqD5dX8V7Gx5ow4QYW
Guide to fine-tuning open source LLMs: https://youtu.be/bZcKYiwtw1I
Generative Language Modeling from scratch: https://youtu.be/s3OUzmUDdg8
References and additional links:
Sparse Mixture of Experts paper: https://arxiv.org/abs/1701.06538
Mixtral of Experts: https://arxiv.org/abs/2401.04088
DeepSeek V2: https://arxiv.org/abs/2405.04434
DeepSeek V3: https://arxiv.org/abs/2412.19437
Switch Transformers / Expert Capacity: https://arxiv.org/abs/2101.03961
A blog post: https://brunomaga.github.io/Mixture-of-Experts
A visual guide: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts
Survey paper: https://arxiv.org/pdf/2407.06204
Timestamps:
0:00 - Intro
1:52 - Mixture of Experts Intuition
4:53 - Transformers 101
9:20 - Dense MoEs
14:50 - Sparse MoEs
16:34 - Router Collapse and Top-K Routing
19:20 - Noisy Top-K, Load Balancing
20:56 - Routing Analysis by Mixtral
22:30 - Auxiliary Losses & DeepSeek
24:05 - Expert Capacity
26:07 - 6 Points to Remember