In this session, we brought five vLLM core committers together to share DeepSeek’s Open Source Week releases and their integration with vLLM, alongside what’s new in vLLM v0.7.2 and v0.7.3. We dove into the key advancements: Multi-head Latent Attention (MLA) support for higher throughput, Multi-Token Prediction (MTP) for faster inference, support for MoE models with 256 experts, serving 671B-parameter models too large for a single H100 node, and FP8 block quantization for efficiency. These features push the limits of scalable, resource-efficient AI.
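As a taste of what this looks like in practice, here is a minimal sketch of serving a DeepSeek-scale model with vLLM across two nodes. It assumes a Ray cluster spanning two 8xH100 machines is already up; the model name and parallelism sizes are illustrative, and the MTP/speculative-decoding settings discussed in the session are covered in the slides rather than shown here.

```python
# Minimal sketch: a 671B-parameter MoE model is too large for one H100 node,
# so we combine tensor parallelism (within a node) with pipeline parallelism
# (across nodes). Assumes a pre-existing Ray cluster with 16 H100 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",   # FP8 block-quantized MoE checkpoint
    tensor_parallel_size=8,            # shard each layer across the 8 GPUs of a node
    pipeline_parallel_size=2,          # split the layers across 2 nodes
    trust_remote_code=True,
    max_model_len=8192,
)

outputs = llm.generate(
    ["Explain Multi-head Latent Attention in one paragraph."],
    SamplingParams(temperature=0.6, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```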
Session slides: https://docs.google.com/presentation/d/1h2Y7YbnbhuXrCh9rkQ33ZcC5MyB65oGK/
Join our bi-weekly vLLM Office Hours to learn about the latest features and updates: https://hubs.li/Q02Y5Pbh0