At Ray Summit 2024, Sangbin Cho from Anyscale and Murali Andoorveedu from CentML explore the development and future of multi-GPU inference in vLLM. Their presentation focuses on the unique challenges posed by distributed inference for large language models, distinguishing it from distributed training.
Cho and Andoorveedu delve into various parallelism strategies, including tensor parallelism, pipeline parallelism, and expert parallelism, explaining how each works. Using vLLM as a case study, they demonstrate how to build an architecture for efficient distributed inference. The talk offers insight into the complexities of scaling LLM inference across multiple GPUs, along with a glimpse of the roadmap for future development in this critical area of AI infrastructure.
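If you want to experiment with the parallelism strategies covered in the talk, here is a minimal sketch using vLLM's offline LLM API. The model name and parallel sizes are illustrative assumptions, not values from the presentation.

```python
# Minimal sketch: combining tensor and pipeline parallelism in vLLM.
# Model name and degrees of parallelism are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed example model
    tensor_parallel_size=4,    # shard each layer's weights across 4 GPUs
    pipeline_parallel_size=2,  # split the layer stack across 2 GPU groups
)

outputs = llm.generate(
    ["What is distributed inference?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```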
--
Interested in more?
- Watch the full Day 1 Keynote: https://youtu.be/jwZHJthQvXo
- Watch the full Day 2 Keynote: https://youtu.be/Lury2ad6KG8
--
🔗 Connect with us:
- Subscribe to our YouTube channel: https://www.youtube.com/@anyscale
- Twitter: https://x.com/anyscalecompute
- LinkedIn: https://linkedin.com/company/joinanyscale/
- Website: https://www.anyscale.com