
vLLM Office Hours - Distributed Inference with vLLM - January 23, 2025

Neural Magic · 2,331 views · 2 months ago

In this session, we explored the motivation for distributed inference, delving into the vLLM architecture and GPU parallelism to enhance performance. We discussed the challenges of serving large models, introduced the concept of tensor parallelism, and examined the benefits and trade-offs of leveraging multiple GPUs for inference. We also highlighted profiling tools for analyzing kernel performance and overhead, along with the potential challenges of adopting a disaggregated approach with separate nodes for prefill and decode.
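As a minimal sketch of what tensor parallelism looks like from the user's side, vLLM exposes it through a single tensor_parallel_size argument; the model name below is illustrative and assumes a node with two GPUs:

    from vllm import LLM, SamplingParams

    # tensor_parallel_size=2 shards each weight matrix across two GPUs,
    # so a model too large for one device can still be served as one engine.
    # The model choice is an assumption for illustration.
    llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=2)

    outputs = llm.generate(["What is distributed inference?"],
                           SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)

Behind that one argument, vLLM splits attention heads and MLP weights across the GPUs and inserts all-reduce communication after each sharded block, which is the source of the benefits and trade-offs covered in the session.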

During the open discussion, we addressed various community questions, including practical applications of tensor parallelism in real-world scenarios, the impact of distributed inference on latency and throughput, and strategies for optimizing multi-GPU setups.
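One concrete way to observe the latency and throughput impact discussed above is to time a fixed batch under a given tensor-parallel size. This is a rough sketch, not a rigorous benchmark: the model and prompts are placeholders, and you would run the script once per configuration (e.g., tensor_parallel_size=1, 2, 4) and compare the numbers:

    import time
    from vllm import LLM, SamplingParams

    # Model and prompts are illustrative; vary tensor_parallel_size per run.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)
    prompts = ["Summarize tensor parallelism in one paragraph."] * 8
    params = SamplingParams(max_tokens=128)

    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start

    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{elapsed:.2f} s for {len(prompts)} prompts, "
          f"{generated / elapsed:.1f} generated tokens/s")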

Session slides: https://docs.google.com/presentation/d/10o1olgyQ3UH1AMQ_uln7ptXNahZRFdhZ/

Join our bi-weekly vLLM Office Hours to learn about the latest features and updates: https://hubs.li/Q02Y5Pbh0
