PyTorch Expert Exchange Webinar: DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Inference, with Hao Zhang, Assistant Professor at the Halıcıoğlu Data Science Institute and the Department of Computer Science and Engineering (affiliate) at UC San Diego.
In this talk, I'll present our work DistServe (OSDI'24). DistServe disaggregates the prefill and decoding computation to eliminate interference between the two phases, thereby improving the serving performance of large language models (LLMs). DistServe has been adopted in frameworks such as vLLM and at companies including Google.
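For a concrete picture of the idea before watching, below is a minimal, framework-agnostic Python sketch of prefill/decoding disaggregation. All names here (PrefillWorker, DecodeWorker, KVCache) are hypothetical placeholders, not DistServe's or vLLM's actual API; the point is only that the compute-bound prompt-processing phase and the latency-sensitive token-generation phase run on separate workers and hand off state via the KV cache.

```python
# Illustrative sketch only: class and function names are hypothetical and do
# not correspond to DistServe's or vLLM's real interfaces.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Toy stand-in for the per-request key/value cache produced by prefill."""
    prompt_tokens: list[int]
    layers: dict = field(default_factory=dict)  # layer index -> cached state


class PrefillWorker:
    """Runs the compute-bound prefill phase on its own GPU pool."""

    def prefill(self, prompt_tokens: list[int]) -> tuple[int, KVCache]:
        # One forward pass over the whole prompt; in a real system this
        # saturates the GPU, which is why colocating it with decoding
        # causes interference and hurts per-token decoding latency.
        first_token = sum(prompt_tokens) % 50_000  # dummy "argmax of logits"
        cache = KVCache(prompt_tokens, {0: list(prompt_tokens)})
        return first_token, cache


class DecodeWorker:
    """Runs the memory-bound, latency-sensitive decoding phase separately."""

    def decode(self, first_token: int, cache: KVCache, max_new_tokens: int) -> list[int]:
        # Autoregressive generation, one token per step, reusing the KV cache
        # handed off by the prefill worker.
        tokens = [first_token]
        for _ in range(max_new_tokens - 1):
            tokens.append((tokens[-1] * 31 + len(cache.prompt_tokens)) % 50_000)
        return tokens


if __name__ == "__main__":
    prompt = [101, 2023, 2003, 1037, 3231, 102]                  # pretend token IDs
    first, kv = PrefillWorker().prefill(prompt)                  # phase 1: prefill
    output = DecodeWorker().decode(first, kv, max_new_tokens=8)  # phase 2: decode
    print("generated token IDs:", output)
```

In the actual system, prefill and decoding instances are placed on different GPUs and the KV cache is transferred between them, so each phase can be scaled and scheduled to meet its own latency target.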
Slides available at: https://drive.google.com/file/d/1MDw6zBzQFc2mkgUCy09ORwFRZYb-UuyU/view