PyTorch Expert Exchange Webinar: DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Inference, with Hao Zhang, Assistant Professor at the Halıcıoğlu Data Science Institute and the Department of Computer Science and Engineering (affiliate) at UC San Diego.
In this talk, I'll present our work DistServe (OSDI'24). DistServe disaggregates the prefill and decoding computation to eliminate interference between the two phases, thereby improving the serving performance of large language models (LLMs). DistServe has been adopted in frameworks such as vLLM and at companies including Google.
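For a concrete picture of the idea before watching, below is a minimal, framework-agnostic Python sketch of prefill/decoding disaggregation. All names here (PrefillWorker, DecodeWorker, KVCache) are hypothetical placeholders, not DistServe's or vLLM's actual API; the point is only that the compute-bound prompt-processing phase and the latency-sensitive token-generation phase run on separate workers and hand off state via the KV cache.

```python
# Illustrative sketch only: class and function names are hypothetical and do
# not correspond to DistServe's or vLLM's real interfaces.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Toy stand-in for the per-request key/value cache produced by prefill."""
    prompt_tokens: list[int]
    layers: dict = field(default_factory=dict)  # layer index -> cached state


class PrefillWorker:
    """Runs the compute-bound prefill phase on its own GPU pool."""

    def prefill(self, prompt_tokens: list[int]) -> tuple[int, KVCache]:
        # One forward pass over the whole prompt; in a real system this
        # saturates the GPU, which is why colocating it with decoding
        # causes interference and hurts per-token decoding latency.
        first_token = sum(prompt_tokens) % 50_000  # dummy "argmax of logits"
        cache = KVCache(prompt_tokens, {0: list(prompt_tokens)})
        return first_token, cache


class DecodeWorker:
    """Runs the memory-bound, latency-sensitive decoding phase separately."""

    def decode(self, first_token: int, cache: KVCache, max_new_tokens: int) -> list[int]:
        # Autoregressive generation, one token per step, reusing the KV cache
        # handed off by the prefill worker.
        tokens = [first_token]
        for _ in range(max_new_tokens - 1):
            tokens.append((tokens[-1] * 31 + len(cache.prompt_tokens)) % 50_000)
        return tokens


if __name__ == "__main__":
    prompt = [101, 2023, 2003, 1037, 3231, 102]                  # pretend token IDs
    first, kv = PrefillWorker().prefill(prompt)                  # phase 1: prefill
    output = DecodeWorker().decode(first, kv, max_new_tokens=8)  # phase 2: decode
    print("generated token IDs:", output)
```

In the actual system, prefill and decoding instances are placed on different GPUs and the KV cache is transferred between them, so each phase can be scaled and scheduled to meet its own latency target.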
Slides available at: https://drive.google.com/file/d/1MDw6zBzQFc2mkgUCy09ORwFRZYb-UuyU/view