vLLM Office Hours - Model Quantization for Efficient vLLM Inference - July 25, 2024

Neural Magic

In this session, we brought on model compression expert Eldar Kurtić to discuss Model Quantization for Efficient vLLM Inference. Eldar shared the why, when, and how of quantizing LLMs for efficient inference, and introduced llm-compressor, a new library for optimizing LLMs for accurate inference in vLLM. We also touched on the vLLM v0.5.2 and v0.5.3 releases, including model support for Llama 3.1, Mistral-Nemo, and Chameleon, and provided an update on the AWQ Marlin and CPU offloading features.

Session slides: https://docs.google.com/presentation/d/1BhJmAP6ma2IuboExWB3USE12bjf4f5UW

Join our bi-weekly vLLM office hours to stay current with vLLM, ask questions, meet the community, and give feedback: https://hubs.li/Q02Y5Pbh0
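For a sense of what the llm-compressor workflow looks like, here is a minimal sketch of its one-shot quantization flow followed by loading the result in vLLM. The model, calibration dataset, and W4A16 scheme are illustrative assumptions, not choices taken from the session.

```python
# Sketch: one-shot weight quantization with llm-compressor, then serving in vLLM.
# Model, dataset, and scheme below are illustrative assumptions, not from the talk.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Quantize all Linear layers to 4-bit weights (W4A16), skipping the output head.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # any Hugging Face causal LM
    dataset="open_platypus",                        # calibration data for GPTQ
    recipe=recipe,
    output_dir="Meta-Llama-3.1-8B-Instruct-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)

# The saved checkpoint loads directly into vLLM for inference.
from vllm import LLM

llm = LLM("Meta-Llama-3.1-8B-Instruct-W4A16")
print(llm.generate("What is quantization?"))
```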
