In this session, we brought on model compression expert Eldar Kurtić to discuss Model Quantization for Efficient vLLM Inference. Eldar covered the why, when, and how of quantizing LLMs for efficient inference, and introduced llm-compressor, a new library for compressing LLMs while preserving accuracy for inference in vLLM. A short sketch of the workflow follows.
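For reference, here is a minimal sketch of what one-shot weight quantization with llm-compressor can look like. The checkpoint name, calibration dataset, and output directory are illustrative and not taken from the session; consult the llm-compressor documentation for the exact, current API.

```python
# Minimal sketch: one-shot W4A16 (4-bit weight) quantization with llm-compressor.
# Checkpoint, dataset, and output path below are illustrative placeholders.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Quantize all Linear layers to 4-bit weights, keep the lm_head in full precision.
recipe = GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",   # illustrative base model
    dataset="open_platypus",                         # calibration dataset
    recipe=recipe,
    output_dir="Meta-Llama-3.1-8B-Instruct-W4A16",   # quantized checkpoint for vLLM
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The resulting directory can then be loaded directly by vLLM like any other Hugging Face checkpoint.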
Additionally, we touched on the vLLM v0.5.2 and v0.5.3 releases, including model support for Llama 3.1, Mistral-Nemo, and Chameleon. We also provided an update on AWQ Marlin and CPU offloading features.
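As a rough illustration of the serving side, the sketch below loads a quantized checkpoint in vLLM and uses the new CPU offloading option; the model path and offload size are assumptions, not values from the session, and vLLM picks the quantization kernel (e.g. AWQ Marlin for AWQ checkpoints) automatically when supported.

```python
# Minimal sketch: serving a quantized checkpoint with vLLM and CPU weight offloading.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Meta-Llama-3.1-8B-Instruct-W4A16",  # e.g. an llm-compressor output dir (illustrative)
    cpu_offload_gb=4,                          # offload ~4 GiB of weights to CPU memory
)

outputs = llm.generate(
    ["What is model quantization?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```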
Check out the session slides here: https://docs.google.com/presentation/d/1BhJmAP6ma2IuboExWB3USE12bjf4f5UW
Join our bi-weekly vLLM office hours to stay current with vLLM, ask questions, meet the community, and give feedback: https://hubs.li/Q02Y5Pbh0