
Mastering LLM Inference Optimization From Theory to Cost-Effective Deployment: Mark Moyou

AI Engineer

LLM inference is not your normal deep learning model deployment, nor is it trivial when it comes to managing scale, performance, and cost. Sizing a production-grade LLM deployment requires an understanding of the model(s), the compute hardware, quantization and parallelization methods, KV Cache budgets, input and output token length predictions, model adapter management, and much more. If you want to deeply understand these topics and their effects on LLM inference cost and performance, you will enjoy this talk.

This talk covers the following topics:

- Why LLM inference is different from standard deep learning inference
- Overview of current and future NVIDIA GPUs: which GPU(s) for which models, and why
- Understanding the importance of building inference engines
- Deep recap of the attention mechanism, along with the popular attention variants used in production (see the attention sketch after this description)
- Deep dive on the KV Cache and managing KV Cache budgets to increase throughput per model deployment (see the sizing sketch below)
- Parallelism to reduce latency: mainly tensor parallelism, with data, sequence, pipeline, and expert parallelism also highlighted (see the tensor-parallel sketch below)
- Quantization of weights, activations, and the KV Cache to shrink engine sizes for more effective GPU utilization (see the quantization sketch below)
- Increasing throughput with in-flight batching and other techniques
- Detailed performance analysis of LLM deployments, covering time to first token, inter-token latency, deployment characterization, and other measurements that can help reduce deployment costs (see the latency-measurement sketch below)

The main inference engine referenced in the talk is TRT-LLM, alongside the open-source inference server NVIDIA Triton.

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025

About Mark

Dr. Mark Moyou is a Senior Data Scientist at NVIDIA on the Retail team, focused on enabling scalable machine learning for the nation's top retailers. Before NVIDIA, he was a Data Science Manager in the Professional Services division at Lucidworks, an enterprise search and recommendations company. Prior to Lucidworks, he was a founding Data Scientist at Alstom Transportation, where he applied data science to the railroad industry in the US. Mark holds a PhD and MSc in Systems Engineering and a BSc in Chemical Engineering. On the side, Mark hosts The AI Portfolio Podcast, The Caribbean Tech Pioneers, and the Progress Guaranteed Podcast, and is Director of the Southern Data Science Conference in Atlanta.
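For the attention recap, here is a minimal single-head NumPy sketch of vanilla scaled dot-product attention; the shapes and single-head setup are simplifying assumptions for illustration, not details taken from the talk.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Vanilla attention: softmax(Q K^T / sqrt(d)) V.
    Q is (seq_q, d); K and V are (seq_k, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (seq_q, seq_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # (seq_q, d) weighted values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```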
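On KV Cache budgets, a back-of-the-envelope sizing helper: per token, every layer stores one key and one value vector per KV head. The dimensions in the example are hypothetical (roughly Llama-2-7B-shaped), not figures from the talk.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """KV Cache footprint; bytes_per_elem=2 assumes FP16/BF16."""
    # One K and one V vector per KV head, per layer, per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size

# Hypothetical: 32 layers, 32 KV heads, head_dim 128, 4096-token context, batch 8
size = kv_cache_bytes(32, 32, 128, 4096, 8)
print(f"{size / 2**30:.0f} GiB")  # 16 GiB of GPU memory just for the cache
```

Halving bytes_per_elem via KV Cache quantization, or budgeting for shorter sequences, directly frees memory for larger batches, which is what the KV Cache budget framing is about.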
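To show the core idea of tensor parallelism, the toy below splits a weight matrix column-wise across simulated devices; the concatenation at the end stands in for the collective (all-gather) that real multi-GPU deployments perform. This is only a single-process illustration, not how TRT-LLM implements it.

```python
import numpy as np

def column_parallel_matmul(x, w, num_shards):
    """Each 'device' holds one column shard of w and computes a
    partial output; concatenating shards reproduces x @ w exactly."""
    assert w.shape[1] % num_shards == 0
    shards = np.split(w, num_shards, axis=1)  # one shard per simulated GPU
    partials = [x @ s for s in shards]        # independent, parallelizable matmuls
    return np.concatenate(partials, axis=-1)

x = np.random.randn(2, 16)
w = np.random.randn(16, 64)
assert np.allclose(column_parallel_matmul(x, w, 4), x @ w)
```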
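For quantization, one common scheme is symmetric per-channel INT8 weight quantization, sketched below. The talk covers several weight, activation, and KV Cache methods; this toy only shows why quantized engines are smaller (one byte per weight plus a small scale vector, instead of two bytes at FP16).

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-output-channel INT8 quantization:
    one FP32 scale per row, int8 weights in [-127, 127]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize_int8(q, s)).max())  # small round-trip error
```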
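Finally, for the performance-analysis topics, time to first token (TTFT) and inter-token latency fall out of simple timestamps on a token stream. The `stream` argument and the fake generator are hypothetical stand-ins for whatever streaming client a deployment exposes.

```python
import time

def measure_latencies(stream):
    """Return (TTFT, mean inter-token latency) for a token iterator."""
    start = time.perf_counter()
    stamps = [time.perf_counter() for _ in stream]  # one timestamp per token
    ttft = stamps[0] - start
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    return ttft, (sum(gaps) / len(gaps)) if gaps else 0.0

def fake_stream(n=5, delay=0.01):
    """Pretend generator that yields a token every `delay` seconds."""
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, itl = measure_latencies(fake_stream())
print(f"TTFT {ttft * 1e3:.1f} ms, inter-token {itl * 1e3:.1f} ms")
```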
