Speculative Decoding and Efficient LLM Inference with Chris Lott - 717

The TWIML AI Podcast with Sam Charrington · 1,017 views · 3 months ago

Today, we're joined by Chris Lott, senior director of engineering at Qualcomm AI Research, to discuss accelerating large language model inference. We explore the challenges presented by LLM encoding and decoding (aka generation) and how these phases interact with hardware constraints such as FLOPS, memory footprint, and memory bandwidth to limit key inference metrics such as time-to-first-token, tokens per second, and tokens per joule. We then dig into a variety of techniques that can be used to accelerate inference, such as KV compression, quantization, pruning, speculative decoding, and leveraging small language models (SLMs). We also discuss future directions for enabling on-device agentic experiences, such as parallel generation and software tools like Qualcomm AI Orchestrator.
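For listeners new to the idea, here is a minimal sketch of greedy speculative decoding using toy stand-in models (the function names and toy token rules below are illustrative assumptions, not anything from the episode): a small draft model cheaply proposes a block of tokens, and the large target model verifies them, accepting the longest agreeing prefix and correcting the first mismatch.

```python
import random

def target_next(context):
    # "Large" target model: a deterministic toy rule standing in for an
    # expensive LLM forward pass that returns the greedy next token.
    return (sum(context) * 31 + 7) % 50

def draft_next(context):
    # "Small" draft model: cheap, and agrees with the target most of the time.
    t = target_next(context)
    return t if random.random() < 0.8 else (t + 1) % 50

def speculative_decode(context, num_tokens, k=4):
    """Generate num_tokens tokens, drafting k at a time with the small
    model and verifying them with the target model (greedy acceptance)."""
    out = list(context)
    while len(out) - len(context) < num_tokens:
        # 1) Draft k tokens cheaply, feeding each back into the draft context.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: accept draft tokens while the target agrees; at the
        #    first mismatch, emit the target's own token instead and redraft.
        #    (In a real system this whole loop is one batched target pass.)
        ctx = list(out)
        for t in draft:
            correct = target_next(ctx)
            if t == correct:
                out.append(t)
                ctx.append(t)
            else:
                out.append(correct)
                break
    return out[len(context):len(context) + num_tokens]
```

Because acceptance is greedy, the output is identical to decoding with the target model alone; the speedup comes from verifying several drafted tokens per expensive target pass instead of generating one token at a time.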

🎧 / 🎥 Listen or watch the full episode on our page: https://twimlai.com/go/717.

🔔 Subscribe to our channel for more great content just like this: https://youtube.com/twimlai?sub_confirmation=1


🗣️ CONNECT WITH US!
===============================
Subscribe to the TWIML AI Podcast: https://twimlai.com/podcast/twimlai/
Follow us on Twitter: https://twitter.com/twimlai
Follow us on LinkedIn: https://www.linkedin.com/company/twimlai/
Join our Slack Community: https://twimlai.com/community/
Subscribe to our newsletter: https://twimlai.com/newsletter/
Want to get in touch? Send us a message: https://twimlai.com/contact/


📖 CHAPTERS
===============================
00:00 - Introduction
03:54 - LLMs on the edge
05:47 - The relationship of databases and models in personalization
07:11 - Latency
11:18 - Device constraints
16:42 - Encoding vs. decoding and LLM metrics
19:14 - Optimizing LLMs for edge deployment
25:36 - SLMs
32:39 - KV caches
39:36 - KV compression and model architectures
47:16 - Hybrid AI
50:58 - Speculative decoding
1:06:01 - Self-speculative decoding
1:08:55 - Reasoning models
1:12:02 - Inference scaling
1:14:19 - Future directions


🔗 LINKS & RESOURCES
===============================
Why Qualcomm AI Orchestrator is the key to next generation AI experiences - https://www.qualcomm.com/news/onq/2024/10/why-qualcomm-ai-orchestrator-is-key-to-next-gen-ai-experiences
Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement - https://arxiv.org/abs/2402.14160
Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs
On Speculative Decoding for Multimodal Large Language Models - https://arxiv.org/abs/2404.08856


📸 Camera: https://amzn.to/3TQ3zsg
🎙️Microphone: https://amzn.to/3t5zXeV
🚦Lights: https://amzn.to/3TQlX49
🎛️ Audio Interface: https://amzn.to/3TVFAIq
🎚️ Stream Deck: https://amzn.to/3zzm7F5
