Today we have Philip Kiely from Baseten on the show. Baseten is a Series B startup focused on providing infrastructure for AI workloads.
We go deep on Inference Optimization: choosing a model, the hype around Compound AI, picking an Inference Engine, optimization techniques like Quantization and Speculative Decoding, all the way down to your choice of GPU.
Timestamps
01:16 Start
05:43 Why focus on Inference?
11:15 Model Selection
16:52 Saving Costs
20:07 Saturating a GPU
21:28 When does it make sense to fine-tune?
23:12 Compound AI
29:18 Performance
31:09 Why is inference slow?
33:28 Techniques to optimize inference
50:54 Choice of GPUs
58:19 Programming Language Choice
59:48 Quick Fire