We discuss inference servers, backends, and platforms like Replicate where you can host models.
This is lesson 4 of 4 in our course on applied fine-tuning:
1. When & Why to Fine-Tune: https://youtu.be/cPn0nHFsvFg
2. Fine-Tuning w/Axolotl: https://youtu.be/mmsa4wDsiy0
3. Instrumenting & Evaluating LLMs: https://youtu.be/SnbGD677_u0
4. Deploying Fine-Tuned LLMs: https://youtu.be/GzEcyBykkdo
*00:00 Overview*
*01:24 Recap on LoRAs*
*06:28 Performance vs. Cost*
*10:18 Many Projects Are Not Real-Time*
*13:56 Exploring LoRA Training Directory and Pushing to HF Hub*
*15:15 HuggingFace Inference Endpoints Demo*
*18:30 Considerations When Deploying Models*
*20:25 Simple vs. Advanced Model Serving*
*22:04 Kinds of Model Serving*
*26:20 Honeycomb Example on Replicate*
*31:04 Honeycomb Example Code Walkthrough*
*41:33 Deploying Language Models*
*46:07 What Makes LLMs Slow*
*50:44 Making LLMs Fast*
*52:11 Continuous Batching*
*56:09 Performance Metrics*
*01:03:52 Simplifying Model Deployment*
*01:06:47 Simplifying Deployments with Replicate*
*01:09:31 Replicate Walkthrough*
*01:14:32 Cog-vLLM for Local Development*
*01:20:19 Predibase’s History*
*01:24:44 LoRAX Motivation and Idea*
*01:29:54 Issues with Merging Adapters*
*01:32:21 Challenges with QLoRA*
*01:34:53 Dequantizing QLoRA Weights*
*01:35:48 Deployment Considerations*
*01:42:53 Speculative Decoding*
*01:47:10 Throughput vs. Latency*
*01:55:17 Improving Latency and Throughput*
*02:02:38 Deploying on Modal*
*02:07:44 Modal Demo*
*02:12:55 LLM Demo on Modal*
*02:19:29 OpenAI-Compatible Endpoint Demo on Modal*
*02:23:37 Q&A Session*
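
If you want to try an endpoint like the one in the 02:19:29 demo, here is a minimal client sketch for querying an OpenAI-compatible server with the official `openai` Python package. The base URL, API key, and model name below are placeholders, not the values from the video — substitute whatever your own deployment prints.

```python
# Minimal sketch: call an OpenAI-compatible endpoint (e.g., one served from
# Modal, as in the 02:19:29 demo). base_url, api_key, and model are
# placeholders for your own deployment's values.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace--your-app.modal.run/v1",  # hypothetical URL
    api_key="placeholder",  # many self-hosted servers accept any non-empty key
)

response = client.chat.completions.create(
    model="your-fine-tuned-model",  # placeholder model id
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```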