We discuss inference servers, backends, and platforms like Replicate where you can host models.
This is lesson 4 of 4 in our course on applied fine-tuning:
1. When & Why to Fine-Tune: https://youtu.be/cPn0nHFsvFg
2. Fine-Tuning w/Axolotl: https://youtu.be/mmsa4wDsiy0
3. Instrumenting & Evaluating LLMs: https://youtu.be/SnbGD677_u0
4. Deploying Fine-Tuned LLMs: https://youtu.be/GzEcyBykkdo
*00:00 Overview*
*01:24 Recap on LoRAs*
*06:28 Performance vs. Cost*
*10:18 Many Projects Are Not Real-Time*
*13:56 Exploring LoRA Training Directory and Pushing to HF Hub*
*15:15 HuggingFace Inference Endpoints Demo*
*18:30 Considerations When Deploying Models*
*20:25 Simple vs. Advanced Model Serving*
*22:04 Kinds of Model Serving*
*26:20 Honeycomb Example on Replicate*
*31:04 Honeycomb Example Code Walkthrough*
*41:33 Deploying Language Models*
*46:07 What Makes LLMs Slow*
*50:44 Making LLMs Fast*
*52:11 Continuous Batching*
*56:09 Performance Metrics*
*01:03:52 Simplifying Model Deployment*
*01:06:47 Simplifying Deployments with Replicate*
*01:09:31 Replicate Walkthrough*
*01:14:32 Cog-vLLM for Local Development*
*01:20:19 Predibase’s History*
*01:24:44 LoRAX Motivation and Idea*
*01:29:54 Issues with Merging Adapters*
*01:32:21 Challenges with QLoRA*
*01:34:53 Dequantizing QLoRA Weights*
*01:35:48 Deployment Considerations*
*01:42:53 Speculative Decoding*
*01:47:10 Throughput vs. Latency*
*01:55:17 Improving Latency and Throughput*
*02:02:38 Deploying on Modal*
*02:07:44 Modal Demo*
*02:12:55 LLM Demo on Modal*
*02:19:29 OpenAI-Compatible Endpoint Demo on Modal*
*02:23:37 Q&A Session*
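
If you want to try an endpoint like the one in the 02:19:29 demo, here is a minimal client sketch for querying an OpenAI-compatible server with the official `openai` Python package. The base URL, API key, and model name below are placeholders, not the values from the video — substitute whatever your own deployment prints.

```python
# Minimal sketch: call an OpenAI-compatible endpoint (e.g., one served from
# Modal, as in the 02:19:29 demo). base_url, api_key, and model are
# placeholders for your own deployment's values.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace--your-app.modal.run/v1",  # hypothetical URL
    api_key="placeholder",  # many self-hosted servers accept any non-empty key
)

response = client.chat.completions.create(
    model="your-fine-tuned-model",  # placeholder model id
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```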