
Deploying Fine-Tuned Models

Hamel Husain · 2,367 views · 9 months ago

We will discuss inference servers, backends, and platforms like Replicate that you can host models on. This is lesson 4 of 4 in a course on applied fine-tuning:

1. When & Why to Fine-Tune: https://youtu.be/cPn0nHFsvFg
2. Fine-Tuning w/Axolotl: https://youtu.be/mmsa4wDsiy0
3. Instrumenting & Evaluating LLMs: https://youtu.be/SnbGD677_u0
4. Deploying Fine-Tuned LLMs: https://youtu.be/GzEcyBykkdo

Chapters:

00:00 Overview
01:24 Recap on LoRAs
06:28 Performance vs. Cost
10:18 Many Projects Are Not Real-Time
13:56 Exploring LoRA Training Directory and Pushing to HF Hub
15:15 HuggingFace Inference Endpoints Demo
18:30 Considerations When Deploying Models
20:25 Simple vs. Advanced Model Serving
22:04 Kinds of Model Serving
26:20 Honeycomb Example on Replicate
31:04 Honeycomb Example Code Walkthrough
41:33 Deploying Language Models
46:07 What Makes LLMs Slow
50:44 Making LLMs Fast
52:11 Continuous Batching
56:09 Performance Metrics
01:03:52 Simplifying Model Deployment
01:06:47 Simplifying Deployments with Replicate
01:09:31 Replicate Walkthrough
01:14:32 Cog-vLLM for Local Development
01:20:19 Predibase's History
01:24:44 LoRAX Motivation and Idea
01:29:54 Issues with Merging Adapters
01:32:21 Challenges with QLoRA
01:34:53 Dequantizing QLoRA Weights
01:35:48 Deployment Considerations
01:42:53 Speculative Decoding
01:47:10 Throughput vs. Latency
01:55:17 Improving Latency and Throughput
02:02:38 Deploying on Modal
02:07:44 Modal Demo
02:12:55 LLM Demo on Modal
02:19:29 OpenAI-Compatible Endpoint Demo on Modal
02:23:37 Q&A Session