➡️ Lifetime access to ADVANCED-inference Repo (incl. future additions): https://trelis.com/ADVANCED-inference/
➡️ FineTuneHost.com Waiting List: This has now been closed and the hosted service has been discontinued. Fill out a consulting inquiry on Trelis.com for support.
➡️ Thumbnail made with this tutorial: https://youtu.be/ThKYjTdkyP8
OTHER TRELIS LINKS:
➡️ Trelis Newsletter: https://blog.Trelis.com
➡️ Other Products from Trelis: https://Trelis.com/
VIDEO LINKS:
- Slides: https://docs.google.com/presentation/d/1KDyawYmtzV9zEh2L3_xq0sE_Mecyz-G_CEffkR4ZmRA/edit?usp=sharing
- BASIC-inference repo: https://github.com/trelisresearch/basic-inference
- vLLM LoRA docs: https://docs.vllm.ai/en/latest/usage/lora.html
- LoRAX: https://github.com/predibase/lorax
TIMESTAMPS:
00:00 - Introduction to serving multiple models on one GPU
00:15 - Overview of using LoRA adapters as clip-ons
00:53 - Video structure overview
01:08 - Theory of LoRA for inference
02:09 - Explanation of LoRA (Low-Rank Adaptation)
04:00 - Benefits of using LoRA for training
05:10 - Practical implementation of LoRA loading
06:20 - GPU VRAM and model loading explanation
08:51 - Managing adapter downloads and storage
10:30 - Basic LoRAX implementation
14:40 - Setting up the environment
19:20 - Running inference with LoRAX
23:40 - Setting up SSH connection for Runpod
29:17 - Advanced vLLM implementation
34:40 - Building the proxy server
39:40 - Redis implementation for adapter management
44:20 - Starting the server
48:40 - Testing the service
52:54 - FineTuneHost.com service demonstration
56:17 - Conclusion and resource overview