➡️ Get Lifetime Access to the ADVANCED-inference Repo (incl. the inference scripts in this video): https://trelis.com/ADVANCED-inference/
➡️ Runpod Affiliate Link: https://runpod.io?ref=jmfkcdio
➡️ One-click GPU templates: https://github.com/TrelisResearch/one-click-llms
ERRATA:
At 57:45: I likely copied the ID for the 70B (4xA40) instead of the 405B (4xH100), so the results shown for the 405B on 4xH100 are incorrect.
VIDEO RESOURCES:
- Slides: https://docs.google.com/presentation/d/1di6NmxfM3aPWVAqunnlmQNNxwRC0RdEe4b6APlelm7U/edit?usp=sharing
- SGLang: https://github.com/sgl-project/sglang
- vLLM: https://github.com/vllm-project/vllm
- TGI: https://github.com/huggingface/text-generation-inference
- Nvidia NIM: https://docs.nvidia.com/nim/large-language-models/latest/introduction.html
OTHER TRELIS LINKS:
➡️ Trelis Newsletter: https://blog.Trelis.com
➡️ Trelis Resources and Support: https://Trelis.com/About
TIMESTAMPS:
0:00 How to pick a GPU and software for inference
0:44 Video Overview
1:51 Effect of Quantization on Quality
9:23 Effect of Quantization on Speed
14:03 Effect of GPU bandwidth relative to model size
17:49 Effect of de-quantization on inference speed
19:57 Marlin Kernels, AWQ and Neural Magic
23:20 Inference Software - vLLM, TGI, SGLang, NIM
25:22 Deploying one-click templates for inference
33:52 Testing inference speed at batch sizes of 1 and 64
36:17 SGLang inference speed
37:55 vLLM inference speed
38:50 Text Generation Inference (TGI) speed
40:41 Nvidia NIM inference speed
42:13 Comparing vLLM, SGLang, TGI and NIM inference speed
43:13 Comparing inference costs for A40, A6000, A100 and H100
45:36 Inference Setup for Llama 3.1 70B and 405B
48:33 Running Llama 8B inference on A40, A6000, A100 and H100
51:10 Inference cost comparison for Llama 8B
52:33 Running Llama 70B and 405B inference on A40, A6000, A100 and H100
55:19 Inference cost comparison for Llama 70B and 405B
1:00:14 OpenAI GPT-4o inference costs versus Llama 3.1 8B, 70B, 405B
1:02:10 Final Inference Tips
1:03:50 Resources