Quantizing LLMs - How & Why (8-Bit, 4-Bit, GGUF & More)
Quantizing models for maximum efficiency gains!
Resources:
Model Quantized: https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-15k
Quantization Colab Notebook: https://colab.research.google.com/drive/1NlHlHU-fdubXcuZ08eb7zpaidF7388r6?usp=sharing
HF 8-Bit Blog: https://huggingface.co/blog/hf-bitsandbytes-integration
HF 4-Bit Blog: https://huggingface.co/blog/4bit-transformers-bitsandbytes
GGUF Overview: https://huggingface.co/docs/hub/gguf
Llama.cpp: https://github.com/ggerganov/llama.cpp/tree/master
GGUF Model Made in Video: https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-15k-Q4_K_M-GGUF
Maxime Labonne Quantization Blog: https://mlabonne.github.io/blog/posts/Introduction_to_Weight_Quantization.html
Chapters:
00:00 - What Is Quantization?
02:19 - How Are Weights Stored?
03:22 - What Is Binary?
06:26 - What Are Floating Point Numbers?
10:38 - What Data Types Are Used for LLMs?
12:02 - Does Quantization Negatively Affect LLMs?
15:08 - Code: Quantizing with BitsAndBytes
17:34 - Code: Comparing Quantized Layers
18:36 - Code: Comparing Text Generation
21:57 - Code: GGUF Quantization Overview
23:41 - Code: Quantizing with Llama.cpp
25:44 - Final Thoughts on Quantization
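The core idea behind the chapters above (mapping float weights onto a small integer range) can be sketched with absmax quantization, the same introductory scheme covered in Maxime Labonne's blog linked in the resources. This is a minimal pure-Python illustration, not the notebook's actual code; the weight values are made up for the example:

```python
def absmax_quantize(weights):
    """Map float weights to int8 values via absmax scaling."""
    # The largest-magnitude weight maps to +/-127, the int8 extremes
    scale = 127 / max(abs(w) for w in weights)
    return [round(scale * w) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q / scale for q in quantized]

# Toy weights standing in for one row of a real weight matrix
weights = [0.5, -1.2, 0.03, 0.9]
quantized, scale = absmax_quantize(weights)
recovered = dequantize(quantized, scale)

print(quantized)  # small integers in the int8 range
# Round-trip error stays within half a quantization step (0.5 / scale)
print(max(abs(w - r) for w, r in zip(weights, recovered)))
```

Real libraries refine this basic recipe: bitsandbytes' 8-bit scheme, for example, quantizes vector-wise and keeps outlier features in higher precision, which is why quality holds up better than a naive whole-tensor absmax would suggest.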
#ai #coding #deeplearning