GPTQ Quantization EXPLAINED

Oscar Savolainen · 1,324 views · 6 months ago

If you need help with anything quantization- or ML-related (e.g. debugging code), feel free to book a 30-minute consultation session! https://calendly.com/oscar-savolainen

I'm also available for long-term freelance work, e.g. for training / productionizing models, teaching AI concepts, etc.

*Video Summary:*
In this video we go over the GPTQ neural network quantization paper and explain it in full: the theory behind the use of the Hessian, the derived equations, the GPU optimizations, and the use of Cholesky decomposition.

*Links:*
GPTQ paper: https://arxiv.org/abs/2210.17323
Intro to quantization: https://www.youtube.com/watch?v=rzMs-wKQU_U
Case of "power limited" GPUs: https://www.thonking.ai/p/strangely-matrix-multiplications
Video animations made with the help of the https://github.com/3b1b/manim library

*Timestamps:*
00:00 Intro
01:43 Motivation: why invent GPTQ in the first place?
05:47 History of papers
07:00 Basic idea of GPTQ
08:10 Small mistake in the GPTQ paper
10:27 Explanation of Hessian matrix
12:10 Taylor series
15:10 How to derive the pick-row and update-weights equations
18:37 Gaussian elimination
19:57 Original contributions of GPTQ - computational optimization
23:12 Computational bottlenecks and lazy batch updates
29:00 Cholesky decomposition
31:58 Conclusion