Quantization vs Pruning vs Distillation: Optimizing NNs for Inference
Try Voice Writer - speak your thoughts and let AI handle the grammar: https://voicewriter.io
Four techniques for optimizing your model's inference speed (a minimal code sketch for each follows the timestamps):
0:38 - Quantization
5:59 - Pruning
9:48 - Knowledge Distillation
13:00 - Engineering Optimizations
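
Quantization - a minimal PyTorch sketch, assuming post-training dynamic quantization (the toy model and layer sizes are hypothetical stand-ins, not code from the video). Weights are stored as int8 and activations are quantized on the fly, so the matrix multiplies run in integer arithmetic:

import torch
import torch.nn as nn

# Toy model as a stand-in; any nn.Module with Linear layers works the same way.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10)).eval()

# Dynamic quantization: int8 weights, activations quantized at runtime.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

out = quantized(torch.randn(1, 256))  # same call signature as the original model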
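
Pruning - a sketch of magnitude-based unstructured pruning using PyTorch's built-in pruning utilities (the layer and the 30% sparsity level are illustrative choices, not from the video):

import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 128)

# Zero out the 30% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Fold the pruning mask into the weight tensor so the zeros become permanent.
prune.remove(layer, "weight")

Note that zeroed weights alone don't make dense kernels faster; you need structured pruning or a sparsity-aware runtime (e.g., SparseDNN from the references below) to turn sparsity into wall-clock speedup.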
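
Knowledge Distillation - a sketch of the standard distillation loss (Hinton-style soft targets); the temperature T and mixing weight alpha are hypothetical defaults, not values from the video:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft loss: student mimics the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard loss: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard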
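
Engineering Optimizations - one example of the kind of serving-side change that needs no model surgery (batching plus disabling autograd bookkeeping); the model here is the same toy stand-in as above, and this is only one of many options the video may cover:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10)).eval()

# Batch several incoming requests into one forward pass and skip gradient
# tracking entirely during inference.
requests = [torch.randn(256) for _ in range(8)]
with torch.inference_mode():
    outputs = model(torch.stack(requests))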
References:
LLM Inference Optimization blog post: https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
How to deploy your deep learning side project on a budget: https://luckytoilet.wordpress.com/2023/06/20/how-to-deploy-your-deep-learning-side-project-on-a-budget/
Efficient deep learning survey paper: https://arxiv.org/abs/2106.08962
SparseDNN: https://arxiv.org/abs/2101.07948