Try Voice Writer - speak your thoughts and let AI handle the grammar: https://voicewriter.io

The KV cache takes up the bulk of GPU memory during inference for large language models like GPT-4. Learn how the KV cache works in this video! For a rough sense of the numbers, see the sketch after the reading list below.

0:00 - Introduction
1:15 - Review of self-attention
4:07 - How the KV cache works
5:55 - Memory usage and example

Further reading:
* Speeding up the GPT - KV cache (https://www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-kvcache/)
* Transformer Inference Arithmetic (https://kipp.ly/transformer-inference-arithmetic/)
* Efficiently Scaling Transformer Inference (https://arxiv.org/pdf/2211.05102.pdf)
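
Here's a quick back-of-the-envelope estimate of KV cache size in Python. The config below is a hypothetical 7B-class model (32 layers, 32 heads of dimension 128), chosen for illustration - these are not numbers taken from the video:

# Hypothetical 7B-class config (illustrative assumption, not from the video)
N_LAYERS = 32
N_HEADS = 32
HEAD_DIM = 128
SEQ_LEN = 4096
BATCH = 1
BYTES_PER_ELEM = 2  # fp16

# Each layer caches two tensors (K and V), each of shape
# [batch, n_heads, seq_len, head_dim]
kv_bytes = 2 * N_LAYERS * N_HEADS * HEAD_DIM * SEQ_LEN * BATCH * BYTES_PER_ELEM
print(f"KV cache: {kv_bytes / 1e9:.1f} GB")  # ~2.1 GB at a 4096-token context

Note that the cache grows linearly with both sequence length and batch size, which is why long contexts and large batches dominate inference memory.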