Speculative Decoding: When Two LLMs are Faster than One

Efficient NLP · 21,112 views · 2 years ago

Try Voice Writer - speak your thoughts and let AI handle the grammar: https://voicewriter.io

Speculative decoding (or speculative sampling) is a technique where a smaller LLM (the draft model) generates the easier tokens, which are then verified by a larger one (the target model). This makes generation faster without sacrificing accuracy.

0:00 - Introduction
1:00 - Main Ideas
2:27 - Algorithm
4:48 - Rejection Sampling
7:52 - Why sample (q(x) - p(x))+
10:55 - Visualization and Results

DeepMind Paper: https://arxiv.org/abs/2302.01318
Google Paper: https://arxiv.org/abs/2211.17192
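The accept/reject step described in the video (rejection sampling, then resampling from the residual (q(x) - p(x))+ distribution) can be sketched as follows. This is a minimal illustration, not the papers' full implementation: `p_draft` and `q_target` stand in for the draft and target models' next-token distributions (p = draft, q = target, matching the video's notation), and the function name is hypothetical.

```python
import numpy as np

def speculative_accept_step(p_draft, q_target, x, rng):
    """One accept/reject step of speculative sampling.

    p_draft, q_target: probability vectors over the vocabulary
    (p = draft model, q = target model).
    x: token index proposed by the draft model (sampled from p_draft).
    Returns the token index that is actually emitted.
    """
    # Accept the draft token with probability min(1, q(x) / p(x)).
    if rng.random() < min(1.0, q_target[x] / p_draft[x]):
        return x
    # On rejection, resample from the residual distribution
    # proportional to (q(x) - p(x))+ = max(q(x) - p(x), 0).
    # This correction makes the overall output distribution
    # exactly equal to q, so accuracy is not sacrificed.
    residual = np.maximum(q_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual)
```

Running this step many times with draft proposals drawn from `p_draft` produces samples distributed according to `q_target`, which is why the target model's quality is preserved even though the draft model proposes the tokens.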
