Positional information is critical to transformers' understanding of sequences and to their ability to generalize beyond the training context length.
In this video, we discuss
- 1) Why the attention mechanism in transformers is not sufficient on its own
- 2) Earlier attempts at injecting positional information (e.g., sinusoidal positional encoding)
- 3) Rotary position embedding (RoPE; see the code sketch after this list), and
- 4) Techniques for long-context generalization and extension.
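If you prefer code, here is a minimal NumPy sketch of the two encoding schemes covered in points 2) and 3). It is illustrative only: the function names and toy dimensions are mine, not from the papers, and it omits batching, attention heads, and caching that a real implementation would have.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, dim):
    """Absolute positional encoding from "Attention Is All You Need":
    PE[pos, 2i] = sin(pos / 10000^(2i/dim)), PE[pos, 2i+1] = cos(...).
    The result is added to the token embeddings."""
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = positions * inv_freq                             # (seq_len, dim/2)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def rope(x, positions):
    """Rotary position embedding (RoFormer): rotate each consecutive pair
    of query/key dimensions by an angle proportional to the token position,
    so that the query-key dot product depends only on the relative offset."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = positions[:, None] * inv_freq                    # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]                       # pair up dimensions
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Toy check: with RoPE, the score between a rotated query and key depends
# only on their relative distance (offset 4 in both cases below).
q, k = np.random.randn(1, 8), np.random.randn(1, 8)
s1 = rope(q, np.array([3])) @ rope(k, np.array([7])).T
s2 = rope(q, np.array([13])) @ rope(k, np.array([17])).T
print(np.allclose(s1, s2))  # True
```

The final check illustrates the relative-position property discussed in the video: rotating both query and key and then taking their dot product is equivalent to rotating one of them by the position difference alone.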
Background on Transformer: https://www.youtube.com/watch?v=rcWMRA9E5RI
References:
- [Transformer] Attention Is All You Need
https://arxiv.org/abs/1706.03762
- [RoPE] RoFormer: Enhanced Transformer with Rotary Position Embedding
https://arxiv.org/abs/2104.09864
- [How is RoPE useful?] Round and Round We Go! What makes Rotary Positional Encodings useful?
https://arxiv.org/abs/2410.06205
- [Controlled study] A Controlled Study on Long Context Extension and Generalization in LLMs
https://arxiv.org/abs/2409.12181
Raw PowerPoint slides: https://www.dropbox.com/scl/fi/y43aw2v9aihe16nvy282o/storyboard.pptx?rlkey=kpbqxtlkw0kj8qgbqcg23a702&dl=0