Positional information is critical to transformers' understanding of sequences and to their ability to generalize beyond the training context length.
In this video, we discuss
- 1) Why the attention mechanism in transformers is not sufficient on its own
- 2) Earlier attempts at injecting positional information (e.g., sinusoidal positional encoding)
- 3) Rotary position embedding (RoPE; see the code sketch after this list), and
- 4) Techniques for long-context generalization and extension.
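If you prefer code, here is a minimal NumPy sketch of the two encoding schemes covered in points 2) and 3). It is illustrative only: the function names and toy dimensions are mine, not from the papers, and it omits batching, attention heads, and caching that a real implementation would have.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, dim):
    """Absolute positional encoding from "Attention Is All You Need":
    PE[pos, 2i] = sin(pos / 10000^(2i/dim)), PE[pos, 2i+1] = cos(...).
    The result is added to the token embeddings."""
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = positions * inv_freq                             # (seq_len, dim/2)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def rope(x, positions):
    """Rotary position embedding (RoFormer): rotate each consecutive pair
    of query/key dimensions by an angle proportional to the token position,
    so that the query-key dot product depends only on the relative offset."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = positions[:, None] * inv_freq                    # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]                       # pair up dimensions
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Toy check: with RoPE, the score between a rotated query and key depends
# only on their relative distance (offset 4 in both cases below).
q, k = np.random.randn(1, 8), np.random.randn(1, 8)
s1 = rope(q, np.array([3])) @ rope(k, np.array([7])).T
s2 = rope(q, np.array([13])) @ rope(k, np.array([17])).T
print(np.allclose(s1, s2))  # True
```

The final check illustrates the relative-position property discussed in the video: rotating both query and key and then taking their dot product is equivalent to rotating one of them by the position difference alone.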
Background on Transformer: https://www.youtube.com/watch?v=rcWMRA9E5RI
References:
- [Transformer] Attention Is All You Need
https://arxiv.org/abs/1706.03762
- [RoPE] RoFormer: Enhanced Transformer with Rotary Position Embedding
https://arxiv.org/abs/2104.09864
- [How is RoPE useful?] Round and Round We Go! What makes Rotary Positional Encodings useful?
https://arxiv.org/abs/2410.06205
- [Controlled study] A Controlled Study on Long Context Extension and Generalization in LLMs
https://arxiv.org/abs/2409.12181
Raw PowerPoint slides: https://www.dropbox.com/scl/fi/y43aw2v9aihe16nvy282o/storyboard.pptx?rlkey=kpbqxtlkw0kj8qgbqcg23a702&dl=0