
Paper Reading: Small-scale proxies for large-scale Transformer training instabilities

vishal 31 views 1 week ago

In this video I heavily use Claude Projects to help me understand concepts as I read through the "Small-scale proxies for large-scale Transformer training instabilities" paper by Wortsman et al. The paper shows how qk-layernorm and z-loss regularization mitigate two training instabilities (attention logit growth and output logit divergence, respectively) and make training more robust across learning rates. The authors also show that this instability in a larger (4.8B parameter) model can be predicted by extrapolating how the max attention logit grows with model size, and that instability appears at lower learning rates as model size (especially depth) increases. Finally, they show that using a smaller AdamW epsilon keeps parameter updates from shrinking toward zero as gradient norms decrease.
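
For reference, here is a minimal sketch in PyTorch (not the paper's codebase) of the two mitigations discussed above: LayerNorm applied to the queries and keys before the attention logits are computed, and an auxiliary z-loss that penalizes the squared log of the softmax normalizer. The function names, the dummy tensor shapes, and the 1e-4 coefficient are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def qk_layernorm_attention(q, k, v):
    """Scaled dot-product attention with LayerNorm applied to the queries and
    keys before the logits are formed, which bounds attention logit growth."""
    q = F.layer_norm(q, q.shape[-1:])
    k = F.layer_norm(k, k.shape[-1:])
    logits = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(logits, dim=-1) @ v

def z_loss(logits, coeff=1e-4):
    """Auxiliary loss penalizing log(Z)^2, where Z is the softmax normalizer
    of the output logits, discouraging output logit divergence."""
    log_z = torch.logsumexp(logits, dim=-1)
    return coeff * (log_z ** 2).mean()

# Usage: add the z-loss to the usual cross-entropy on the output logits.
logits = torch.randn(8, 32000)               # (batch, vocab) -- dummy values
targets = torch.randint(0, 32000, (8,))
total_loss = F.cross_entropy(logits, targets) + z_loss(logits)
```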

Arxiv link: https://arxiv.org/abs/2309.14322
