Distributed ML Talk @ UC Berkeley

Sourish Kundu · 7,880 views · 3 months ago

Here's a talk I gave to the Machine Learning @ Berkeley club! We discuss various parallelism strategies used in industry when training large ML models at scale. The animations are from my previous YouTube video about parallelism strategies: https://www.youtube.com/watch?v=xkH8shGffRU
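If you want to try the simplest of these strategies yourself, here's a rough sketch of data parallelism using PyTorch's DistributedDataParallel (the DDP part of the talk). The model, data, and hyperparameters below are just placeholders for illustration, not code from the talk:

```python
# Minimal DDP sketch: every rank holds a full model replica, sees a different
# slice of data, and gradients are averaged across ranks via NCCL all-reduce.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model; each rank keeps a complete copy of the parameters.
    model = nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(10):
        # Each rank trains on its own shard of the batch (synthetic data here).
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randn(32, 1024, device=local_rank)

        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()   # DDP all-reduces gradients across ranks during backward
        optimizer.step()  # every replica applies the same averaged update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launch it with something like `torchrun --nproc_per_node=<num_gpus> ddp_sketch.py`; torchrun supplies the rank environment variables the script reads.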

Papers & Resources Mentioned in Talk:
Breadth-First Pipeline Parallelism: https://arxiv.org/abs/2211.05953
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs: https://arxiv.org/abs/2402.15627
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism: https://arxiv.org/abs/1909.08053
GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism: https://arxiv.org/abs/1811.06965
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models: https://arxiv.org/abs/1910.02054
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel: https://arxiv.org/abs/2304.11277
NeRF-XL: Scaling NeRFs with Multiple GPUs: https://research.nvidia.com/labs/toronto-ai/nerfxl/
Sasha Rush's LLM Training Puzzle: https://github.com/srush/LLM-Training-Puzzles
Diffusion Models Are Real-Time Game Engines: https://gamengen.github.io/

*Please comment down below if I missed any!*

Timestamps:
0:00 - Introduction
0:39 - Scaling Dimensions
2:22 - About Me
3:19 - The GPU & Brief History Overview
6:22 - Matrix Multiplication
8:37 - Motivation for Parallelism
9:55 - Review of Basic Training Loop
11:05 - Data Parallelism
12:52 - NCCL
15:48 - Pipeline Parallelism
19:04 - Tensor Parallelism
20:46 - Back to DDP
22:13 - Adam Optimizer Review
23:11 - FSDP
26:11 - DeepSpeed
27:39 - Next Steps
29:30 - Galvatron Paper
30:24 - More Papers
32:40 - Orthogonal Optimizations
36:40 - How to Stay in Touch
36:55 - Questions
51:47 - Thank You!
