Diffusion models are a key innovation with far-reaching impacts across machine learning, being the technology behind OpenAI's DALL-E and Sora, Google's Imagen, Stable Diffusion, Midjourney, and more. Developed initially for image generation, they only seem to be increasing in importance, as recent releases such as NVIDIA's Cosmos, Physical Intelligence's Pi Zero, and Google's Veo all build on the same diffusion principles.
Despite their pivotal role in modern AI, I had never found an explanation of how diffusion models work that helped me really understand them. This is the video I always wished I had: I present a different view of diffusion models as a gradient descent-like algorithm that stochastically optimizes for some notion of image quality, based on the "score matching" view of diffusion models (https://arxiv.org/pdf/1907.05600).
With this framing, diffusion models can be understood as treating image generation as an optimization problem and solving it with gradient descent, but *at test time* instead of train time. Given how useful gradient-based optimization is at train time (we use gradient descent for *literally* everything in deep learning), diffusion models elegantly demonstrate its power and utility at test time.
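To make the "gradient descent at test time" framing concrete, here is a minimal sketch of annealed Langevin sampling in the spirit of the score-matching paper linked above. The `score` function here is a hypothetical stand-in (a real diffusion model would be a trained neural network); it simply pulls pixels toward a fixed target to mimic the gradient of log-density:

```python
import numpy as np

def score(x, sigma):
    # Hypothetical stand-in for a learned score network: pulls x toward a
    # single "clean image" target, mimicking grad_x log p(x) near that mode.
    target = np.full_like(x, 0.5)
    return (target - x) / sigma**2

def langevin_sample(shape, sigmas, steps_per_sigma=50, step_size=0.2, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)       # start from pure noise
    for sigma in sigmas:                 # anneal the noise level high -> low
        eps = step_size * sigma**2       # scale the step to the noise level
        for _ in range(steps_per_sigma):
            noise = rng.standard_normal(shape)
            # Gradient ascent on log-density, plus injected noise:
            # this is the stochastic, test-time "optimization" step.
            x = x + 0.5 * eps * score(x, sigma) + np.sqrt(eps) * noise
    return x

sample = langevin_sample((8, 8), sigmas=[1.0, 0.5, 0.1])
```

With the annealed noise schedule, the sample starts as pure noise and is iteratively nudged "uphill" on image quality until it settles near a plausible image, which is exactly the navigation of image space described in the video.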
Timestamps:
0:00 - Intro/Recap/How you usually learn about diffusion models
2:37 - Intro to image space (where images live)
4:02 - Locations in image space are different possible images
5:08 - The structure of image space: sparseness and clustering
7:28 - Diffusion models as navigators of image space
8:43 - The real meaning of the diffusion model forward pass
11:12 - How diffusion models decide what image to generate
12:50 - Connections to probabilistic models
14:47 - Image generation as optimization problems, solvable using gradient descent
15:46 - Training diffusion models
16:34 - Geometric intuition of the noising/forward diffusion process
17:05 - Creating training data for diffusion models
18:01 - Diffusion models learn a "vector field" over image space
18:44 - Analogies, similarities, and differences with image classification
21:10 - Recap and key takeaways
22:56 - What's next
This video is designed to ease AI practitioners who are familiar with more common machine learning paradigms into the world of probabilistic models, while giving non-technical viewers a glimpse into the key computational techniques that have driven the recent advances in image generation quality. As such, I have kept it light on the math while still imparting the key intuitions and assumptions behind what makes diffusion models work so well.
#generativeai #sora #neuralnetworks #deeplearning #flux #diffusion