What makes Stable Diffusion and DALL-E 2 so good?

Unpacking the intuition behind the state of the art in artificial intelligence and machine learning.

Why are diffusion models important?

The latest hotness in the ML/AI field is diffusion models. While the theory has been around for a few years at this point, the demos and subsequent public releases of DALL-E 2, Midjourney and Stable Diffusion in 2022 created quite a stir. No model/architecture in ML history has made such a rapid transition from technical conference papers to monetized products used by thousands of people. This happened because generative diffusion models beat all previous methods at producing extremely high-resolution, detailed images that almost pass for human art.

How do diffusion models work?

Diffusion models are generative models. During training, noise is added to clean images to create noisy input data, and the model learns to predict that noise and denoise the input to recover the clean images. During inference, a diffusion model generates new clean images starting from pure noise inputs.
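To make that concrete, here is a minimal sketch of one such training step in PyTorch. It assumes a DDPM-style setup; model (a network that takes a noisy batch and a timestep) and alpha_bar (the cumulative noise schedule) are placeholder names for illustration, not the exact components of any of the systems mentioned above.

```python
import torch
import torch.nn.functional as F

def training_step(model, clean_images, alpha_bar):
    """One DDPM-style training step: corrupt clean images with Gaussian noise
    at a random timestep, then train the model to predict that noise."""
    batch_size = clean_images.shape[0]
    num_timesteps = alpha_bar.shape[0]
    t = torch.randint(0, num_timesteps, (batch_size,), device=clean_images.device)

    noise = torch.randn_like(clean_images)            # epsilon ~ N(0, I)
    a_bar = alpha_bar[t].view(batch_size, 1, 1, 1)    # cumulative noise level at step t
    noisy_images = a_bar.sqrt() * clean_images + (1 - a_bar).sqrt() * noise

    predicted_noise = model(noisy_images, t)          # epsilon_theta(x_t, t)
    return F.mse_loss(predicted_noise, noise)         # how well was the noise predicted?
```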

I highly recommend this technical guide by Calvin Luo and this visual guide by Jay Alammar to understand more about how diffusion models work. The key novel aspects of recent diffusion models are:

a) There is no latent embedding space in diffusion models that holds the information of the images in a compressed vector representation.

b) Diffusion models turn a noisy image into a slightly less noisy one iteratively over many steps, instead of generating the clean image in a single step (see the sampling sketch after this list).

c) Diffusion models are trained using a prespecified Gaussian noise prior.

d) The model does not predict the clean image; rather, it predicts the noise that, when subtracted from the noisy image, would give the clean image.
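Putting b), c) and d) together, generation looks roughly like the sketch below: a simplified DDPM-style sampler in PyTorch, where model and the betas noise schedule are placeholders for illustration rather than the exact samplers used by DALL-E 2 or Stable Diffusion.

```python
import torch

@torch.no_grad()
def sample(model, shape, betas):
    """DDPM-style sampling sketch: start from pure Gaussian noise (point c)
    and repeatedly subtract a fraction of the predicted noise (points b and d)."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                                  # the prior: pure N(0, I) noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                             # predicted noise at this step
        # One small denoising step: remove part of the predicted noise.
        x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn(shape)    # re-inject a little fresh noise
    return x                                                # an (approximately) clean image
```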

Intuition: why is diffusion so powerful?

c) and d).

The differences a) and b) mentioned above definitely have a part to play, but in principle both of those are compatible with earlier generative models such as GANs or VAEs. The secret sauce of the diffusion model lies in the two inter-related points c) and d), so let's explore them further.

Diffusion models are trained using a prespecified (Gaussian) noise prior.

In VAEs, the information needed to generate a new image comes from two places: the embedding vector and the decoder model. The problem is that the vector space and the decoder are constrained independently, and their combination is only constrained in the neighborhood of the training examples; the rest of the space is unconstrained. Moreover, training with the same data for more steps is useless because the constraints remain the same. During image generation, if the input embedding vector is near a training example, the decoder can generate a good image; but move the embedding vector further away and the output images become hazy and vague. Therefore VAEs cannot produce high-quality images that lie far from the training set.

In contrast, in diffusion models, all the information for the generative process comes from the model. The Gaussian noise prior carries no information; it is simply a random seed for the generative step of the model. During training, the model is forced to learn that any random Gaussian sample must map to the same distribution as the training data. The randomness (i.e. lack of constraint) of the prior forces the model itself to be extremely constrained. Moreover, training for longer keeps sampling new Gaussian priors which must again map to good images, thereby constraining the model further. Of course, overtraining can diminish the model's ability to create images outside the training set.
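In equation form, this is the standard simplified training objective used by DDPM-style models, where x_0 is a clean training image, ᾱ_t is the cumulative noise schedule and ε_θ is the network:

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L} = \mathbb{E}_{x_0,\, t,\, \epsilon}\left[ \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2 \right]
```

Every training step draws a fresh ε and t, which is exactly the "keeps sampling new Gaussian priors" constraint described above.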

Now you might think: don't GANs also use a random prior? That's exactly right, and that is why GANs also outperform VAEs. The reason diffusion models beat GANs is the second point.

The model does not predict the clean image; it predicts the noise that, when subtracted from the noisy image, would give the clean image.

From a theoretical perspective, given a noisy image (noise + clean image), predicting the clean image is equivalent to predicting the noise; either way the model has to separate the input into the same two components. In practice, however, the two are very different.
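Concretely, with x_t defined as in the objective above, the two targets are related by a simple rearrangement, so either prediction can be recovered from the other:

```latex
\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\hat{\epsilon}_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
```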

The pixel values of images are not evenly distributed. Every image has a different distribution of colors, almost always skewed toward a few colors. Predicting an image requires the model to retain and propagate this non-normal distribution, which changes with every image. The noise, on the other hand, is a perfectly well-behaved Gaussian centered at 0. Predicting this noise effectively normalizes the output for all images. This has the same effect as normalizing the input: the losses are more stable and converge faster, the model has better coverage over the outliers of the distribution, and the architecture can spend more of its capacity on higher-resolution features and diversity. This is why diffusion models produce better-quality images with greater diversity than GANs.
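Here is a tiny, self-contained illustration of that point in PyTorch. The "images" are random stand-ins with an artificial color skew, purely to show the statistics of the two possible prediction targets:

```python
import torch

# Stand-in "images" in roughly [-1, 1] with an artificial color skew; real
# photos have their own image-dependent skew, which is the point made above.
images = torch.rand(8, 3, 64, 64) * 2 - 1
images[:, 0] += 0.5                      # e.g. a red-heavy batch

alpha_bar_t = torch.tensor(0.5)          # some intermediate noise level
noise = torch.randn_like(images)         # epsilon ~ N(0, I)
noisy = alpha_bar_t.sqrt() * images + (1 - alpha_bar_t).sqrt() * noise  # model input

# The two possible prediction targets:
print("clean-image target mean/std:", images.mean().item(), images.std().item())
print("noise target       mean/std:", noise.mean().item(), noise.std().item())
# The noise target is ~N(0, 1) for every image and every noise level, while the
# clean-image target's statistics shift from image to image and batch to batch.
```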

Of course, GANs have other practical problems, such as unstable training objectives, which make them even less desirable; but here I am assuming we are able to get those working and can train a very good GAN.

-------------------------------------------------------------------------------------

There you have it, my intuitions for why diffusion models are the reigning champions of image generation. For more thoughts on AI/ML, subscribe below and follow me on Twitter.