Stable Diffusion Masterclass - Theory, Code & Application

Deep Learning Course - Level: Advanced

Intuitive Intro to Image Generation with Latent Diffusion Models



We now have a general idea about what latent diffusion models are and the various components involved to generate an image.

Before we get into the technical details for how exactly this image generation occurs, let's first go through an intuitive introduction to understand what's going on to generate images in this way.

Motivation for Image Generation Using a Classifier

Let's consider an image classifier. As we know, a classifier accepts an image as input and outputs a probability for each category in a set, indicating how likely the image is to belong to that category.

Suppose we're dealing with a grayscale image dataset of gym equipment. If we have a classifier that has been trained to take an image as input and output the probability that it's gym equipment, then we can use this model to generate images of gym equipment. How?

The upcoming explanation is inspired by a lecture given by Jeremy Howard and provides a very intuitive motivation for understanding what's going on when a diffusion model generates an image.

Suppose we pass a 28x28 noisy image to the model, and it outputs a very low probability that the image is a piece of gym equipment.

We could then iterate over each of the 784 pixels in this image, making each pixel either a little bit darker or lighter each time, and observe how that affects the model's output. Does it make the probability of the image being a piece of gym equipment higher or lower?

By observing the output probability after each adjustment to each pixel, we can find the direction in which to incrementally move each pixel so that the model keeps increasing the probability it assigns to the image being a piece of gym equipment.
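The pixel-by-pixel search described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `gym_probability` is a hypothetical stand-in for a trained classifier (it simply scores how close the image is to a fixed bar-shaped template), and the `step` and `eps` values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained classifier: it scores an image by how
# closely it matches a fixed template (a bright horizontal bar).
template = np.zeros((28, 28))
template[12:16, 4:24] = 1.0

def gym_probability(image):
    """Return a pseudo-probability in (0, 1]; higher means closer to the template."""
    return float(np.exp(-np.mean((image - template) ** 2)))

def refine(image, step=0.1, eps=1e-3):
    """Nudge every pixel once in the direction that raises the probability."""
    base = gym_probability(image)
    direction = np.zeros_like(image)
    for idx in np.ndindex(*image.shape):   # visit all 784 pixels
        probe = image.copy()
        probe[idx] += eps                  # make this one pixel slightly lighter
        # +1 if lighter helped, -1 if it hurt, 0 if it made no difference
        direction[idx] = np.sign(gym_probability(probe) - base)
    return np.clip(image + step * direction, 0.0, 1.0)

image = rng.random((28, 28))               # start from a noisy image
p_start = gym_probability(image)
for _ in range(10):                        # repeat the refinement a few times
    image = refine(image)
p_end = gym_probability(image)
```

After the loop, `p_end` is higher than `p_start`, and `image` has drifted from noise toward the shape the stand-in classifier rewards.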

We're essentially incrementally refining the noisy image until it becomes what actually appears to be a piece of gym equipment. Now, rather than us manually doing this image refining, consider having a network learn in which direction and by how much to move the pixels.

This concept is similar to updating the weights in a neural network until we get the desired output. In this case, we're having a network learn how to update the data (pixel values) until we reach a desirable output.

Using a Neural Network to Predict Noise

Our goal now is to create and train a neural network that will learn which pixels to update, and in which direction, in order to make a supplied noisy image look more like a piece of gym equipment.

Per usual, first we need to obtain a training set consisting of images of gym equipment. We'll then add various amounts of noise to the images. The amount of noise added to any given image will vary from being just a slight amount to being so much that it only looks like random noise.
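As a minimal sketch of this noising step, the snippet below adds Gaussian noise of different strengths to a single image. The function name `add_noise`, the `sigma` parameter, and the specific noise levels are assumptions made for illustration, and the bar-shaped image is a placeholder for a real training image.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(image, sigma):
    """Return a noisy copy of the image along with the noise that was added.

    sigma is a hypothetical knob for the amount of noise: small values barely
    change the image, large values bury it almost completely.
    """
    noise = sigma * rng.standard_normal(image.shape)
    return image + noise, noise

# Placeholder "clean" training image: a bright horizontal bar.
clean = np.zeros((28, 28))
clean[12:16, 4:24] = 1.0

noisy_versions = {}
for sigma in (0.1, 0.5, 3.0):              # slight, moderate, and heavy noise
    noisy, noise = add_noise(clean, sigma)
    noisy_versions[sigma] = (noisy, noise)
```

Keeping the added noise alongside each noisy image matters, since that noise is what the network will later be asked to predict.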

We'll pass these noisy images as input to the network for training. As output, we want the network to predict the noise that is present in the image.

We then train the network in the usual way: we calculate the loss between the noise predicted by the network and the actual noise present in the input image, calculate the gradient of the loss with respect to the network's weights, and update the weights using those gradients. Repeating this process iteratively allows the network to get better and better at predicting the noise.
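The training loop just described can be sketched as follows. To keep it short and runnable, a single linear layer stands in for the real network, the images are tiny 8x8 bars produced by a hypothetical `sample_clean` helper, and the learning rate and step count are arbitrary. None of these choices come from Stable Diffusion itself.

```python
import numpy as np

rng = np.random.default_rng(0)
SIZE = 8                       # tiny 8x8 images keep the sketch fast
DIM = SIZE * SIZE
BATCH = 128

def sample_clean(batch):
    """Hypothetical training set: one bright horizontal bar per image."""
    images = np.zeros((batch, SIZE, SIZE))
    rows = rng.integers(0, SIZE, size=batch)
    images[np.arange(batch), rows, :] = 1.0
    return images.reshape(batch, DIM)

# A single linear layer stands in for the network (assumption, not a U-Net).
W = np.zeros((DIM, DIM))
b = np.zeros(DIM)
lr = 0.05
losses = []

for step in range(500):
    clean = sample_clean(BATCH)
    sigma = rng.uniform(0.1, 1.0, size=(BATCH, 1))    # varying noise amounts
    noise = sigma * rng.standard_normal(clean.shape)  # the actual noise
    noisy = clean + noise                             # input to the network

    predicted = noisy @ W + b                         # predicted noise
    error = predicted - noise
    # Squared error summed over pixels, averaged over the batch.
    losses.append(float(np.mean(np.sum(error ** 2, axis=1))))

    grad_out = 2.0 * error / BATCH                    # d(loss)/d(predicted)
    W -= lr * (noisy.T @ grad_out)                    # update the weights
    b -= lr * grad_out.sum(axis=0)                    # update the bias
```

The values in `losses` shrink as training proceeds, which is the "getting better and better at predicting the noise" behavior in miniature.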

If the network can accurately predict the noise in a given image, then we can just subtract this noise from the original noisy image, and we have a clear image.

Once we have this trained network, we can pass it random noise with no actual image "hidden" underneath during inference. This network has been trained to identify noise present in images of gym equipment, so during inference, it will output what it thinks is noise in the random noise image, leaving behind pixels in the noisy image that ultimately appear as a piece of gym equipment. Once we subtract the predicted noise from the original noise, we're left with a new, generated image of gym equipment.
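To make the inference step concrete, here is a toy sketch. The function `predict_noise` is a hand-written stand-in for a trained network, not a real one: it assumes the "training set" consists only of horizontal bars, treats the closest bar as the hidden picture, and reports everything else in the input as noise.

```python
import numpy as np

rng = np.random.default_rng(0)
SIZE = 8

# Hypothetical "training set": every image is one bright horizontal bar.
training_images = np.zeros((SIZE, SIZE, SIZE))
for row in range(SIZE):
    training_images[row, row, :] = 1.0

def predict_noise(noisy):
    """Stand-in for a trained network (assumption: not a real U-Net).

    It treats the closest training image as the "hidden" picture and
    reports everything else in the input as noise.
    """
    distances = np.sum((training_images - noisy) ** 2, axis=(1, 2))
    closest = training_images[np.argmin(distances)]
    return noisy - closest

random_noise = rng.standard_normal((SIZE, SIZE))   # no image hidden underneath
generated = random_noise - predict_noise(random_noise)
```

Even though `random_noise` never contained a bar, `generated` comes out as one, because the predictor only knows how to explain its input as "bar plus noise".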

In practice, the type of network we use to do this process is called a U-Net, which is the main network component of Stable Diffusion. We'll expand more on U-Net itself, along with its network architecture in a later lesson.

Other Components and Preprocessing

Now we have an intuitive idea for how diffusion models are trained to generate images. If we train the model on noisy images of gym equipment, for example, the end goal is that the model will ultimately be able to accept an image of random noise, and generate an image of a piece of gym equipment from it.

In practice, however, the datasets of images that popular diffusion models are trained on consist of many different categories. So, how do we guide the model into generating the image that we want, rather than just a random image from any of the categories on which it has been trained?

We do this by passing a text prompt to the model, along with an image consisting of random noise. This will "guide" the model into creating an image from the random noise of what is described in the text. We use a pre-trained text encoder to encode the text into embeddings, which we'll learn more about later.

Additionally, in practice, the image is compressed before being passed to the model through the encoder portion of an autoencoder. Recall, these compressed images are referred to as latents.

The model will then output a compressed (latent) version of the output image, which will then be decoded using the decoder part of the autoencoder to enlarge it to a full-size image. Again, this process will be elaborated on in a future lesson.
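The compress-then-enlarge round trip can be illustrated with a toy example. The real autoencoder is a learned network; the `encode` and `decode` functions below are hypothetical stand-ins that only average and repeat pixels, and the compression `FACTOR` is an arbitrary choice, so the sketch shows nothing more than how the shapes change.

```python
import numpy as np

FACTOR = 4  # hypothetical compression factor along each side

def encode(image):
    """Toy encoder: average each FACTOR x FACTOR patch into one latent value."""
    h, w = image.shape
    patches = image.reshape(h // FACTOR, FACTOR, w // FACTOR, FACTOR)
    return patches.mean(axis=(1, 3))

def decode(latent):
    """Toy decoder: enlarge the latent to full size by repeating each value."""
    return np.repeat(np.repeat(latent, FACTOR, axis=0), FACTOR, axis=1)

image = np.random.default_rng(0).random((64, 64))
latent = encode(image)        # compressed representation, shape (16, 16)
restored = decode(latent)     # back to full size, shape (64, 64)
```

The model in between only ever works on arrays the size of `latent`, which is much cheaper than working on full-size images.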

We should now have an intuitive understanding of how training a model to predict the noise in noisy images ultimately leads to being able to generate an image from random noise.

