Stable Diffusion Notes

The goal of any diffusion model is to create an image from random Gaussian noise, guided by its training and the input parameters.

Going from the latent space to the pixel space: Stable Diffusion works on numeric representations (vectors) of the inputs and training data rather than on raw pixels. The denoising step is repeated in this latent space until a final latent is produced, and the decoder half of an encoder-decoder model (an autoencoder) then converts that latent into the corresponding output in pixels.

Understanding Diffusion Models and U-Net Architecture

Notes from YouTube Video: "Understanding Diffusion Models" by rupertai

Introduction to Diffusion Models:
- Diffusion models are generative models that gradually add noise to training data over a series of steps and learn to reverse that process, so that new data can be generated starting from noise.
- Inspired by non-equilibrium thermodynamics, where a system's distribution evolves over time toward a simple equilibrium state (here, pure noise).
Forward Diffusion Process:
- Data (e.g., an image) is progressively noised over several time steps.
- Each step involves adding Gaussian noise, making the data increasingly random.
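
The forward process above can be sketched in a few lines of plain Python. This is a minimal sketch, assuming a linear variance schedule and the closed-form expression x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps from the DDPM formulation, where abar_t is the running product of (1 - beta_t). The schedule values and the names `make_alpha_bars` and `add_noise` are illustrative assumptions, not taken from the video.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Running product of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight from x_0 to x_t; returns the noised data and the noise used."""
    signal = math.sqrt(alpha_bars[t])        # how much of the data survives
    sigma = math.sqrt(1.0 - alpha_bars[t])   # how much Gaussian noise is mixed in
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + sigma * e for x, e in zip(x0, noise)]
    return xt, noise

rng = random.Random(0)
alpha_bars = make_alpha_bars()
x0 = [0.5, -0.25, 1.0, 0.0]                  # tiny stand-in for image pixels
x_early, _ = add_noise(x0, 10, alpha_bars, rng)    # still close to x0
x_late, _ = add_noise(x0, 999, alpha_bars, rng)    # almost pure noise
```

Because abar_t shrinks toward zero as t grows, late time steps keep almost none of the original signal, which is what "increasingly random" means in practice.
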
Reverse Diffusion Process:
- The goal is to denoise the data step-by-step, learning the reverse of the forward diffusion process.
- This process involves training a neural network to predict and remove noise from the data.
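
A single reverse step and the full sampling loop might look like the sketch below. It assumes the DDPM-style update, in which a network's noise prediction is subtracted from x_t and a small amount of fresh noise is added back (except on the last step). The predictor `dummy_predict_noise` is a hypothetical placeholder for a trained network, and the schedule values are assumptions, so the output here is not a meaningful sample.

```python
import math
import random

def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * t for t in range(num_steps)]

def cumulative_alpha_bars(betas):
    """Running product of (1 - beta_t)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def reverse_step(xt, t, betas, alpha_bars, predict_noise, rng):
    """One DDPM-style denoising step from x_t to x_{t-1}."""
    beta = betas[t]
    alpha = 1.0 - beta
    eps = predict_noise(xt, t)                      # network's noise estimate
    coef = beta / math.sqrt(1.0 - alpha_bars[t])
    mean = [(x - coef * e) / math.sqrt(alpha) for x, e in zip(xt, eps)]
    if t == 0:
        return mean                                 # no noise on the final step
    sigma = math.sqrt(beta)                         # one common variance choice
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

def sample(num_values, betas, predict_noise, rng):
    """Start from pure Gaussian noise and denoise step by step."""
    alpha_bars = cumulative_alpha_bars(betas)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_values)]
    for t in reversed(range(len(betas))):
        x = reverse_step(x, t, betas, alpha_bars, predict_noise, rng)
    return x

def dummy_predict_noise(xt, t):
    """Hypothetical placeholder; a real model would be a trained U-Net."""
    return [0.0 for _ in xt]

betas = linear_betas(num_steps=50)
result = sample(4, betas, dummy_predict_noise, random.Random(0))
```
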
Training the Model:
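- The model is trained using pairs of noised data and original data, learning to recover the original data from its noised version. In practice the network often predicts the noise that was added, from which the original data can be recovered.
- The loss function often used is Mean Squared Error (MSE) between the model's prediction and its target (the added noise, or the clean data, depending on what the network is set up to predict).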
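
One evaluation of the training loss can be sketched as follows. This assumes the noise-prediction variant of the objective: the data is noised to a given level, the model guesses the noise, and the MSE between guess and true noise is the loss. The function names, the noise level `alpha_bar_t = 0.5`, and both predictors are hypothetical stand-ins; a real setup would draw a random time step and update network weights by gradient descent.

```python
import math
import random

def mean_squared_error(predicted, target):
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

def noise_prediction_loss(x0, alpha_bar_t, predict_noise, rng):
    """Noise x0 to the level given by alpha_bar_t, then score the noise guess."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + sigma * e for x, e in zip(x0, noise)]
    return mean_squared_error(predict_noise(xt), noise)

x0 = [0.5, -0.25, 1.0, 0.0]   # tiny stand-in for an image
alpha_bar_t = 0.5             # hypothetical noise level for one time step

def zero_predictor(xt):
    """Stands in for an untrained model: always guesses 'no noise'."""
    return [0.0 for _ in xt]

def oracle_predictor(xt):
    """Cheats by using x0 directly; shows the loss a perfect model would reach."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [(v - signal * x) / sigma for v, x in zip(xt, x0)]

loss_untrained = noise_prediction_loss(x0, alpha_bar_t, zero_predictor, random.Random(0))
loss_oracle = noise_prediction_loss(x0, alpha_bar_t, oracle_predictor, random.Random(0))
```

The oracle case also illustrates why predicting the noise and predicting the original data are interchangeable: given x_t and one of the two, the other follows from the same formula.
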
Applications:
- Image generation: Creating high-quality images from random noise.
- Image inpainting: Filling in missing parts of an image.
- Super-resolution: Enhancing the resolution of images.

U-Net Architecture

The U-Net architecture is a type of convolutional neural network (CNN) originally designed for biomedical image segmentation. It has since been widely adopted in various fields, including diffusion models, due to its powerful ability to capture spatial information and perform detailed image generation tasks.

Key Features of U-Net:

Encoder-Decoder Structure:
- Encoder (Contraction Path): Series of convolutional layers with down-sampling (pooling) to capture context and reduce spatial dimensions.
- Decoder (Expansion Path): Series of up-sampling layers (e.g., transposed convolutions) to restore spatial dimensions and enable precise localization.
Skip Connections:
- Direct connections between corresponding layers in the encoder and decoder paths.
- These connections help retain high-resolution features from early layers, aiding in precise image reconstruction.
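
The data flow through the encoder, decoder, and skip connections can be illustrated with a toy example. This is only a shape-level sketch under strong simplifying assumptions: it works on 1-D signals, uses average pooling and nearest-neighbour up-sampling in place of learned convolutions and transposed convolutions, and has no trainable weights. The names `downsample`, `upsample`, and `toy_unet` are hypothetical.

```python
def downsample(channels):
    """Halve the spatial length of each channel with 2x average pooling."""
    return [[(c[i] + c[i + 1]) / 2 for i in range(0, len(c) - 1, 2)]
            for c in channels]

def upsample(channels):
    """Double the spatial length by repeating each value (nearest neighbour)."""
    return [[v for v in c for _ in range(2)] for c in channels]

def toy_unet(channels, depth=2):
    """Contract `depth` times, then expand, concatenating the saved features."""
    skips = []
    x = channels
    for _ in range(depth):          # encoder (contraction path)
        skips.append(x)             # remember the high-resolution features
        x = downsample(x)
    for _ in range(depth):          # decoder (expansion path)
        x = upsample(x)
        x = skips.pop() + x         # skip connection: concatenate channels
        # A real U-Net would apply convolutions here to mix the channels.
    return x

signal = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]]   # 1 channel, length 8
output = toy_unet(signal)
```

With one input channel of length 8 and a depth of 2, the output has three channels of length 8, and the first of them is the untouched input carried across by the outermost skip connection. That is the sense in which skip connections retain high-resolution detail that pooling would otherwise blur away.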