Project 5: Fun With Diffusion Models


Yuqin Jiao

Part A: The Power of Diffusion Models

Overview for Part A

In Part A, I use a pre-trained UNet (DeepFloyd IF), implement diffusion sampling loops, and apply them to other tasks such as inpainting and creating optical illusions.

Part 0: Setup

For Part 0, I set the random seed to 180 to ensure consistent results across runs. Using the provided prompts — "an oil painting of a snowy mountain village," "a man wearing a hat," and "a rocket ship" — I sampled images from both stage 1 (64x64 resolution) and stage 2 (256x256 resolution). For each stage, I experimented with num_inference_steps set to 5, 20, and 40. The outputs varied in detail and prompt alignment: specifically, image quality improves as num_inference_steps increases, and the stage 2 outputs are generally higher quality than the stage 1 outputs.

[Figures] num_inference_steps = 5, stage 1 outputs: an oil painting of a snowy mountain village; a man wearing a hat; a rocket ship
[Figures] num_inference_steps = 5, stage 2 outputs: an oil painting of a snowy mountain village; a man wearing a hat; a rocket ship
[Figures] num_inference_steps = 20, stage 1 outputs: an oil painting of a snowy mountain village; a man wearing a hat; a rocket ship
[Figures] num_inference_steps = 20, stage 2 outputs: an oil painting of a snowy mountain village; a man wearing a hat; a rocket ship
[Figures] num_inference_steps = 40, stage 1 outputs: an oil painting of a snowy mountain village; a man wearing a hat; a rocket ship
[Figures] num_inference_steps = 40, stage 2 outputs: an oil painting of a snowy mountain village; a man wearing a hat; a rocket ship

Part 1: Sampling Loops

In Part 1, I wrote my own sampling loops that use the pretrained DeepFloyd denoisers.

Part 1.1: Implementing the Forward Process

I implemented the forward(im, t) function, which generates a noisy image x_t from a clean image x_0 by adding scaled Gaussian noise ϵ (generated with torch.randn_like()): x_t = sqrt(α_bar_t) * x_0 + sqrt(1 - α_bar_t) * ϵ. I tested the function on the provided Campanile image at noise levels t = [250, 500, 750]; the outputs show a clear increase in noise as t increases.
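
A minimal sketch of this function, assuming alphas_cumprod holds the scheduler's precomputed ᾱ values:

    import torch

    def forward(im: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
        # x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)
        alpha_bar_t = alphas_cumprod[t]
        eps = torch.randn_like(im)
        return alpha_bar_t.sqrt() * im + (1 - alpha_bar_t).sqrt() * eps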

[Figures] Berkeley Campanile (test image); Noisy Campanile at t=250; Noisy Campanile at t=500; Noisy Campanile at t=750

Part 1.2: Classical Denoising

I applied Gaussian blur filtering to remove noise from the noisy images at t=[250,500,750]. Using torchvision.transforms.functional.gaussian_blur, I experimented with kernel sizes and sigma values to produce denoised images. While the Gaussian blur reduced high-frequency noise, the results demonstrated the limitations of classical filtering, as it was unable to fully recover the details of the original image.
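
A sketch of this classical baseline (the kernel size and sigma below are illustrative values, not the exact ones I settled on):

    import torch
    import torchvision.transforms.functional as TF

    def gaussian_denoise(noisy_im: torch.Tensor, kernel_size: int = 5, sigma: float = 2.0) -> torch.Tensor:
        # Low-pass filter the noisy image; this suppresses high-frequency
        # noise but also blurs away real image detail.
        return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)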

[Figures] Noisy Campanile at t=250; Noisy Campanile at t=500; Noisy Campanile at t=750
[Figures] Gaussian Blur Denoising at t=250; Gaussian Blur Denoising at t=500; Gaussian Blur Denoising at t=750

Part 1.3: One-Step Denoising

I utilized the pretrained diffusion model (stage_1.unet) to denoise the noisy images. The UNet predicts the noise ϵ, which is then removed from the noisy image x_t to estimate the clean image via x_0 = (x_t - sqrt(1 - α_bar_t) * ϵ) / sqrt(α_bar_t). The results for t = [250, 500, 750] show a significant improvement in recovering the image, highlighting the superiority of the diffusion model over classical filtering. The text prompt "a high quality photo" was used to condition the UNet during denoising.
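
In code, the estimate is just the forward equation inverted (a sketch, reusing the alphas_cumprod schedule from above):

    import torch

    def one_step_denoise(x_t: torch.Tensor, t: int, eps_pred: torch.Tensor,
                         alphas_cumprod: torch.Tensor) -> torch.Tensor:
        # x_0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t),
        # where eps_pred is the UNet's noise estimate at timestep t.
        alpha_bar_t = alphas_cumprod[t]
        return (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()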

[Figures] Berkeley Campanile (original); Noisy Campanile at t=250; Noisy Campanile at t=500; Noisy Campanile at t=750
[Figures] Denoised estimate at t=250; Denoised estimate at t=500; Denoised estimate at t=750

Part 1.4: Iterative Denoising

I implemented iterative denoising using the pretrained diffusion model (stage_1.unet). The process starts with a highly noisy image at timestep t, and noise is progressively removed step by step using a custom schedule (strided_timesteps). At each step, the model estimates the noise and partially removes it, gradually recovering the original image. The text prompt was "a high quality photo." To reduce computational cost while maintaining quality, I skipped timesteps using a stride of 30. Compared to one-step denoising and Gaussian blur, iterative denoising produced significantly cleaner images, demonstrating the model's ability to handle high noise levels.
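
One step of the update, sketched with the DDPM posterior mean (the added variance term v_σ is omitted for brevity; x0_est is the current clean-image estimate from one-step denoising):

    import torch

    def iterative_denoise_step(x_t, t, t_next, x0_est, alphas_cumprod):
        # Move from timestep t to the less-noisy timestep t_next (t_next < t).
        abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        alpha_t = abar_t / abar_next        # effective per-step alpha
        beta_t = 1 - alpha_t
        return (abar_next.sqrt() * beta_t / (1 - abar_t)) * x0_est \
             + (alpha_t.sqrt() * (1 - abar_next) / (1 - abar_t)) * x_t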

[Figures] Noisy Campanile at t=90; t=240; t=390; t=540; t=690
[Figures] Original; Iteratively Denoised Campanile; One-Step Denoised Campanile; Gaussian Blurred Campanile

Part 1.5: Diffusion Model Sampling

I used the pretrained diffusion model (stage_1.unet) to generate images from random noise. By setting i_start = 0 in the iterative_denoise function, the model denoised pure noise step by step. Five images were generated with the text prompt "a high quality photo" and are shown below.
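
Usage is just the Part 1.4 loop started from pure noise (the iterative_denoise signature here is an assumed simplification of my actual function):

    import torch

    x_T = torch.randn(1, 3, 64, 64)             # pure Gaussian noise
    sample = iterative_denoise(x_T, i_start=0)  # prompt: "a high quality photo"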

[Figures] Sample 1; Sample 2; Sample 3; Sample 4; Sample 5

Part 1.6: Classifier-Free Guidance (CFG)

I implemented Classifier-Free Guidance (CFG) using the diffusion model (stage_1.unet) to enhance image quality by combining conditional and unconditional noise estimates. With a CFG scale of 7, I generated five high-quality images from random noise, showcasing improved results compared to standard sampling techniques.
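
The guidance step itself is one line (a sketch; gamma = 7 matches the scale used here):

    import torch

    def cfg_noise_estimate(eps_uncond: torch.Tensor, eps_cond: torch.Tensor,
                           gamma: float = 7.0) -> torch.Tensor:
        # Extrapolate past the conditional estimate, away from the
        # unconditional one, by the guidance scale gamma.
        return eps_uncond + gamma * (eps_cond - eps_uncond)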

[Figures] Sample 1 with CFG; Sample 2 with CFG; Sample 3 with CFG; Sample 4 with CFG; Sample 5 with CFG

Part 1.7: Image-to-image Translation

I implemented image-to-image translation (SDEdit) using the pretrained diffusion model (stage_1.unet) and Classifier-Free Guidance (CFG). The original image is noised to an intermediate timestep and then iteratively denoised, producing "edits" that look progressively more like the original as the starting index i_start increases. Using starting indices [1, 3, 5, 7, 10, 20], I generated edits with the prompt "a high quality photo."
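
The SDEdit procedure reduces to two steps, sketched below (forward and strided_timesteps are as above; iterative_denoise_cfg is an assumed name for the CFG loop from Part 1.6):

    # Noise the input to the timestep at index i_start of the strided
    # schedule, then denoise from there with the CFG sampling loop.
    t_start = strided_timesteps[i_start]
    x_noisy = forward(im, t_start, alphas_cumprod)
    edited = iterative_denoise_cfg(x_noisy, i_start=i_start)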

[Figures] Campanile: SDEdit with i_start = 1, 3, 5, 7, 10, 20; original Campanile
[Figures] Cube: SDEdit with i_start = 1, 3, 5, 7, 10, 20; original Cube
[Figures] Playground: SDEdit with i_start = 1, 3, 5, 7, 10, 20; original Playground

Part 1.7.1: Editing Hand-Drawn and Web Images

I used the pretrained diffusion model (stage_1.unet) with Classifier-Free Guidance (CFG) to transform non-realistic images, such as sketches and web images, into natural-looking images. Using the interaction tool provided, I edited one web image and two hand-drawn images by applying iterative denoising at noise levels [1,3,5,7,10,20]. The results demonstrate the model's ability to project creative edits onto the natural image manifold.

[Figures] Web image (Scene): edits at i_start = 1, 3, 5, 7, 10, 20; original Scene
[Figures] Hand-drawn image (Pavilion): edits at i_start = 1, 3, 5, 7, 10, 20; original Pavilion
[Figures] Hand-drawn image (Triangle): edits at i_start = 1, 3, 5, 7, 10, 20; original Triangle

Part 1.7.2: Inpainting

I implemented inpainting using the pretrained diffusion model (stage_1.unet) with Classifier-Free Guidance (CFG), following the RePaint paper. Given a binary mask, the model generates new content inside the masked region while preserving the unmasked region: after each denoising step, the unmasked pixels are reset to an appropriately noised copy of the original image. The figures show the test image inpainted with the given mask, plus two additional images inpainted with their own masks.
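
The mask constraint, applied after every denoising step, looks like this sketch (reusing the forward function from Part 1.1; mask == 1 marks the region to fill):

    import torch

    def inpaint_step(x_t, t, orig, mask, alphas_cumprod):
        # Keep newly generated content only inside the mask, and reset
        # everything outside it to an appropriately noised original.
        return mask * x_t + (1 - mask) * forward(orig, t, alphas_cumprod)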

[Figures] Campanile; Mask; Hole to Fill; Campanile Inpainted
[Figures] Cube; Mask; Hole to Fill; Cube Inpainted
[Figures] Playground; Mask; Hole to Fill; Playground Inpainted

Part 1.7.3: Text-Conditional Image-to-image Translation

I implemented text-conditional image-to-image translation using the pretrained diffusion model (stage_1.unet) with Classifier-Free Guidance (CFG). By combining iterative denoising with a text prompt, the model pulls the noisy image toward the text description while retaining features of the original image. The process was applied at noise levels [1, 3, 5, 7, 10, 20] with three prompts:
1. "a rocket ship": the Campanile test image was guided toward a rocket-like tower.
2. "a pencil": the cube image was translated to resemble a pencil-like cube.
3. "a photo of a hipster barista": the playground image was edited to blend features of the original playground with the prompt.

[Figures] "a rocket ship": edits at noise levels 1, 3, 5, 7, 10, 20; original Campanile
[Figures] "a pencil": edits at noise levels 1, 3, 5, 7, 10, 20; original Cube
[Figures] "a photo of a hipster barista": edits at noise levels 1, 3, 5, 7, 10, 20; original Playground

Part 1.8: Visual Anagrams

I implemented Visual Anagrams using the pretrained diffusion model (stage_1.unet) with Classifier-Free Guidance (CFG). At each denoising step, two noise estimates are computed: one for the first prompt on the upright image, and one for the second prompt on the vertically flipped image (the latter estimate is flipped back before use). Averaging the two estimates and denoising iteratively yields an image that reads as one prompt upright and as the other when flipped upside down. I generated three anagrams from the prompt pairs: "an oil painting of an old man" / "an oil painting of people around a campfire"; "a photo of a hipster barista" / "an oil painting of an old man"; and "a pencil" / "a man wearing a hat".
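
The combined noise estimate, sketched (the unet(x, t, prompt_embedding) call signature is an assumption for illustration):

    import torch

    def anagram_noise_estimate(unet, x_t, t, emb1, emb2):
        # Estimate for prompt 1 on the upright image, averaged with the
        # flipped-back estimate for prompt 2 on the vertically flipped image.
        eps1 = unet(x_t, t, emb1)
        eps2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, emb2), dims=[-2])
        return (eps1 + eps2) / 2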

[Figures] Visual anagram 1: "An Oil Painting of an Old Man" / "An Oil Painting of People around a Campfire"
[Figures] Visual anagram 2: "A Photo of a Hipster Barista" / "An Oil Painting of an Old Man"
[Figures] Visual anagram 3: "A Pencil" / "A Man Wearing a Hat"

Part 1.9: Hybrid Images

I implemented Hybrid Images using the pretrained diffusion model (stage_1.unet) with Classifier-Free Guidance (CFG). By combining noise estimates from two prompts—"a lithograph of a skull" and "a lithograph of waterfalls"—the model creates a composite noise estimate: low frequencies from one estimate are blended with high frequencies from the other using a Gaussian blur (kernel size 33, sigma 2). The resulting hybrid image appears as a skull from afar and as waterfalls up close. I also tried two more prompt pairs: "a pencil" with "a rocket ship," and "an oil painting of people around a campfire" with "a photo of a dog."
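
The composite estimate, sketched (eps1 and eps2 are the noise estimates for the two prompts):

    import torchvision.transforms.functional as TF

    def hybrid_noise_estimate(eps1, eps2, kernel_size=33, sigma=2.0):
        # Low frequencies from the first estimate plus high frequencies
        # (original minus its blur) from the second.
        low = TF.gaussian_blur(eps1, kernel_size=kernel_size, sigma=sigma)
        high = eps2 - TF.gaussian_blur(eps2, kernel_size=kernel_size, sigma=sigma)
        return low + high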

[Figures] Hybrid image of a skull and a waterfall; hybrid image of a rocket ship and a pencil; hybrid image of an oil painting of people around a campfire and a dog

Part B: Diffusion Models from Scratch

Overview for Part B

In Part B, I implement a diffusion model from scratch and train it on the MNIST dataset.

Part 1: Training a Single-Step Denoising UNet

Part 1.1: Implementing the UNet

I implemented a UNet-based architecture as the denoiser for this part. Following the provided architecture diagram, the UNet consists of downsampling and upsampling blocks with skip connections, enabling detailed reconstruction of noisy images. Key components include convolutional layers (Conv), pooling-based Flatten/Unflatten layers at the bottleneck, and concatenation operations (Concat) for the skip connections. The model uses BatchNorm for normalization and GELU as the activation function, with the hidden dimension set to 128.
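
A sketch of one building block in this pattern (the exact block layout follows the project spec):

    import torch.nn as nn

    class Conv(nn.Module):
        # Conv + BatchNorm + GELU, preserving spatial resolution.
        def __init__(self, in_ch: int, out_ch: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.GELU(),
            )

        def forward(self, x):
            return self.net(x)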

Part 1.2: Using the UNet to Train a Denoiser

In Part 1.2, the goal was to train the UNet to map noisy images z = x + σϵ back to clean images x. The training data consisted of clean MNIST images paired with noisy versions generated at noise level σ = 0.5. The model was optimized with an L2 loss to minimize reconstruction error, and noisy images were generated on the fly during training to improve generalization.

[Figure] Varying levels of noise on MNIST digits

Part 1.2.1: Training

The UNet was trained on the MNIST dataset for 5 epochs using a batch size of 256 and the Adam optimizer (learning rate 10^(−4)). Loss curves were tracked during training, and denoised outputs were visualized after the 1st and 5th epochs. Results demonstrated clear improvement in the model's ability to recover clean images, highlighting effective learning of denoising tasks.
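
The training loop, sketched (model and train_loader are assumed to be the UNet above and an MNIST DataLoader with batch size 256):

    import torch
    import torch.nn.functional as F

    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(5):
        for x, _ in train_loader:
            z = x + 0.5 * torch.randn_like(x)   # noisy input, sigma = 0.5
            loss = F.mse_loss(model(z), x)      # L2 reconstruction loss
            opt.zero_grad()
            loss.backward()
            opt.step()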

[Figures] Training loss curve; results on test-set digits after 1 epoch; results on test-set digits after 5 epochs

Part 1.2.2: Out-of-Distribution Testing

To evaluate robustness, the trained UNet was tested on MNIST images noised with varying σ values ([0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]). The denoiser performed well for noise levels close to the training distribution (σ = 0.5), with performance gradually degrading as the noise deviated further, highlighting both the model's limits and its partial generalization to unseen noise levels.

[Figures] Results on digits from the test set with varying noise levels

Part 2: Training a Diffusion Model

Part 2.1: Adding Time Conditioning to UNet

In this part, I extended the UNet architecture from Part 1 by incorporating time-conditioning to predict noise at various timesteps t. This involved normalizing the scalar t to [0, 1] and embedding it into the network using FCBlock layers. These embeddings modulated key components like unflatten and upconv layers, enabling the UNet to adapt its outputs based on t. This adjustment allowed the model to handle noise with varying variance over time, making it suitable for iterative denoising. The rest of the UNet structure remained unchanged from Part 1.
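
A sketch of the embedding block (the exact wiring into the unflatten and upconv feature maps follows the project spec):

    import torch.nn as nn

    class FCBlock(nn.Module):
        # Small MLP mapping the normalized scalar t in [0, 1] to a vector
        # used to modulate intermediate feature maps.
        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, out_dim),
                nn.GELU(),
                nn.Linear(out_dim, out_dim),
            )

        def forward(self, t):
            return self.net(t)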

Part 2.2: Training the UNet

Using the modified UNet, I trained the model on the MNIST dataset with random noise added dynamically across 300 timesteps t. For each training step, I generated noisy images x_t using the DDPM schedule and optimized the UNet to predict the noise ϵ. The Adam optimizer with an exponentially decaying learning rate ensured stable convergence across 20 epochs. By conditioning the UNet on t, the model learned to denoise images iteratively, effectively reversing the noising process. The time-conditioned UNet demonstrated robust performance in noise prediction, producing clean results from noisy inputs.
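
One training step, sketched (x is a batch of clean MNIST images; alpha_bar is the precomputed DDPM schedule of length T = 300):

    import torch
    import torch.nn.functional as F

    T = 300
    t = torch.randint(1, T, (x.shape[0],))
    abar = alpha_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = abar.sqrt() * x + (1 - abar).sqrt() * eps   # DDPM forward process
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)  # predict the added noise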

[Figure] Time-conditioned UNet training loss curve

Part 2.3: Sampling from the UNet

In Part 2.3, I implemented the DDPM sampling process based on Algorithm B.2 to generate images from noise using the time-conditioned UNet. The process starts from a pure-noise tensor x_T and iteratively refines it over T = 300 timesteps using the predicted noise ϵ_θ. Leveraging the precomputed schedules for β, α, and α_bar, each step combines the deterministic posterior mean with stochastic noise to compute x_{t-1}. The implementation supports intermediate visualizations and outputs the final clean images. Results after 5 and 20 training epochs are compared below, showing the model's progression.
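
The sampling loop, sketched (the unet(x, t) call with a normalized timestep is an assumption about my wrapper; betas, alphas, and alpha_bar are the precomputed schedules):

    import torch

    @torch.no_grad()
    def sample(unet, betas, alphas, alpha_bar, T=300):
        x = torch.randn(1, 1, 28, 28)  # x_T ~ N(0, I)
        for t in range(T - 1, 0, -1):
            eps = unet(x, torch.tensor([t / T]))
            # Clean-image estimate, then the DDPM posterior mean.
            x0 = (x - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
            mean = (alpha_bar[t - 1].sqrt() * betas[t] / (1 - alpha_bar[t])) * x0 \
                 + (alphas[t].sqrt() * (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t])) * x
            z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
            x = mean + betas[t].sqrt() * z
        return x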

[Figures] Time-conditioned samples after epoch 5 and epoch 20

Part 2.4: Adding Class-Conditioning to UNet

In this part, I extended the UNet to include class-conditioning by introducing two additional FCBlock layers for the class vector c, encoded as a one-hot vector. During training, c is dropped 10% of the time (replaced with the zero vector) so the model also learns unconditional generation. The model predicts the noise ϵ_θ(x_t, t, c), conditioned on both the timestep t and the class c. Loss computation follows Algorithm B.3, which incorporates the dropout masking of c. The training loss curve showed progressive improvement across epochs.
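
The conditioning and dropout, sketched (c is a batch of integer labels; x_t, t_norm, and eps come from the Part 2.2 training step):

    import torch
    import torch.nn.functional as F

    c_onehot = F.one_hot(c, num_classes=10).float()
    keep = (torch.rand(c.shape[0], 1) >= 0.1).float()
    c_onehot = c_onehot * keep                 # all-zero vector == unconditional
    loss = F.mse_loss(unet(x_t, t_norm, c_onehot), eps)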

[Figure] Class-conditioned UNet training loss curve

Part 2.5: Sampling from the Class-Conditioned UNet

In Part 2.5, the class-conditioned UNet was used to generate images of specific digit classes (0-9) using classifier-free guidance with a guidance scale γ = 5.0. The model predicts the noise ϵ_θ(x_t, t, c), where t is the timestep and c is the class label. Sampling begins from random noise x_T ∼ N(0, I) and iteratively denoises it over T = 300 steps. At each step, both the unconditional (c = 0) and conditional (c ≠ 0) noise estimates are computed and combined using the guidance scale, and the output is clamped to [0, 1] for realistic results. The visualizations below show grids of digits, with 4 samples per class, after 5 and 20 epochs, demonstrating improved generation quality over training.
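
The guidance combination at each sampling step, sketched with the same assumed names as above:

    import torch

    eps_uncond = unet(x, t_norm, torch.zeros_like(c_onehot))
    eps_cond = unet(x, t_norm, c_onehot)
    eps = eps_uncond + 5.0 * (eps_cond - eps_uncond)   # gamma = 5.0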

[Figures] Class-conditioned samples after epoch 5 and epoch 20

Reflection

In this project, I learned how to use and train diffusion models, and specifically how to add time conditioning and class conditioning to a UNet.

Bells and Whistles 1

For part 2 of the Bells and Whistles in Part A ("create something cool with what you learned in this project"), I made a hybrid image of a man sitting with a dog around a campfire.

[Figure] Hybrid image of a man sitting with a dog around a campfire