In Part A, I use a pre-trained UNet, implement diffusion sampling loops, and apply them to other tasks such as inpainting and creating optical illusions.
For Part 0, I set the random seed to 180 to ensure consistent results across runs. Using the provided prompts ("an oil painting of a snowy mountain village," "a man wearing a hat," and "a rocket ship"), I sampled images from both stage 1 (64x64 resolution) and stage 2 (256x256 resolution). For each stage, I experimented with num_inference_steps set to 5, 20, and 40. The outputs showed varying levels of detail and alignment with the prompts across inference steps; specifically, image quality improves as num_inference_steps increases. In general, the stage 2 outputs are also higher quality than the stage 1 outputs.
In part 1, I wrote my own "sampling loops" that use the pretrained DeepFloyd denoisers.
I applied Gaussian blur filtering to remove noise from the noisy images at t=[250,500,750]. Using torchvision.transforms.functional.gaussian_blur, I experimented with kernel sizes and sigma values to produce denoised images. While the Gaussian blur reduced high-frequency noise, the results demonstrated the limitations of classical filtering, as it was unable to fully recover the details of the original image.
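As a rough sketch of this classical baseline (the variable names and the kernel size/sigma shown here are illustrative, not my exact final values):

```python
import torchvision.transforms.functional as TF

# Classical baseline: blur each noisy image to suppress high-frequency noise.
# noisy_images holds the noisy versions of the test image at t = 250, 500, 750.
for t, x_noisy in zip([250, 500, 750], noisy_images):
    x_blurred = TF.gaussian_blur(x_noisy, kernel_size=5, sigma=2.0)
    # x_blurred loses fine detail along with the noise.
```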
I utilized the pretrained diffusion model (stage_1.unet) to denoise the noisy images. The UNet predicted the noise ϵ, which I used to recover an estimate of the original clean image x_0 from the noisy image x_t via the forward-process relation x_0 = (x_t - sqrt(1 - α_bar_t)*ϵ)/sqrt(α_bar_t). The results for t=[250,500,750] showed a significant improvement in recovering the image, highlighting the superiority of the diffusion model over classical filtering. The text prompt "a high quality photo" was used for conditioning the UNet during denoising.
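A minimal sketch of the one-step estimate (assuming alphas_cumprod holds the precomputed α_bar schedule and eps is the UNet's noise prediction; the names are mine):

```python
import torch

def one_step_denoise(x_t, t, eps, alphas_cumprod):
    # x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)
    a_bar = alphas_cumprod[t]
    return (x_t - torch.sqrt(1 - a_bar) * eps) / torch.sqrt(a_bar)
```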
I implemented iterative denoising using the pretrained diffusion model (stage_1.unet). The process starts with a highly noisy image at timestep t, and noise is progressively removed step by step using a custom schedule (strided_timesteps). At each step, the model estimates the noise and reduces it, gradually recovering the original image. The text prompt is set as "a high quality photo." To optimize the process, I skipped timesteps using a stride of 30, reducing computational cost while maintaining quality. The results were compared to one-step denoising and Gaussian blur. Iterative denoising produced significantly cleaner images, demonstrating the model's ability to effectively handle high noise levels.
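One step of the iterative update, roughly as I implemented it (a sketch under the assumption that alphas_cumprod is the α_bar schedule and x0_est is the current clean-image estimate; the added variance term is omitted):

```python
def iterative_denoise_step(x_t, t, t_prev, x0_est, alphas_cumprod):
    # Move from the current (noisier) timestep t to the next timestep t_prev
    # in strided_timesteps by interpolating between x_t and the x_0 estimate.
    a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = a_bar_t / a_bar_prev   # per-step alpha
    beta = 1 - alpha
    x_prev = (torch.sqrt(a_bar_prev) * beta / (1 - a_bar_t)) * x0_est \
           + (torch.sqrt(alpha) * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t
    return x_prev  # plus a small noise term v_sigma in the full update
```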
I used the pretrained diffusion model (stage_1.unet) to generate images from random noise. By setting i_start = 0 in the iterative_denoise function, the model denoised pure noise step by step. Five images were generated using the text prompt "a high quality photo" and displayed as results.
I implemented Classifier-Free Guidance (CFG) using the diffusion model (stage_1.unet) to enhance image quality by combining conditional and unconditional noise estimates. With a CFG scale of 7, I generated five high-quality images from random noise, showcasing improved results compared to standard sampling techniques.
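The core of CFG is a single line: extrapolate from the unconditional estimate toward the conditional one. A sketch (assuming eps_cond and eps_uncond are the UNet's noise predictions for the text prompt and the empty prompt at the same x_t and t):

```python
def cfg_combine(eps_uncond, eps_cond, scale=7.0):
    # scale > 1 pushes the estimate past the conditional prediction,
    # trading diversity for image quality and prompt adherence.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```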
I implemented image-to-image translation using the pretrained diffusion model (stage_1.unet) and Classifier-Free Guidance (CFG). Starting from a noised copy of the original image, the model iteratively denoises it to create "edits"; the larger the starting index (i.e., the less noise added), the more the edit resembles the original image. Using noise levels [1,3,5,7,10,20], I generated edits with the prompt "a high quality photo."
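Conceptually the edit loop looks like this (a sketch; forward adds noise per the forward process and iterative_denoise_cfg is my CFG denoising loop, so the helper names are mine):

```python
# SDEdit-style edits: noise the original image to an intermediate timestep,
# then denoise from there. Larger i_start means less noise is added,
# so the edit stays closer to the original image.
for i_start in [1, 3, 5, 7, 10, 20]:
    t = strided_timesteps[i_start]
    x_t = forward(original_image, t)
    edit = iterative_denoise_cfg(x_t, i_start)
```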
I used the pretrained diffusion model (stage_1.unet) with Classifier-Free Guidance (CFG) to transform non-realistic images, such as sketches and web images, into natural-looking images. Using the provided interactive drawing tool, I edited one web image and two hand-drawn images by applying iterative denoising at noise levels [1,3,5,7,10,20]. The results demonstrate the model's ability to project creative edits onto the natural image manifold.
I implemented inpainting using the pretrained diffusion model (stage_1.unet) with Classifier-Free Guidance (CFG), following the RePaint paper. Using a binary mask, I replaced parts of an image with new content while preserving the unmasked regions. The process iteratively added noise and denoised the image while respecting the mask constraints. The figures include the test image inpainted (with a given mask) and two additional images edited using corresponding masks.
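The key extra step, applied after every denoising iteration, is to force the pixels outside the mask back to a freshly noised copy of the original image. A sketch (with mask equal to 1 where new content may be generated and forward adding noise for timestep t):

```python
# Keep the region outside the mask pinned to the original image (RePaint-style).
x_t = mask * x_t + (1 - mask) * forward(original_image, t)
```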
I implemented text-conditional image-to-image translation using the pretrained diffusion model (stage_1.unet) with Classifier-Free Guidance (CFG). By combining iterative denoising with text prompts, the model transforms noisy images to align with the text description while retaining features of the original image. The process was applied at noise levels [1,3,5,7,10,20]. The figures show the results for three prompts: 1. Rocket Ship: using the prompt "a rocket ship," the test image was guided toward a rocket-like Campanile. 2. Pencil: using the prompt "a pencil," the test image was translated to resemble a pencil-like cube. 3. Hipster Barista: using the prompt "a photo of a hipster barista," the test image was edited to blend features of the original playground with the text prompt.
I implemented Visual Anagrams using the pretrained diffusion model (stage_1.unet) with Classifier-Free Guidance (CFG). The first group of outputs combines two noise estimates guided by different prompts: "an oil painting of an old man" and "an oil painting of people around a campfire". The second group of outputs combines two noise estimates guided by different prompts: "a photo of a hipster barista" and "an oil painting of an old man". The third group of outputs combines two noise estimates guided by different prompts: "a pencil" and "a man wearing a hat". By flipping the image and corresponding noise estimate, averaging the results, and denoising iteratively, the model generates an image that appears differently when flipped upside down.
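A sketch of how the two noise estimates are combined (assuming eps_fn returns a CFG noise estimate for a given prompt; flipping is along the height axis):

```python
import torch

def anagram_noise_estimate(x_t, t, eps_fn, prompt_a, prompt_b):
    # Estimate noise for prompt_a on the upright image, and for prompt_b on the
    # flipped image (then flip that estimate back), and average the two.
    eps_a = eps_fn(x_t, t, prompt_a)
    eps_b = torch.flip(eps_fn(torch.flip(x_t, dims=[-2]), t, prompt_b), dims=[-2])
    return (eps_a + eps_b) / 2
```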
I implemented Hybrid Images using the pretrained diffusion model (stage_1.unet) with Classifier-Free Guidance (CFG). By combining noise estimates from two prompts—"a lithograph of a skull" and "a lithograph of waterfalls"—the model creates a composite noise estimate. Low frequencies from one noise estimate are blended with high frequencies from the other using a Gaussian blur (kernel size 33, sigma 2). The final hybrid image appears as a skull from afar and transforms into waterfalls when viewed up close. I also tried another two groups of prompts, which are "a pencil" with "a rocket ship" and "an oil painting of people around a campfire" with "a photo of a dog".
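A sketch of the composite noise estimate (again assuming eps_fn returns a CFG noise estimate per prompt; the Gaussian blur acts as the low-pass filter):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x_t, t, eps_fn, prompt_low, prompt_high):
    eps_low = eps_fn(x_t, t, prompt_low)    # visible from afar
    eps_high = eps_fn(x_t, t, prompt_high)  # visible up close
    lowpass = TF.gaussian_blur(eps_low, kernel_size=33, sigma=2)
    highpass = eps_high - TF.gaussian_blur(eps_high, kernel_size=33, sigma=2)
    return lowpass + highpass
```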
In Part B, I implement a diffusion model from scratch and train it on the MNIST dataset.
I implemented a UNet-based architecture as the denoiser for this part. Following the project instructions and the architecture diagrams on the website, the UNet consists of downsampling and upsampling blocks, enabling detailed reconstruction of noisy images. Key components include convolutional blocks (Conv), Flatten/Unflatten operations, and concatenation of skip connections (Concat). The model uses BatchNorm for normalization and GELU as the activation function, with the hidden dimension set to 128 for the diffusion model's network.
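The basic convolutional building block looks roughly like this (a sketch; the full network additionally contains down/up blocks, Flatten/Unflatten, and skip connections):

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    # 3x3 convolution followed by BatchNorm and GELU, preserving spatial size.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)
```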
In Part 1.2, the goal was to train the UNet to map noisy images z = x + σϵ to clean images x. The training data consisted of clean MNIST images and their corresponding noisy versions generated with a noise level of σ = 0.5. The model was optimized with an L2 loss to minimize the reconstruction error, and noisy images were generated dynamically during training to improve generalization.
The UNet was trained on the MNIST dataset for 5 epochs using a batch size of 256 and the Adam optimizer (learning rate 10^(−4)). Loss curves were tracked during training, and denoised outputs were visualized after the 1st and 5th epochs. Results demonstrated clear improvement in the model's ability to recover clean images, highlighting effective learning of denoising tasks.
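A minimal sketch of the training loop (variable names are mine; the noisy input is regenerated every batch):

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)
for epoch in range(5):
    for x, _ in train_loader:               # MNIST images; labels unused here
        z = x + 0.5 * torch.randn_like(x)   # dynamically noised input, sigma = 0.5
        loss = F.mse_loss(unet(z), x)       # L2 reconstruction loss against clean x
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```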
To evaluate robustness, the trained UNet was tested on MNIST images noised with varying σ values ([0.0,0.2,0.4,0.5,0.6,0.8,1.0]). The denoiser performed well for noise levels close to the training distribution (σ=0.5), with gradually reduced performance as noise deviated further. This testing highlighted the model's limitations and its adaptability to unseen noise levels.
In this part, I extended the UNet architecture from Part 1 by incorporating time-conditioning to predict noise at various timesteps t. This involved normalizing the scalar t to [0, 1] and embedding it into the network using FCBlock layers. These embeddings modulated key components like unflatten and upconv layers, enabling the UNet to adapt its outputs based on t. This adjustment allowed the model to handle noise with varying variance over time, making it suitable for iterative denoising. The rest of the UNet structure remained unchanged from Part 1.
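A sketch of the FCBlock and where its output enters the network (details abbreviated; the exact layer sizes follow the project diagram):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    # Small MLP that embeds the normalized timestep t in [0, 1].
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_ch, out_ch),
            nn.GELU(),
            nn.Linear(out_ch, out_ch),
        )

    def forward(self, t):
        return self.net(t)

# Inside the UNet forward pass, the embedding is broadcast over the spatial
# dimensions to modulate the unflatten and up-conv feature maps.
```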
Using the modified UNet, I trained the model on the MNIST dataset with random noise added dynamically across 300 timesteps t. For each training step, I generated noisy images x_t using the DDPM schedule and optimized the UNet to predict the noise ϵ. The Adam optimizer with an exponentially decaying learning rate ensured stable convergence across 20 epochs. By conditioning the UNet on t, the model learned to denoise images iteratively, effectively reversing the noising process. The time-conditioned UNet demonstrated robust performance in noise prediction, producing clean results from noisy inputs.
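One training step, roughly (a sketch; alpha_bar is the precomputed DDPM schedule with alpha_bar[0] = 1, and the UNet takes the normalized timestep t/T):

```python
import torch
import torch.nn.functional as F

T = 300
t = torch.randint(1, T + 1, (x.shape[0],))                  # random timesteps
eps = torch.randn_like(x)
a_bar = alpha_bar[t].view(-1, 1, 1, 1)
x_t = torch.sqrt(a_bar) * x + torch.sqrt(1 - a_bar) * eps   # DDPM forward process
loss = F.mse_loss(unet(x_t, t.float() / T), eps)            # predict the noise
```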
In Part 2.3, I implemented the DDPM sampling process based on Algorithm B.2 to generate images from noise using the time-conditioned UNet. The process starts with a pure noise tensor x_T and iteratively refines it across T=300 timesteps using the predicted noise ϵ_θ. By leveraging the precomputed schedules for β, α, and α_bar, I combined the deterministic estimate and stochastic noise at each step to compute x_{t-1}. The implementation allows for intermediate visualizations and outputs final clean images. Results were compared after 5 and 20 training epochs, showcasing the model's progression.
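A sketch of the sampling loop (names are mine; beta, alpha, and alpha_bar are the precomputed schedules, indexed so that alpha_bar[0] = 1):

```python
import torch

@torch.no_grad()
def ddpm_sample(unet, beta, alpha, alpha_bar, T=300, shape=(1, 1, 28, 28)):
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        eps = unet(x, torch.full((shape[0],), t / T))        # predicted noise
        x0 = (x - torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha_bar[t])
        x = (torch.sqrt(alpha_bar[t - 1]) * beta[t] / (1 - alpha_bar[t])) * x0 \
          + (torch.sqrt(alpha[t]) * (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t])) * x \
          + torch.sqrt(beta[t]) * z
    return x
```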
In this part, I extended the UNet to include class-conditioning by introducing two additional FCBlock layers for the class vector c, which is encoded as a one-hot vector. During training, c is occasionally dropped (10% of the time) so the model also learns unconditional generation. The model predicts the noise ϵ_θ(x_t, t, c), conditioned on both the time t and the class c. Loss computation follows Algorithm B.3, which incorporates dropout masking for c. The training loss curve demonstrated progressive improvement across epochs.
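A sketch of the conditioning and dropout during training (helper names are mine; t_norm is the normalized timestep):

```python
import torch
import torch.nn.functional as F

# One-hot encode the class and drop it 10% of the time so the model also
# learns an unconditional (c = 0) noise estimate.
c = F.one_hot(labels, num_classes=10).float()
drop = (torch.rand(c.shape[0], 1) < 0.1).float()
c = c * (1 - drop)
loss = F.mse_loss(unet(x_t, t_norm, c), eps)
```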
In Part 2.5, the class-conditioned UNet was used to generate images of specific digit classes (0-9) using classifier-free guidance. To enhance the quality of class-conditional results, a guidance scale γ=5.0 was applied. The model predicted the noise ϵ_θ(x_t,t,c), where t is the timestep and c is the class label. Sampling began with random noise (x_T ∼ N(0,I)) and iteratively denoised it through T=300 steps using the guidance mechanism. Both unconditional (c=0) and conditional (c≠0) noise estimates were computed, and the final prediction combined them using the guidance scale. The output was clamped to [0, 1] for realistic results. Sampling visualizations include grids of digits, with 4 samples per class, after 5 and 20 epochs, showcasing improved generation quality over training.
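Inside the sampling loop, the guided estimate is formed as below (a sketch; c_onehot is the one-hot class vector and the DDPM update is the same as in Part 2.3):

```python
# Classifier-free guidance at sampling time with gamma = 5.0.
eps_cond = unet(x, t_norm, c_onehot)                        # class-conditional
eps_uncond = unet(x, t_norm, torch.zeros_like(c_onehot))    # c = 0, unconditional
eps = eps_uncond + 5.0 * (eps_cond - eps_uncond)
# ...apply the DDPM update with eps, then clamp the final image to [0, 1].
```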
In this project, I learned how to use and train diffusion models, in particular how to add time conditioning and class conditioning to a UNet.
For Part A Bells and Whistles part 2 ("Create something cool with what you learned in this project"), I made a hybrid image of a man sitting with a dog around a campfire.