Diffusion Models — Gen AI · Engineering Playbook

Origin and intuition

For most of deep learning’s history, image generation meant GANs (generative adversarial networks). A generator network produced images; a discriminator network tried to distinguish them from real images; the two trained in opposition. GANs produced state-of-the-art image quality from ~2014 to ~2021 but had notorious training instability: mode collapse, gradient pathologies, and a need for careful per-dataset hyperparameter tuning. The other generative family was VAEs (variational autoencoders), which trained stably but produced blurry samples.

Diffusion models, formalized by Sohl-Dickstein et al. in 2015 and made practical by Ho, Jain, and Abbeel in 2020 (“Denoising Diffusion Probabilistic Models” — DDPM), replaced both. The core idea is to define generation as the reverse of a known noising process. Start with a real image. Add a tiny bit of Gaussian noise. Repeat ~1000 times. After enough steps, the image is indistinguishable from pure Gaussian noise — the forward process has destroyed all information.

The reverse process, then, is: start with pure noise, and progressively remove noise to recover an image. If you can train a network to estimate how much noise was added at each step (or equivalently, what the original image was), you can run that network 1000 times to generate a new image from a fresh noise sample. The trick is that each step is easy — denoising a slightly noisy image is a much simpler problem than generating an image from scratch. The model only needs to learn the local structure of the noising process at each level of noise; the global generation emerges from composing many easy local steps.

This decomposition turned out to be far easier to train than GANs and produced higher-quality samples than VAEs. By 2022, diffusion was the dominant approach to image generation: Stable Diffusion (latent diffusion, Rombach et al.), DALL·E 2 (CLIP-conditioned diffusion in pixel space), Imagen, Midjourney. By 2024 it had extended to video (Sora, Runway Gen-3, Veo), audio (AudioLDM, Stable Audio), 3D (DreamFusion), molecule generation, and protein design (RFdiffusion).

Inputs and outputs

The interface is best understood in two phases:

Forward (noising) process: defined, not learned. Given a data sample x_0, define a sequence x_0, x_1, ..., x_T where each x_t is x_{t-1} scaled down slightly by sqrt(1 - β_t), plus Gaussian noise of variance β_t. With a carefully chosen noise schedule β_1, ..., β_T, you can show that x_T is essentially pure Gaussian noise: the data distribution has been “diffused” into the Gaussian.

Importantly, sampling x_t directly from x_0 is closed-form (no need to iterate):

x_t = sqrt(alpha_bar_t) · x_0 + sqrt(1 - alpha_bar_t) · ε,   ε ~ N(0, I)

where alpha_bar_t is the cumulative product of (1 - β_i) for i = 1, ..., t. This is what makes training tractable: you can sample a random timestep t for each training example, jump straight to x_t, and never iterate the forward chain.
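
As a minimal sketch of the closed-form jump described above, the snippet below builds a schedule and samples x_t directly from x_0. The linear schedule endpoints (1e-4 to 0.02), T = 1000, and the helper names `make_schedule` and `q_sample` are illustrative assumptions, not values taken from this text.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed endpoints) and its cumulative product."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)      # alpha_bar[t-1] = prod_{i<=t} (1 - beta_i)
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t directly from x_0 for a 1-indexed timestep t; also return the noise."""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bar[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps

rng = np.random.default_rng(0)
betas, alpha_bar = make_schedule()
x0 = np.ones((4, 8, 8))                      # toy stand-in for an image
x_mid, _ = q_sample(x0, 500, alpha_bar, rng)
x_T, _ = q_sample(x0, 1000, alpha_bar, rng)

# alpha_bar decays from just under 1 to nearly 0, so x_T keeps almost no signal.
print(alpha_bar[0], alpha_bar[-1])
```

Because `alpha_bar` is precomputed once, any timestep costs the same to sample, which is what lets a training loop pick t at random per example.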

Reverse (denoising) process: learned. A neural network ε_θ(x_t, t) predicts the noise ε that was mixed into x_0 to produce x_t (the ε in the closed-form expression above, not just the increment from x_{t-1}). Alternative parameterizations predict the clean x_0 or the velocity; given x_t and t, any one of them can be converted into the others, although they weight the loss differently across noise levels. At inference time, start with x_T ~ N(0, I) and iteratively apply the learned reverse step T times to produce x_0.
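
The reverse loop can be sketched as follows, under stated assumptions: the update is the standard DDPM ancestral step with variance β_t, the schedule is an assumed linear one, and `oracle_eps` is a hypothetical stand-in for a trained network. The oracle is only valid because the toy "dataset" is a single constant value MU, for which the true noise can be computed exactly.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling: start from noise and apply T learned reverse steps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                     # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):                 # t = T, ..., 1
        b, a, ab = betas[t - 1], alphas[t - 1], alpha_bar[t - 1]
        eps_hat = eps_model(x, t)
        mean = (x - b / np.sqrt(1.0 - ab) * eps_hat) / np.sqrt(a)
        noise = rng.standard_normal(shape) if t > 1 else 0.0   # no noise on the last step
        x = mean + np.sqrt(b) * noise
    return x

MU = 0.5                                               # toy data: every "image" equals MU
betas = np.linspace(1e-4, 0.02, 1000)                  # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def oracle_eps(x_t, t):
    """Exact noise for the single-point dataset; a real model would be learned."""
    ab = alpha_bar[t - 1]
    return (x_t - np.sqrt(ab) * MU) / np.sqrt(1.0 - ab)

sample = ddpm_sample(oracle_eps, (2, 4, 4), betas, np.random.default_rng(0))
print(sample.mean())
```

With a perfect noise predictor the chain lands on the data point; with a trained network the same loop produces new samples from the learned distribution.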

The model input is (x_t, t) — the noisy image and the timestep. The model output is the predicted noise (same shape as x_t). For conditional generation (text-to-image), the model additionally takes a conditioning vector — typically a text embedding from a frozen CLIP or T5 encoder — and the architecture cross-attends to it.

Architecture diagram

The two dominant architectural backbones for diffusion are U-Net and DiT (Diffusion Transformer).

U-Net backbone (the original). A symmetric encoder-decoder convolutional network with skip connections between matching-resolution layers. The model takes a noisy image, downsamples through several resolution stages while increasing channel count, then upsamples back to the original resolution. Skip connections preserve spatial detail; attention layers (self-attention within feature maps, cross-attention to text conditioning) sit at the lower-resolution stages where the receptive field can cover the whole image.

Input (noisy image, x_t)       Time embedding t
   │                              │
   ▼                              ▼
 ┌─────────────────────────────────────────┐
 │ Conv stem                                │
 └─────────────────────────────────────────┘
   │
   ▼
 ┌───────────────┐                         ┌───────────────┐
 │ Down block 1  │──── skip connection ───▶│  Up block 1   │
 │ (64×64, 320c) │                         │ (64×64, 320c) │
 └───────────────┘                         └───────────────┘
   │                                              ▲
   ▼                                              │
 ┌───────────────┐                         ┌───────────────┐
 │ Down block 2  │──── skip connection ───▶│  Up block 2   │
 │ (32×32, 640c) │                         │ (32×32, 640c) │
 └───────────────┘                         └───────────────┘
   │                                              ▲
   ▼                                              │
 ┌───────────────┐                         ┌───────────────┐
 │ Down block 3  │──── skip connection ───▶│  Up block 3   │
 │ (16×16,1280c) │  + cross-attn to text   │ (16×16,1280c) │
 └───────────────┘                         └───────────────┘
   │                                              ▲
   ▼                                              │
 ┌──────────────────────────────────────────┐
 │ Middle block (8×8, 1280c)                 │
 │  self-attn + cross-attn to text condition │
 └──────────────────────────────────────────┘
   │
   ▼
 Output: predicted noise (same shape as x_t)

DiT (Diffusion Transformer) backbone. Replace the U-Net with a plain transformer over image patches, the same way ViT replaced CNNs for image classification. Each patch becomes a token; timestep and conditioning enter as modulating signals (adaLN-zero in the original DiT paper). DiT scales better than U-Net at very large model sizes and is the backbone under modern systems like Stable Diffusion 3 and Sora.

Noisy image x_t  ──▶  Patchify (e.g., 2×2 patches)  ──▶  N tokens
                                                          │
Timestep t  ──▶  Embed  ──┐                              │
                          ├──▶  adaLN modulation  ──────▶│
Condition c ──▶  Embed  ──┘                              │
                                                          ▼
                                              ┌──────────────────────┐
                                              │  Transformer block 1  │
                                              │   (self-attn + MLP)   │
                                              └──────────────────────┘
                                                          │
                                                          ⋮ × L
                                                          ▼
                                              ┌──────────────────────┐
                                              │  Final layer          │
                                              │  → predicted noise    │
                                              └──────────────────────┘
                                                          │
                                                          ▼
                                              Reshape patches to image
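
The patchify and reshape steps in the diagram above can be sketched in NumPy. The function names, the channel-first layout, and the 4×64×64 latent with 2×2 patches are assumptions chosen for illustration; a real DiT also applies a learned linear projection and position embeddings to each token, which are omitted here.

```python
import numpy as np

def patchify(x, p):
    """(C, H, W) -> (N, p*p*C) tokens, with N = (H/p) * (W/p)."""
    C, H, W = x.shape
    x = x.reshape(C, H // p, p, W // p, p)
    x = x.transpose(1, 3, 2, 4, 0)                 # (H/p, W/p, p, p, C)
    return x.reshape((H // p) * (W // p), p * p * C)

def unpatchify(tokens, p, C, H, W):
    """Inverse of patchify: (N, p*p*C) tokens -> (C, H, W)."""
    x = tokens.reshape(H // p, W // p, p, p, C)
    x = x.transpose(4, 0, 2, 1, 3)                 # (C, H/p, p, W/p, p)
    return x.reshape(C, H, W)

latent = np.arange(4 * 64 * 64, dtype=float).reshape(4, 64, 64)
tokens = patchify(latent, p=2)                     # 32 * 32 tokens, each 2*2*4 values
restored = unpatchify(tokens, p=2, C=4, H=64, W=64)
print(tokens.shape)
```

Halving the patch size quadruples the token count, so patch size is the main knob trading sequence length (and attention cost) against spatial granularity.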

Latent diffusion (Stable Diffusion). Run the entire diffusion process not in pixel space but in the latent space of a pretrained VAE. Encode a 512×512 RGB image to a 64×64×4 latent, diffuse on that, decode back to pixels at the end. The denoiser sees 64× fewer spatial positions (48× fewer values) than pixel-space diffusion at the same output resolution, which makes each step far cheaper and is the reason high-resolution diffusion became practical on consumer GPUs.
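
The size reduction is simple arithmetic. The sketch below assumes an 8× downsampling factor per side and 4 latent channels, matching the 512×512 to 64×64×4 example; the function name `latent_savings` is hypothetical.

```python
def latent_savings(h=512, w=512, channels=3, downsample=8, latent_channels=4):
    """Compare tensor sizes for pixel-space vs latent-space diffusion."""
    lh, lw = h // downsample, w // downsample
    pixel_values = h * w * channels
    latent_values = lh * lw * latent_channels
    return {
        "latent_shape": (lh, lw, latent_channels),
        "spatial_ratio": (h * w) / (lh * lw),          # fewer positions to process
        "value_ratio": pixel_values / latent_values,   # fewer numbers overall
    }

stats = latent_savings()
print(stats)
```

Actual compute savings depend on the backbone: convolution cost scales roughly with the number of positions, while self-attention scales with its square, so these ratios are a guide rather than a measured speedup.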

Training objective

The simplified DDPM objective is mean squared error between the predicted noise and the actual noise that was added:

L = E_{t, x_0, ε}  [ || ε - ε_θ(x_t, t) ||² ]

For each training step: sample a real image x_0, sample a timestep t uniformly, sample noise ε ~ N(0, I), form x_t using the closed-form forward marginal, ask the model to predict ε from (x_t, t), compute MSE.
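
The per-step recipe above can be sketched as a loss function. This is a NumPy illustration under assumptions: a linear schedule, a toy dataset of constant images, and two hypothetical stand-in models (`zero_model`, `oracle_model`) in place of a trained network. In practice the loss would be computed in an autodiff framework and followed by a gradient step.

```python
import numpy as np

def ddpm_loss(eps_model, x0, alpha_bar, rng):
    """Monte Carlo estimate of the simplified DDPM loss for one batch."""
    batch = x0.shape[0]
    t = rng.integers(1, len(alpha_bar) + 1, size=batch)   # uniform t in 1..T per example
    eps = rng.standard_normal(x0.shape)                   # regression target
    ab = alpha_bar[t - 1].reshape(batch, 1, 1, 1)         # broadcast over C, H, W
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps      # closed-form forward marginal
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = np.full((64, 3, 8, 8), 0.5)                          # toy data: constant images

def zero_model(x_t, t):
    """Baseline that always predicts zero noise."""
    return np.zeros_like(x_t)

def oracle_model(x_t, t):
    """Exact noise for the constant dataset; only possible in this toy setting."""
    ab = alpha_bar[t - 1].reshape(-1, 1, 1, 1)
    return (x_t - np.sqrt(ab) * 0.5) / np.sqrt(1.0 - ab)

loss_zero = ddpm_loss(zero_model, x0, alpha_bar, np.random.default_rng(0))
loss_oracle = ddpm_loss(oracle_model, x0, alpha_bar, np.random.default_rng(0))
print(loss_zero, loss_oracle)
```

The zero predictor scores close to 1 because the target has unit variance, which gives a useful reference point: a model whose loss is near 1 has learned nothing.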

This is one of the cleanest training objectives in deep learning. No adversarial dynamics, no minimax, no balance between two networks. Just regression. Training is dramatically more stable than for GANs: the loss is a well-behaved regression target, and larger models and longer training tend to give better samples.

For conditional generation (text-to-image), classifier-free guidance (CFG) is standard. During training, drop the conditioning with probability ~10% so the model learns both ε_θ(x_t, t, c) (conditional) and ε_θ(x_t, t, ∅) (unconditional). At inference, sample with:

ε_guided = ε_θ(x_t, t, ∅) + w · (ε_θ(x_t, t, c) - ε_θ(x_t, t, ∅))

The guidance weight w (typically 5-15 for text-to-image) trades sample diversity for prompt adherence. CFG is what made text-to-image diffusion actually follow prompts — without it, prompts have weak influence on samples.
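
The guidance formula is a one-line extrapolation. The sketch below uses random arrays as stand-ins for the two network outputs; the function name `cfg_combine` is an assumption for illustration.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_u = rng.standard_normal((1, 4, 8, 8))   # stand-in for eps_theta(x_t, t, null)
eps_c = rng.standard_normal((1, 4, 8, 8))   # stand-in for eps_theta(x_t, t, c)

no_guidance = cfg_combine(eps_u, eps_c, w=0.0)    # equals the unconditional prediction
plain_cond = cfg_combine(eps_u, eps_c, w=1.0)     # equals the conditional prediction
strong = cfg_combine(eps_u, eps_c, w=7.5)         # extrapolates beyond the conditional one
```

Values of w above 1 amplify whatever the condition changes about the prediction, which is why high guidance improves prompt adherence at the expense of diversity.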

Score matching equivalence. The DDPM objective is mathematically equivalent to learning the score function ∇_x log p_t(x) of the noised data distribution at each timestep. This score-matching perspective (Song et al., 2021) unifies diffusion with earlier score-based generative models and reveals that you can derive both deterministic (DDIM, ODE-based) and stochastic (DDPM, SDE-based) samplers from the same trained network.
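
The link between noise and score can be checked numerically in one dimension. For a fixed x_0, the forward marginal q(x_t | x_0) is Gaussian with mean sqrt(alpha_bar_t) · x_0 and variance 1 - alpha_bar_t, so its score is -ε / sqrt(1 - alpha_bar_t). The sketch below verifies this with a finite difference; the specific numbers are arbitrary. Note that this is the conditional score: the network learns its average over all x_0 consistent with x_t.

```python
import math

def log_q(x_t, x0, alpha_bar_t):
    """Log density of q(x_t | x_0) = N(sqrt(ab) * x0, 1 - ab) in the scalar case."""
    var = 1.0 - alpha_bar_t
    mean = math.sqrt(alpha_bar_t) * x0
    return -0.5 * math.log(2 * math.pi * var) - (x_t - mean) ** 2 / (2 * var)

alpha_bar_t, x0, eps = 0.3, 1.2, -0.7          # arbitrary illustrative values
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1 - alpha_bar_t) * eps

h = 1e-5                                       # central finite difference of the log density
score_fd = (log_q(x_t + h, x0, alpha_bar_t) - log_q(x_t - h, x0, alpha_bar_t)) / (2 * h)
score_from_eps = -eps / math.sqrt(1 - alpha_bar_t)
print(score_fd, score_from_eps)
```

This is why a noise-prediction network can be reused as a score estimate by rescaling its output with -1 / sqrt(1 - alpha_bar_t).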

The diffusion-model design space has exploded since 2020. Major axes:

Sampler. The original DDPM takes 1000 reverse steps. DDIM (Song et al., 2020) is a deterministic sampler that gives high-quality samples in 50 steps. DPM-Solver (Lu et al., 2022) and friends push this to 10-25 steps. Consistency models (Song et al., 2023) and rectified flow (Liu et al., 2022) target few-step or even single-step sampling; rectified flow is the training formulation behind Stable Diffusion 3, and consistency-style distillation underlies many recent fast image models.
Backbone. U-Net (original) → DiT (transformer) is the dominant trajectory at scale. MM-DiT (multimodal DiT used in SD3) treats text tokens and image patch tokens as a single sequence with joint attention.
Latent vs pixel space. Latent diffusion (SD-family) is the practical default; pixel-space (Imagen, DALL·E 2 had pixel-space components) gives sharper detail but is much more expensive.
Noise schedules. Linear, cosine (Nichol and Dhariwal, 2021), Karras schedule, EDM formulation (Karras et al., 2022). The schedule shapes which noise levels get the most training signal.
Conditioning. Frozen CLIP text encoder (early SD), T5-XXL (Imagen; SD3 and SD 3.5 combine it with CLIP encoders), decoder-only LLMs such as Gemma-2B (Sana). Larger conditioning encoders generally produce better prompt following.
ControlNet (Zhang et al., 2023), T2I-Adapter (Mou et al., 2023). Inject spatial conditioning (depth maps, edge maps, pose) without retraining the base model. The basis of controllable image generation.
LoRA / Dreambooth fine-tuning. Low-rank or full fine-tuning to add concepts, styles, or characters. The basis of the entire customization ecosystem around Stable Diffusion.
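
To make the Sampler axis above concrete, here is a minimal sketch of a deterministic DDIM loop (the η = 0 case) over a strided subset of timesteps. The linear schedule, the even timestep spacing, and the `oracle_eps` stand-in (exact only for a toy dataset consisting of the single value MU) are assumptions for illustration, not a production sampler.

```python
import numpy as np

def ddim_sample(eps_model, shape, alpha_bar, num_steps, rng):
    """Deterministic DDIM: predict x_0, then re-noise to the next (lower) noise level."""
    T = len(alpha_bar)
    ts = np.linspace(T, 1, num_steps).round().astype(int)   # strided timesteps, T down to 1
    x = rng.standard_normal(shape)                          # x_T ~ N(0, I)
    for i, t in enumerate(ts):
        ab_t = alpha_bar[t - 1]
        ab_prev = alpha_bar[ts[i + 1] - 1] if i + 1 < len(ts) else 1.0
        eps_hat = eps_model(x, t)
        x0_hat = (x - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat
    return x

MU = 0.5                                                    # toy data: a single point
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

def oracle_eps(x_t, t):
    """Exact noise for the single-point dataset; a real model would be learned."""
    ab = alpha_bar[t - 1]
    return (x_t - np.sqrt(ab) * MU) / np.sqrt(1.0 - ab)

sample = ddim_sample(oracle_eps, (2, 4, 4), alpha_bar, num_steps=50,
                     rng=np.random.default_rng(0))
print(sample.mean())
```

The network is the same one trained with the DDPM objective; only the update rule and the number of network evaluations change, which is why samplers can be swapped without retraining.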

Pixel-space diffusion — operate on raw image pixels. Sharper detail at the high-resolution end, no VAE reconstruction artifacts. Compute scales with image resolution squared — expensive past 512×512.

Latent diffusion — encode to a compressed latent space (~64× spatial compression), diffuse there, decode at the end. ~10-50× cheaper. The trade-off is the VAE bottleneck — fine details limited by the VAE’s reconstruction quality. The pragmatic default at scale.

Practical considerations

Inference latency is dominated by step count. Each reverse step is a full forward pass through the network. A 50-step DDIM sample at 1024×1024 with a 2B-parameter U-Net runs in a few seconds on a modern GPU; the same with 1000 DDPM steps takes minutes. Step-reduction techniques (DDIM, DPM-Solver, consistency distillation) are the single biggest practical lever: going from 50 steps to 4 steps via consistency distillation is roughly a 12× wall-clock speedup (50 / 4 = 12.5, assuming per-step cost is unchanged) with modest quality loss.
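
A back-of-the-envelope latency model makes the step-count lever explicit. The per-forward-pass time below is a hypothetical placeholder, not a measurement, and the model ignores fixed costs such as text encoding and VAE decoding.

```python
def sampling_latency(num_steps, seconds_per_forward, passes_per_step=1):
    """Rough latency: every step costs one or more full forward passes."""
    return num_steps * passes_per_step * seconds_per_forward

per_forward = 0.05   # hypothetical seconds per forward pass
for steps in (1000, 50, 4):
    print(steps, "steps:", sampling_latency(steps, per_forward), "s")

speedup = sampling_latency(50, per_forward) / sampling_latency(4, per_forward)
print("speedup from 50 to 4 steps:", speedup)
```

Setting `passes_per_step=2` models unbatched classifier-free guidance, where each step needs both a conditional and an unconditional prediction.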

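Classifier-free guidance doubles the per-step compute. Each step computes both a conditional and an unconditional prediction. Batching the two predictions into one forward pass hides most of the added latency when the GPU has spare capacity, but not the compute; guidance-distilled models that sample without CFG (for example SDXL Turbo) remove the second prediction entirely.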
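
Batching the two predictions can be sketched as below. `toy_model` is a hypothetical stand-in with an assumed `(x, t, c)` signature, and the all-zeros array used as the null conditioning is also an assumption; real systems typically use the embedding of an empty prompt.

```python
import numpy as np

def guided_eps_batched(eps_model, x_t, t, cond, null_cond, w):
    """Compute both CFG branches in one forward pass over a doubled batch."""
    x_in = np.concatenate([x_t, x_t], axis=0)
    c_in = np.concatenate([null_cond, cond], axis=0)
    eps_uncond, eps_cond = np.split(eps_model(x_in, t, c_in), 2, axis=0)
    return eps_uncond + w * (eps_cond - eps_uncond)

calls = {"n": 0}

def toy_model(x, t, c):
    """Stand-in denoiser that counts how many times it is invoked."""
    calls["n"] += 1
    return 0.1 * x + c[:, :1, None, None]    # broadcast one conditioning value per example

rng = np.random.default_rng(0)
x_t = rng.standard_normal((2, 4, 8, 8))
cond = rng.standard_normal((2, 8))           # stand-in text embeddings
null_cond = np.zeros((2, 8))                 # assumed null conditioning

guided = guided_eps_batched(toy_model, x_t, 10, cond, null_cond, w=7.5)
print(calls["n"], guided.shape)
```

The doubled batch performs the same number of floating-point operations as two separate calls, so this trades memory for latency rather than reducing total work.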

Memory at training. Diffusion models train on noisy images; the gradient pass is the same shape as standard image classification training. Memory budget at 512×512 latent training (64×64 latent resolution) is comparable to training a ViT on ImageNet at the same parameter count. Sora-style video diffusion at minute-long sequences is dramatically more demanding — video is the next compute frontier after frontier LLMs.

Mode coverage. Unlike GANs, diffusion models are far less prone to mode collapse: the regression objective pushes them to cover the whole data distribution rather than a few high-quality modes. This is both a strength (diverse samples, few missing modes) and a weakness (they will happily generate from the rare tail of the training data, including biases, watermarks, and memorized samples).

Memorization risk. Diffusion models can reproduce training images verbatim, particularly images that are duplicated many times in the dataset (Carlini et al., 2023). This is a real concern for commercial deployment. Mitigations: deduplicating training data, pretraining on cleaner corpora, and refusing prompts that name celebrities or trademarked entities at inference.

Why diffusion beats autoregressive for images

You could imagine training a transformer to autoregressively generate image pixels (or patches), the way GPT generates text, and people have tried (PixelCNN, ImageGPT, Parti). The problem is the order. Text has a natural left-to-right order; pixels don’t. Any raster-scan order forces early pixels to condition late pixels in a way that doesn’t match the two-dimensional structure of images. Diffusion’s approach, generating the whole image at once and refining it through denoising, turns out to be a much better inductive bias for spatial data. The recent trend toward generating sequences of latents with autoregressive transformers (Parti, MAGVIT, world models for video) is an interesting hybrid, but diffusion remains dominant for high-fidelity image generation.

Real-world deployments

Diffusion models power essentially every production image-generation system in 2026:

Midjourney v6+. Proprietary diffusion model, hosted-only. Widely considered the highest-quality consumer text-to-image system.
Stable Diffusion 3 / 3.5 / SDXL. Stability AI’s open-weights diffusion models, MM-DiT backbone in SD3+. The basis of the entire open-source image-generation ecosystem (ComfyUI, AUTOMATIC1111, InvokeAI).
DALL·E 3. OpenAI’s image model, integrated into ChatGPT. Closed weights; mechanism is diffusion in latent space with strong CLIP/T5 conditioning.
Google Imagen 3 / Imagen Video. Google’s diffusion models for image and video, integrated into Gemini.
Flux (Black Forest Labs, 2024). Open-weights diffusion model from former Stability AI researchers, currently among the best open models.
OpenAI Sora. Diffusion transformer over video latents. The model that demonstrated that coherent video generation of up to a minute was possible with diffusion.
Runway Gen-3, Google Veo, Kling, Hailuo. Production video diffusion models, all transformer-backboned diffusion in compressed latent space.
NVIDIA Edify, Adobe Firefly. Enterprise / creative-tool integrations.
AlphaFold 3, RFdiffusion. Biomolecular structure prediction and protein design: the same diffusion math applied to non-image domains. AlphaFold 3 uses a diffusion module to generate atomic coordinates; RFdiffusion has produced de novo proteins validated in wet labs.

The diffusion architecture has also leaked into less obvious places: weather forecasting (GraphCast and its diffusion-based extensions), audio generation (Stable Audio, AudioLDM), 3D asset generation (DreamFusion, Magic3D), and policy learning in robotics (Diffusion Policy).

Why diffusion suddenly stopped being mysterious in 2020

The 2015 Sohl-Dickstein paper introduced denoising diffusion. It was largely ignored for five years. DDPM (Ho et al., 2020) didn’t change the math; it changed the empirical recipe. Three things made the difference. First, parameterizing the network to predict the noise ε rather than the clean image x_0 made the loss landscape much friendlier (the noise target has unit variance at every timestep, whereas the scale of the clean-image signal inside x_t varies by orders of magnitude with t). Second, the cosine noise schedule (introduced shortly after, by Nichol and Dhariwal in 2021) put the model’s training signal where it mattered: at intermediate noise levels where the reverse process actually has to do work. Third, the U-Net backbone at the right scale (~hundreds of millions of parameters) with attention at lower resolutions was enough capacity to capture image distributions. None of these are deep theoretical advances; they’re empirical engineering that made the existing math work. The lesson generalizes: a lot of “new architectures” are old math finally fitted to enough compute and the right reparameterization.