Generative Adversarial Networks (GANs)
A GAN trains two networks against each other:

- Generator G: takes a random noise vector z and tries to produce a realistic image G(z).
- Discriminator D: takes an image and tries to decide whether it's real (from the dataset) or fake (from the generator); a minimal training step is sketched below.
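To make the adversarial setup concrete, here is a minimal PyTorch training step. The shapes are assumptions for the sketch: G maps (batch, z_dim) noise to images, D maps images to (batch, 1) logits, and the optimizers are supplied by the caller.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim=128):
    """One adversarial round: update D on real vs. fake, then update G."""
    b = real.size(0)

    # Discriminator: push real images toward 1, generated images toward 0.
    fake = G(torch.randn(b, z_dim)).detach()  # block gradients into G here
    d_loss = (F.binary_cross_entropy_with_logits(D(real), torch.ones(b, 1))
              + F.binary_cross_entropy_with_logits(D(fake), torch.zeros(b, 1)))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator: non-saturating loss -- make D label fresh samples as real.
    g_loss = F.binary_cross_entropy_with_logits(
        D(G(torch.randn(b, z_dim))), torch.ones(b, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```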
Strengths
- Fast sampling: one forward pass through G produces an image. No iterative denoising.
- High visual fidelity: StyleGAN2/3 remain extremely strong on constrained domains (faces, cars) with crisp, photorealistic outputs.
- Latent space is well-behaved: interpolation in z gives smooth morphs between generated images (sketched below); editing directions correspond to semantic attributes (age, pose, lighting).
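To illustrate that latent behavior, a minimal sketch of interpolating between two noise vectors; G is any generator mapping latents to images, and plain lerp is used here even though slerp is often preferred for Gaussian latents:

```python
import torch

@torch.no_grad()
def interpolate(G, z0, z1, steps=8):
    """Decode a straight line between two latent codes into a morph sequence."""
    frames = [G((1 - a) * z0 + a * z1) for a in torch.linspace(0, 1, steps)]
    return torch.stack(frames)  # endpoints are G(z0) and G(z1)
```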
Weaknesses
- Training instability: the minimax loss is notoriously brittle. Mode collapse (the generator produces only a narrow slice of the distribution) is common.
- Hard to condition on text: GANs were designed for unconditional or class-conditional generation. Full text-to-image conditioning never worked as well as with diffusion.
- Low diversity at scale: struggles to cover very broad distributions (e.g., “all of the internet’s images”) compared to diffusion.
Diffusion Models
Diffusion models take a very different route. The key idea: it's easy to destroy structure by adding noise, and we can train a network to reverse that process one small step at a time.

Forward process (fixed, no learning)
Start with a real image x_0 and repeatedly add a tiny bit of Gaussian noise:

x_t = √(1 − β_t) · x_{t−1} + √β_t · ε,   ε ~ N(0, I)

where β_t is a small, fixed variance schedule. Equivalently, in closed form, x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε with ᾱ_t = Π_{s≤t} (1 − β_s). After T steps, x_T looks like random noise. This process is defined analytically; no learning is involved.
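A minimal sketch of this forward process, assuming a linear β schedule and image tensors of shape (batch, C, H, W); the closed form lets training jump straight from x_0 to any x_t:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear schedule (an assumption)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # ᾱ_t = Π_{s≤t} (1 − β_s)

def q_sample(x0, t):
    """Sample x_t ~ N(√ᾱ_t · x_0, (1 − ᾱ_t) I) in one shot, skipping steps 1..t−1."""
    a = alphas_bar[t].reshape(-1, *([1] * (x0.dim() - 1)))  # broadcast over pixels
    eps = torch.randn_like(x0)
    return a.sqrt() * x0 + (1 - a).sqrt() * eps, eps
```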
Reverse process (learned)
Train a neural network ε_θ to predict the noise that was added at step t, given the noisy image x_t and the step index t:

L = E_{x_0, ε, t} ‖ε − ε_θ(x_t, t)‖²

Generation then runs the process in reverse: start from pure noise x_T and iteratively denoise:

x_{t−1} = (x_t − (β_t / √(1 − ᾱ_t)) · ε_θ(x_t, t)) / √(1 − β_t) + σ_t · z,   z ~ N(0, I)

The step count T is typically 20–1000 depending on the sampler; modern schedulers (DDIM, DPM-Solver) get good quality in ~20 steps.
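Putting the two halves together: a sketch of one training step (plain MSE on the noise) and of ancestral DDPM sampling, reusing T, betas, alphas_bar, and q_sample from the forward-process sketch; eps_theta is a hypothetical denoiser taking (x_t, t).

```python
import torch
import torch.nn.functional as F

def diffusion_train_step(eps_theta, opt, x0):
    """Draw a random timestep per image, noise the image, regress the noise."""
    t = torch.randint(0, T, (x0.size(0),))
    x_t, eps = q_sample(x0, t)
    loss = F.mse_loss(eps_theta(x_t, t), eps)  # the simple MSE objective
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def ddpm_sample(eps_theta, shape):
    """Start from pure noise x_T and apply the learned reverse step T times."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps = eps_theta(x, torch.full((shape[0],), t))
        x = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / (1 - betas[t]).sqrt()
        if t > 0:  # add fresh noise except at the last step (σ_t² = β_t choice)
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```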
Conditioning: How Text Enters the Picture
Text-to-image works by conditioning the denoiser on a text embedding, typically from a frozen CLIP text encoder (the same CLIP used in open-vocabulary detection). The embedding enters through cross-attention layers: image features act as queries, text tokens as keys and values, so every denoising step can look at the prompt.
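A minimal sketch of such a cross-attention layer; the dimensions (320 image channels, 768-dim text embeddings, 8 heads) are illustrative of SD-style models, not fixed by the architecture:

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Image latents (queries) attend to text-token embeddings (keys/values)."""
    def __init__(self, dim=320, text_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                          vdim=text_dim, batch_first=True)

    def forward(self, img_tokens, text_tokens):
        # img_tokens: (b, h*w, dim); text_tokens: (b, seq, text_dim)
        out, _ = self.attn(img_tokens, text_tokens, text_tokens)
        return out

Latent Diffusion (Stable Diffusion)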
Running diffusion directly in pixel space is expensive. Stable Diffusion (Rombach et al., 2022) instead:

- Uses a pretrained VAE to compress 512×512×3 images into a 64×64×4 latent.
- Runs the diffusion process in that latent space, with 64× fewer spatial positions, which is much faster.
- Decodes the final latent back to pixels with the VAE decoder.
The full stack: CLIP text encoder + U-Net / DiT denoiser + VAE decoder.
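For orientation, running the whole stack with Hugging Face diffusers looks roughly like this (the model id and step count are illustrative; check the library docs for current names):

```python
from diffusers import StableDiffusionPipeline

# One call downloads the full stack: CLIP text encoder, U-Net, VAE.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe("a photorealistic red car at sunset",
             num_inference_steps=30).images[0]
image.save("car.png")
```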
Strengths
- State-of-the-art quality and diversity on open-ended distributions.
- Natural text conditioning via cross-attention.
- Rich control ecosystem: ControlNet, LoRA fine-tuning, inpainting, img2img, image-to-video — all bolt onto the same base model.
- Stable training compared to GANs.
Weaknesses
- Slow sampling: multiple denoising steps mean generation takes seconds, not milliseconds. Distillation and consistency models (LCM, SD Turbo, SDXL Turbo) bring it down to 1–4 steps at some quality cost.
- Less editable latent space than StyleGAN’s for fine-grained attribute control (though methods like textual inversion and prompt-to-prompt partially close the gap).
GANs vs Diffusion at a Glance
| Dimension | GANs | Diffusion |
|---|---|---|
| Sampling speed | 1 forward pass | 20–1000 steps (faster with distillation) |
| Training stability | Notoriously hard | Straightforward MSE objective |
| Text conditioning | Weak | Strong (cross-attention on CLIP embeddings) |
| Typical quality ceiling | Very high on narrow domains | Very high on open-domain, general use |
| Latent space editing | Rich (StyleGAN direction space) | Growing (textual inversion, prompt edits) |
| Production default in 2026 | Niche — narrow domains, real-time | General text-to-image, image-to-image |
The Family Beyond These Two
- VAEs (Variational Autoencoders): encode and decode images through a latent bottleneck. Sample quality is lower than GANs or diffusion, but a VAE encoder is exactly what Stable Diffusion uses to shrink the image before diffusion.
- Autoregressive image models: predict discrete image tokens one at a time (Parti); Muse is a close cousin that predicts masked tokens in parallel. Sequential decoding is slow but high-quality, and the discrete tokenizers are useful for multimodal LLMs.
- Flow matching / rectified flows: recent alternatives to the diffusion formulation with simpler training and fewer sampling steps (Stable Diffusion 3 is based on this); a training-step sketch follows.
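To show how simple that training objective is, a rectified-flow sketch; v_theta is a hypothetical velocity network taking (x_t, t), and sign/direction conventions vary across papers:

```python
import torch
import torch.nn.functional as F

def flow_matching_step(v_theta, opt, x0):
    """Regress the constant velocity along a straight data-to-noise path."""
    t = torch.rand(x0.size(0)).reshape(-1, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * noise   # linear path between data and noise
    target = noise - x0              # d(x_t)/dt is constant along the line
    loss = F.mse_loss(v_theta(x_t, t.flatten()), target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```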