Classification gives one label per image. Detection draws a box around each object. Segmentation goes one level deeper and labels every pixel — it’s what powers background removal, medical organ delineation, autonomous-driving road masks, and every “blur the background” feature on your phone. The canonical architecture for this job is the U-Net, and this page is built around it.

Three Flavors of Segmentation

Worth keeping straight because papers and APIs all assume you know the difference.
| Type | What it labels | Example task |
| --- | --- | --- |
| Semantic segmentation | Each pixel gets a class label — no instance distinction | “this pixel is road / sidewalk / building” |
| Instance segmentation | Each pixel gets a class label plus an instance ID | “this pixel is car #3” |
| Panoptic segmentation | Semantic labels for stuff (sky, road), instance labels for things (car, person) — unified | “this pixel is road”, “this pixel is person #2” |
Output shape is always a mask of the same spatial size as the input: (H, W) of class IDs for semantic, or (N, H, W) of binary masks for instance-level.
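If it helps to see those shapes concretely, here is a tiny PyTorch sketch; the class count, image size, and instance count are made up for illustration:

```python
import torch

K, H, W = 4, 256, 256                      # 4 classes, 256×256 image (illustrative numbers)

# Semantic: per-class scores for every pixel; argmax gives one class ID per pixel.
logits = torch.randn(K, H, W)              # (K, H, W)
semantic_mask = logits.argmax(dim=0)       # (H, W) of class IDs in [0, K)

# Instance-level: one binary mask per detected object, plus a class label per object.
N = 3                                      # three detected instances
instance_masks = torch.zeros(N, H, W, dtype=torch.bool)   # (N, H, W) binary masks
instance_labels = torch.tensor([1, 1, 2])                  # class ID for each instance
```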

Why Classification Networks Don’t Work Here

A classifier maps (C, H, W) → a vector of class scores. A segmenter has to produce (K, H, W) — one score per class per pixel. Three things break when you try to reuse a standard CNN:
  1. Spatial resolution collapses. A classification backbone aggressively downsamples with pooling and strides to produce a compact feature vector. If you simply upsample that final feature map back to the input size, you get blurry, low-frequency output — no fine detail.
  2. The fully-connected head expects a flattened vector, not a grid. You need to remove it and replace it with something spatial.
  3. Every pixel contributes to the loss. A single “is this a cat?” gradient doesn’t carry enough signal to teach the network where the cat’s ears end.
So the architecture needs two things: a way to recover spatial resolution on the output side, and a way to preserve the spatial detail that the encoder normally throws away.

U-Net: The Encoder-Decoder That Defined Segmentation

U-Net (Ronneberger et al., 2015) solves both problems with a symmetric encoder-decoder plus skip connections from encoder to decoder at every resolution. It was originally designed for biomedical microscopy — where it is still the workhorse — and has since spread to essentially every field that does segmentation (and, famously, it is the denoiser sitting inside every latent diffusion model).
        Encoder (contracting)                  Decoder (expanding)
        ─────────────────────                  ────────────────────

  Input (C, 256, 256)                          Output (K, 256, 256)
          │                                             ▲
          v                                             │
    ┌─ Conv block ─┐─────── skip ──────────────▶ Conv block (1/1)
    │              │                                    ▲
    │   pool ↓                             upsample ↑   │
    │              │                                    │
    └─ Conv block ─┘─────── skip ──────────▶ Conv block (1/2)
                   │                          ▲
       pool ↓                        upsample ↑
                   │                          │
         Conv block ─── skip ─────▶ Conv block (1/4)
                   │                ▲
       pool ↓              upsample ↑
                   │                │
              Bottleneck (deep features, low resolution)
What each part does:
  • Encoder (left side) — stack of conv + pool blocks, same shape as a classification CNN. Each step halves the spatial resolution and roughly doubles the channel count. Learns progressively abstract features.
  • Bottleneck — the lowest-resolution, most semantically abstract feature map. It encodes what is in the image; where it is has mostly been lost.
  • Decoder (right side) — a mirror of the encoder. At each stage it upsamples (via transposed convolution or bilinear-plus-conv) and applies a couple of conv layers.
  • Skip connections — at each resolution, the encoder’s feature map is concatenated into the corresponding decoder stage before its conv layers run. This is the whole trick: the decoder receives both semantic context from the bottleneck and precise spatial detail from the skips.
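If the skip-connection mechanics read more easily as code, here is a minimal two-stage U-Net sketch in PyTorch. The layer widths, bilinear upsampling, and class count are illustrative choices, not the exact configuration from the original paper:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3×3 convs with ReLU: the basic unit on both sides of the U.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=4):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)          # full resolution
        self.enc2 = conv_block(32, 64)             # 1/2 resolution
        self.bottleneck = conv_block(64, 128)      # 1/4 resolution
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # Decoder convs see upsampled features concatenated with the matching skip.
        self.dec2 = conv_block(128 + 64, 64)
        self.dec1 = conv_block(64 + 32, 32)
        self.head = nn.Conv2d(32, num_classes, 1)  # per-pixel class scores

    def forward(self, x):
        s1 = self.enc1(x)                            # skip at full resolution
        s2 = self.enc2(self.pool(s1))                # skip at 1/2 resolution
        b = self.bottleneck(self.pool(s2))           # deep, low-resolution features
        d2 = self.dec2(torch.cat([self.up(b), s2], dim=1))   # concat = skip connection
        d1 = self.dec1(torch.cat([self.up(d2), s1], dim=1))
        return self.head(d1)                         # (B, num_classes, H, W)

logits = TinyUNet()(torch.randn(1, 3, 256, 256))     # -> torch.Size([1, 4, 256, 256])
```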

Output and Loss

  • Output shape is (K, H, W) — one score per class per pixel. Take argmax along the class axis to get the predicted mask.
  • Loss is per-pixel cross-entropy, averaged over every pixel — exactly like classification, just H × W times per image. For imbalanced datasets (tiny foreground, huge background — common in medical and in defect detection) Dice loss or a Dice + cross-entropy hybrid is the default, because plain cross-entropy is dominated by the easy background pixels.
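A sketch of that hybrid loss; the smoothing constant and 50/50 weighting are common defaults rather than a fixed standard:

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, smooth=1.0, dice_weight=0.5):
    # logits: (B, K, H, W) raw per-class scores; target: (B, H, W) integer class IDs.
    # Per-pixel cross-entropy, exactly as in classification, averaged over all pixels.
    ce = F.cross_entropy(logits, target)

    # Soft Dice on class probabilities vs. one-hot targets; less dominated by easy background.
    probs = logits.softmax(dim=1)                                  # (B, K, H, W)
    one_hot = F.one_hot(target, num_classes=logits.shape[1])       # (B, H, W, K)
    one_hot = one_hot.permute(0, 3, 1, 2).float()                  # (B, K, H, W)
    dims = (0, 2, 3)                                                # sum over batch and pixels
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice = ((2 * intersection + smooth) / (cardinality + smooth)).mean()

    return dice_weight * (1 - dice) + (1 - dice_weight) * ce
```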

Training a U-Net

The training loop is identical to a classifier’s — forward pass, loss, backward pass, parameter update — just with a spatial loss. The worked example trains a small U-Net on a toy segmentation dataset so you can see the pipeline end to end. On a modest dataset (a few thousand images), a 4- or 5-stage U-Net reaches usable accuracy in minutes on a single GPU. That efficiency — strong results from a small network and little data — is exactly why it still dominates applied work.
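The loop itself, sketched in PyTorch. TinyUNet and dice_ce_loss refer to the sketches above, and train_loader stands in for whatever DataLoader yields (image, mask) batches in your setup:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyUNet(in_ch=3, num_classes=4).to(device)      # sketch model from above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    model.train()
    for images, masks in train_loader:        # images: (B, 3, H, W), masks: (B, H, W) class IDs
        images, masks = images.to(device), masks.to(device)
        logits = model(images)                # (B, K, H, W)
        loss = dice_ce_loss(logits, masks)    # spatial loss; otherwise a standard training step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```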

Beyond Vanilla U-Net

U-Net is the starting point; the family has grown and branched:
  • nnU-Net (2021) — a self-configuring U-Net pipeline that inspects the dataset and picks spacing, patch size, normalization, and architecture hyperparameters automatically. Wins medical segmentation benchmarks year after year with almost no tuning.
  • U-Net++, Attention U-Net — refinements to the skip connections (nested or attention-gated) that squeeze a few points of Dice out of hard datasets.
  • Swin UNETR, UNETR — replace the convolutional encoder with a vision transformer; stronger on large or complex volumes at the cost of more compute.
  • DeepLab v3+ (2018) — atrous (dilated) convolutions for large receptive fields without downsampling; strong on outdoor scenes.
  • Mask R-CNN (2017) — adds a per-ROI mask head to a detector; the standard move when you need instance segmentation with class labels.
  • Mask2Former, OneFormer (2022) — transformer decoders with mask queries that unify semantic / instance / panoptic in a single model.
For most applied work the decision tree is small: if you need semantic masks, start with a U-Net; if you need instance IDs, use Mask R-CNN or Mask2Former; if you’re in medical imaging, go straight to nnU-Net.

Segmentation Foundation Models: A Note on Recent Work

Recent work has reframed segmentation as “prompting for pixels”. Segment Anything (SAM, Meta 2023) is a large pretrained encoder plus a prompt-aware mask decoder: click a point or draw a box, get a zero-shot mask with no fine-tuning. SAM 2 (2024) extends the idea to video with memory attention. Grounded-SAM chains an open-vocabulary detector (Grounding DINO) into SAM so you can “segment everything labeled ‘hard hat’” from a text prompt. These models don’t replace U-Net — they complement it. SAM shines for interactive tools and for segmenting objects a custom model was never trained on; U-Net (and descendants) still dominate when you have a labeled dataset and a specific target domain. The rule of thumb: if you have labels, train a U-Net; if you don’t, prompt SAM.
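For a sense of what prompting looks like in practice, here is a rough sketch against Meta's segment-anything package; the checkpoint path, image, and click coordinates are placeholders:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Checkpoint path is a placeholder; the official weights are downloaded separately.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)   # stand-in for your RGB image (H, W, 3)
predictor.set_image(image)

# One positive click (label 1) at pixel (x=256, y=256); box prompts work the same way.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),
    point_labels=np.array([1]),
    multimask_output=True,                        # returns several candidate masks
)
best_mask = masks[scores.argmax()]                # (H, W) boolean mask
```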

When To Use What

| Situation | Recommendation |
| --- | --- |
| Specialized domain with labeled data (medical, industrial QA, satellite) | U-Net / nnU-Net / Swin UNETR |
| Instance-level masks with class labels | Mask R-CNN or Mask2Former |
| Panoptic scene understanding | Mask2Former / OneFormer |
| “Segment the thing I clicked” in an interactive tool | SAM |
| “Segment everything matching this text” with no fine-tuning | Grounded-SAM (Grounding DINO → SAM) |
| Video object segmentation / rotoscoping | SAM 2 |

Practical Implication

Segmentation’s production bottleneck is almost always labeled data, not architecture. A well-curated 500-image labeled dataset trained with a small U-Net will usually outperform a fancier model trained on 50 poorly-labeled ones. Spend the time on label quality first; swap architectures second.

❌ Antipattern

Reaching for a transformer-based Swin UNETR on a dataset of 200 images. The model has tens of millions of parameters, 200 images are nowhere near enough to constrain them, and training Dice will look great while test Dice collapses.

✅ Best Practice

Start with a small U-Net plus standard augmentation (random flips, small rotations, and — for medical — elastic deformations). Check whether accuracy plateaus because of the model or because of the data. If it’s the data, labeling is the highest-leverage next step. If it’s the model, then scale up.