Three Flavors of Segmentation
Worth keeping straight because papers and APIs all assume you know the difference.

| Type | What it labels | Example task |
|---|---|---|
| Semantic segmentation | Each pixel gets a class label — no instance distinction | “this pixel is road / sidewalk / building” |
| Instance segmentation | Each pixel gets a class label plus an instance ID | “this pixel is car #3” |
| Panoptic segmentation | Semantic labels for stuff (sky, road), instance labels for things (car, person) — unified | “this pixel is road”, “this pixel is person #2” |
Label shapes follow the same split: an (H, W) map of class IDs for semantic segmentation, or an (N, H, W) stack of binary masks for instance-level tasks.
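To keep the two formats straight, it helps to look at the array shapes directly. A minimal NumPy illustration with made-up sizes, not tied to any particular dataset:

```python
import numpy as np

H, W = 4, 6                          # toy image size
num_classes, num_instances = 3, 2

# Semantic label: one class ID per pixel -> a single (H, W) integer map.
semantic = np.zeros((H, W), dtype=np.int64)
semantic[2:, :] = 1                  # e.g. the bottom rows are "road"

# Instance-level label: one binary mask per object -> an (N, H, W) stack.
instance_masks = np.zeros((num_instances, H, W), dtype=bool)
instance_masks[0, 1:3, 2:5] = True   # car #1
instance_masks[1, 0:2, 0:2] = True   # car #2

print(semantic.shape, instance_masks.shape)   # (4, 6) (2, 4, 6)
```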
Why Classification Networks Don’t Work Here
A classifier maps (C, H, W) → a vector of class scores. A segmenter has to produce (K, H, W) — one score per class per pixel. Three things break when you try to reuse a standard CNN (the sketch after this list makes the first two concrete):
- Spatial resolution collapses. A classification backbone aggressively downsamples with pooling and strides to produce a compact feature vector. If you simply upsample that final feature map back to the input size, you get blurry, low-frequency output — no fine detail.
- The fully-connected head expects a flattened vector, not a grid. You need to remove it and replace it with something spatial.
- Every pixel contributes to the loss. A single “is this a cat?” gradient doesn’t carry enough signal to teach the network where the cat’s ears end.
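A small sketch of the first two points, assuming PyTorch and torchvision are available (the ResNet-18 backbone and the 21-class head are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn
import torchvision

# A classification backbone with its global-pool + fully-connected head removed.
backbone = torchvision.models.resnet18(weights=None)
features = nn.Sequential(*list(backbone.children())[:-2])

x = torch.randn(1, 3, 224, 224)
fmap = features(x)
print(fmap.shape)            # torch.Size([1, 512, 7, 7]) -- 32x downsampled

# A spatial "head": a 1x1 conv gives one score per class per (coarse) pixel...
num_classes = 21
head = nn.Conv2d(512, num_classes, kernel_size=1)
coarse_scores = head(fmap)   # (1, 21, 7, 7)

# ...and naive upsampling recovers the shape but only blurry, low-frequency detail.
scores = nn.functional.interpolate(
    coarse_scores, size=(224, 224), mode="bilinear", align_corners=False
)
print(scores.shape)          # torch.Size([1, 21, 224, 224])
```

The 1×1 convolution already gives a dense, spatial prediction; what it cannot recover is the detail thrown away by 32× downsampling, which is exactly the problem the next section's architecture addresses.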
U-Net: The Encoder-Decoder That Defined Segmentation
U-Net (Ronneberger et al., 2015) solves these problems with a symmetric encoder-decoder plus skip connections from encoder to decoder at every resolution. It was originally designed for biomedical microscopy — where it is still the workhorse — and has since spread to essentially every field that does segmentation (and, famously, it is the denoiser sitting inside every latent diffusion model).

- Encoder (left side) — stack of conv + pool blocks, same shape as a classification CNN. Each step halves the spatial resolution and roughly doubles the channel count. Learns progressively abstract features.
- Bottleneck — the lowest-resolution, highest-semantic feature map. What is in the image is encoded here; where is mostly lost.
- Decoder (right side) — a mirror of the encoder. At each stage it upsamples (via transposed convolution or bilinear-plus-conv) and applies a couple of conv layers.
- Skip connections — at each resolution, the encoder’s feature map is concatenated into the corresponding decoder stage before its conv layers run. This is the whole trick: the decoder receives both semantic context from the bottleneck and precise spatial detail from the skips (see the sketch after this list).
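A minimal PyTorch sketch of that pattern: two encoder stages, a bottleneck, two decoder stages with concatenated skips, and a 1×1 conv head. It is not the exact Ronneberger et al. architecture (which uses unpadded convolutions, cropping, and more stages); it only shows the shape of the idea.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convs per stage; padding keeps sizes aligned for the skip concatenations.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=2, base=32):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)                 # full resolution
        self.enc2 = conv_block(base, base * 2)              # 1/2 resolution
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 2, base * 4)    # 1/4 resolution: what, not where
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)          # input = upsampled + skip (concat)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, num_classes, 1)         # per-pixel class scores

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))    # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))   # skip connection
        return self.head(d1)                                # (N, K, H, W)

scores = TinyUNet()(torch.randn(1, 3, 64, 64))
print(scores.shape)   # torch.Size([1, 2, 64, 64])
```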
Output and Loss
- Output shape is (K, H, W) — one score per class per pixel. Take the argmax along the class axis to get the predicted mask.
- Loss is per-pixel cross-entropy, averaged over every pixel — exactly like classification, just H × W times per image. For imbalanced datasets (tiny foreground, huge background — common in medical and in defect detection) Dice loss or a Dice + cross-entropy hybrid is the default, because plain cross-entropy is dominated by the easy background pixels (a short sketch follows this list).
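A hedged sketch of that loss setup, assuming PyTorch, logits of shape (N, K, H, W), class-index targets of shape (N, H, W), and one simple soft-Dice formulation among many:

```python
import torch
import torch.nn.functional as F

def dice_loss(scores, target, eps=1e-6):
    # Soft Dice averaged over classes: 1 - 2*|pred * gt| / (|pred| + |gt|).
    num_classes = scores.shape[1]
    probs = scores.softmax(dim=1)                                    # (N, K, H, W)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                                                 # sum over batch and pixels
    intersection = (probs * one_hot).sum(dims)
    union = probs.sum(dims) + one_hot.sum(dims)
    return 1 - ((2 * intersection + eps) / (union + eps)).mean()

scores = torch.randn(2, 3, 64, 64)             # (N, K, H, W) raw logits from the network
target = torch.randint(0, 3, (2, 64, 64))      # (N, H, W) class ID per pixel

ce = F.cross_entropy(scores, target)           # per-pixel CE, averaged over every pixel
loss = ce + dice_loss(scores, target)          # common hybrid for imbalanced masks
pred = scores.argmax(dim=1)                    # (N, H, W) predicted mask
```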
Training a U-Net
The training loop is identical to a classifier’s — forward pass, loss, backward pass, parameter update — just with a spatial loss. The worked example trains a small U-Net on a toy segmentation dataset so you can see the pipeline end to end. On a modest dataset (a few thousand images), a 4- or 5-stage U-Net reaches usable accuracy in minutes on a single GPU. That efficiency — strong results from a small network and little data — is exactly why it still dominates applied work.
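A sketch of that loop, reusing the hypothetical TinyUNet and dice_loss from the sketches above and standing in synthetic tensors for a real DataLoader (all hyperparameters are placeholders):

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyUNet(in_ch=3, num_classes=2).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic stand-in data so the sketch runs end to end; swap in a real DataLoader.
images = torch.randn(8, 3, 64, 64)             # (N, 3, H, W) images
masks = torch.randint(0, 2, (8, 64, 64))       # (N, H, W) class IDs per pixel
train_loader = [(images[i:i + 4], masks[i:i + 4]) for i in range(0, 8, 4)]

for epoch in range(20):
    model.train()
    for batch_images, batch_masks in train_loader:
        batch_images, batch_masks = batch_images.to(device), batch_masks.to(device)

        logits = model(batch_images)           # (N, K, H, W)
        loss = F.cross_entropy(logits, batch_masks) + dice_loss(logits, batch_masks)

        optimizer.zero_grad()
        loss.backward()                        # same loop as a classifier, just a spatial loss
        optimizer.step()
```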

Beyond Vanilla U-Net
U-Net is the starting point; the family has grown and branched:

- nnU-Net (2021) — a self-configuring U-Net pipeline that inspects the dataset and picks spacing, patch size, normalization, and architecture hyperparameters automatically. Wins medical segmentation benchmarks year after year with almost no tuning.
- U-Net++, Attention U-Net — refinements to the skip connections (nested or attention-gated) that squeeze a few points of Dice out of hard datasets.
- Swin UNETR, UNETR — replace the convolutional encoder with a vision transformer; stronger on large or complex volumes at the cost of more compute.
- DeepLab v3+ (2018) — atrous (dilated) convolutions for large receptive fields without downsampling; strong on outdoor scenes.
- Mask R-CNN (2017) — adds a per-ROI mask head to a detector; the standard move when you need instance segmentation with class labels.
- Mask2Former, OneFormer (2022) — transformer decoders with mask queries that unify semantic / instance / panoptic in a single model.
Segmentation Foundation Models: A Note on Recent Work
Recent work has reframed segmentation as “prompting for pixels”. Segment Anything (SAM, Meta 2023) is a large pretrained encoder plus a prompt-aware mask decoder: click a point or draw a box, get a zero-shot mask with no fine-tuning. SAM 2 (2024) extends the idea to video with memory attention. Grounded-SAM chains an open-vocabulary detector (Grounding DINO) into SAM so you can “segment everything labeled ‘hard hat’” from a text prompt. These models don’t replace U-Net — they complement it. SAM shines for interactive tools and for segmenting objects a custom model was never trained on; U-Net (and descendants) still dominate when you have a labeled dataset and a specific target domain. The rule of thumb: if you have labels, train a U-Net; if you don’t, prompt SAM.
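For flavor, point-prompting with the reference segment-anything package looks roughly like the snippet below (the checkpoint path, the blank image, and the click coordinates are placeholders; treat it as a sketch of the API shape, not a recipe):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM checkpoint (weights are downloaded separately from the SAM repo).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for a real RGB (H, W, 3) image
predictor.set_image(image)

# One positive click at pixel (x=320, y=240) -> zero-shot masks, no fine-tuning.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),                   # 1 = foreground click, 0 = background
    multimask_output=True,                        # SAM proposes several candidate masks
)
print(masks.shape, scores)                        # (3, 480, 640) boolean masks + confidences
```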

When To Use What

| Situation | Recommendation |
|---|---|
| Specialized domain with labeled data (medical, industrial QA, satellite) | U-Net / nnU-Net / Swin UNETR |
| Instance-level masks with class labels | Mask R-CNN or Mask2Former |
| Panoptic scene understanding | Mask2Former / OneFormer |
| “Segment the thing I clicked” in an interactive tool | SAM |
| “Segment everything matching this text” with no fine-tuning | Grounded-SAM (Grounding DINO → SAM) |
| Video object segmentation / rotoscoping | SAM 2 |