Video: Time as an Extra Axis
A video adds a temporal axis on top of the usual spatial ones: a clip is a 4D tensor (num_frames, height, width, channels) rather than a single (height, width, channels) image.
Frame-by-Frame
The simplest option: run your 2D model on every frame independently, optionally smoothing the outputs with a temporal filter or tracker. Works astonishingly well for many tasks (detection, classification, segmentation) because modern 2D models are strong and video frames are highly redundant. This is the default you should try first.
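A minimal sketch of that recipe, assuming a per-frame image classifier that returns logits (the model here is a placeholder) and using an exponential moving average as the temporal filter:

```python
import torch

def smooth_video_logits(model, frames, alpha=0.8):
    """Run a 2D model frame-by-frame and smooth the logits with an
    exponential moving average (EMA). `model` is assumed to map a
    (1, C, H, W) batch to (1, num_classes) logits."""
    smoothed = None
    outputs = []
    with torch.no_grad():
        for frame in frames:  # frames: iterable of (C, H, W) tensors
            logits = model(frame.unsqueeze(0)).squeeze(0)
            # EMA: keep most of the previous estimate, blend in the new frame
            smoothed = logits if smoothed is None else alpha * smoothed + (1 - alpha) * logits
            outputs.append(smoothed)
    return torch.stack(outputs)  # (num_frames, num_classes)
```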
3D Convolutions
Replace Conv2D(k, k) with Conv3D(t, k, k). The filter now slides across time as well as space and can learn motion features directly. Classic models: C3D, I3D (inflated 3D — take a 2D ImageNet-pretrained network and “inflate” every 2D filter into 3D). Expensive, but still a competitive baseline.
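The inflation trick is mechanical enough to show directly. A sketch in PyTorch (the helper name is mine, not from any I3D codebase): the 2D kernel is repeated along the new time axis and rescaled so the inflated filter initially behaves like the original 2D one on a static clip.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    """I3D-style inflation: turn a pretrained 2D conv into a 3D conv by
    repeating its kernel along the time axis and dividing by the kernel
    length so activations keep the same scale on a static input."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, k, k) -> (out, in, t, k, k), rescaled by 1/t
        weight3d = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1) / time_kernel
        conv3d.weight.copy_(weight3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```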
Two-Stream
One network sees RGB frames, another sees optical flow (the pixel-level motion between consecutive frames). Their predictions get combined. Popular in early deep video work; still useful when motion is the primary signal (e.g., action recognition).
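A sketch of the combination step, assuming two already-trained classifiers and simple late fusion of softmax scores (rgb_net and flow_net are placeholder names; weighted averaging is one common choice):

```python
import torch

def two_stream_predict(rgb_net, flow_net, rgb_clip, flow_clip, flow_weight=0.5):
    """Late fusion: average the two streams' softmax scores.
    rgb_clip / flow_clip are whatever inputs each network expects."""
    with torch.no_grad():
        rgb_scores = torch.softmax(rgb_net(rgb_clip), dim=-1)
        flow_scores = torch.softmax(flow_net(flow_clip), dim=-1)
    return (1 - flow_weight) * rgb_scores + flow_weight * flow_scores
```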
Temporal Transformers
Cut the video into patches over both space and time (“tubelets”), embed each one, and run a transformer over the resulting tokens. Models: ViViT, TimeSformer, MViT, VideoMAE, Video Swin. These have largely taken over video classification and are the backbone behind modern video foundation models and text-to-video generators like Sora.
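Tubelet embedding itself is tiny: a single strided 3D convolution cuts the clip into tubelets and projects each one. A sketch assuming PyTorch (shapes and names are illustrative, not taken from any particular paper's code):

```python
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    """Split a video into non-overlapping (t, p, p) tubelets and project
    each to an embedding -- one token per tubelet, ViViT-style."""
    def __init__(self, in_channels=3, embed_dim=768, tubelet=(2, 16, 16)):
        super().__init__()
        # A strided Conv3d equals "cut into tubelets, then linear-project"
        self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):  # video: (B, C, T, H, W)
        tokens = self.proj(video)                 # (B, D, T', H', W')
        return tokens.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D) for the transformer

tokens = TubeletEmbed()(torch.randn(1, 3, 16, 224, 224))  # -> (1, 8*14*14, 768)
```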
Tasks You’ll See
| Task | What it outputs |
|---|---|
| Action recognition | A class per clip (“running”, “cooking”) — the “ImageNet” of video |
| Temporal action detection | (class, start_time, end_time) triples across a long video |
| Multi-object tracking (MOT) | Per-object trajectories across frames — detector + association algorithm (SORT, ByteTrack) |
| Video object segmentation (VOS) | A mask for each target across every frame — SAM 2 is the current default |
| Text-to-video generation | A video sampled from a text prompt — Sora, Gen-3, Veo |
Volumetric Data: 3D From the Start
Unlike video, where most of the information is still 2D and time is a bonus, medical volumes are genuinely 3D: anatomy is continuous across slices, and features (tumors, vessels, bones) live in 3D space.
CT (Computed Tomography)
A CT scanner rotates an X-ray source and detector around the patient and reconstructs a stack of cross-sectional slices. Output is typically a 3D array of Hounsfield Units (HU), a calibrated density scale where air is about -1000, water is 0, and dense bone reaches +1000 and beyond. Windowing (clipping an HU range and rescaling it to [0, 1]) is the equivalent of “normalization” for CT — different windows emphasize different tissues (lung window, bone window, soft-tissue window). Your network will see completely different images depending on which window you pick.
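A windowing helper is a few lines. A sketch with numpy (the center/width presets below are common conventions, not values from this text):

```python
import numpy as np

def window_ct(volume_hu, center, width):
    """Clip a CT volume (in Hounsfield Units) to a window, rescale to [0, 1]."""
    lo, hi = center - width / 2, center + width / 2
    return (np.clip(volume_hu, lo, hi) - lo) / (hi - lo)

volume = np.random.uniform(-1000, 2000, size=(40, 512, 512))  # stand-in for a real scan

# Approximate presets; exact values vary by task and institution
lung = window_ct(volume, center=-600, width=1500)
soft_tissue = window_ct(volume, center=40, width=400)
bone = window_ct(volume, center=300, width=1500)
```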
MRI (Magnetic Resonance Imaging)
MRI produces volumes too, but with a twist: the same anatomy is imaged with multiple pulse sequences (T1, T2, FLAIR, DWI), each highlighting different tissue properties. Treat each sequence as a channel: stack the co-registered volumes into a (num_sequences, D, H, W) array, the volumetric analogue of an RGB image.
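A sketch of that stacking, assuming four sequences already co-registered to the same voxel grid (array names and shapes are illustrative; per-sequence normalization matters because intensity scales differ across sequences):

```python
import numpy as np

# Four co-registered MRI sequences, each a (D, H, W) volume (illustrative shapes)
t1, t2, flair, dwi = (np.zeros((155, 240, 240), dtype=np.float32) for _ in range(4))

def zscore(vol):
    """Normalize each sequence independently -- MRI intensities are not calibrated."""
    return (vol - vol.mean()) / (vol.std() + 1e-8)

x = np.stack([zscore(v) for v in (t1, t2, flair, dwi)])  # (4, D, H, W), "RGB" for MRI
```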
DICOM, NIfTI, and the Loader Problem
Medical data formats are their own ecosystem:
- DICOM (.dcm): the clinical standard. Each slice is a separate file with extensive metadata (patient ID, acquisition parameters, orientation). Loading requires sorting slices by SliceLocation and respecting patient orientation.
- NIfTI (.nii / .nii.gz): the research standard. A single file containing the 3D array plus an affine matrix mapping voxel coordinates to physical (world) space.
- Affine / orientation: the affine matrix tells you which axis is superior-inferior, anterior-posterior, left-right in the patient’s frame. Training on mixed orientations without canonicalizing produces bizarre bugs; a loading sketch follows below.
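A minimal NIfTI loading sketch with nibabel; as_closest_canonical is one standard way to canonicalize orientation (the file path is a placeholder):

```python
import nibabel as nib

img = nib.load("scan.nii.gz")
img = nib.as_closest_canonical(img)   # reorient to a canonical (RAS) orientation
volume = img.get_fdata()              # 3D numpy array of voxel values
affine = img.affine                   # 4x4 voxel -> world (mm) coordinate map
spacing = img.header.get_zooms()      # physical voxel size per axis, in mm
print(volume.shape, spacing)
```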
Architectures for 3D Medical Imaging
- 3D U-Net: the direct extension of 2D U-Net with Conv3D — still the workhorse (its basic building block is sketched after this list).
- nnU-Net (2021): a self-configuring 3D U-Net pipeline. It inspects your dataset and picks spacing, patching, and architecture automatically. Wins medical segmentation benchmarks year after year with almost no tuning.
- Swin UNETR, UNETR: transformer-based 3D encoders that trade more compute for better long-range context.
- TotalSegmentator: an nnU-Net-based model that segments 100+ anatomical structures on whole-body CT out of the box.
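To make the 2D-to-3D jump concrete, here is the repeated building block a 3D U-Net is assembled from. A sketch in PyTorch, not nnU-Net's exact configuration:

```python
import torch.nn as nn

class DoubleConv3D(nn.Module):
    """Two Conv3d + norm + activation layers -- the repeated unit in a 3D U-Net.
    The 2D version is identical with Conv2d / InstanceNorm2d."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):  # x: (B, C, D, H, W)
        return self.block(x)
```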
Why You Can’t Just Use 2D Models Slice-by-Slice
You can, and for some tasks (bone fracture detection, large-organ outlines) it works fine. But:
- Small structures (a 3-voxel nodule, a thin vessel) visible across only a few slices get missed or flicker between them.
- 3D context matters: a structure’s cross-section often looks the same as something else in a single slice and is only distinguishable via its 3D shape.
- Memory: 3D U-Nets on full (300, 512, 512) volumes don’t fit on a GPU. You train on patches (e.g., (96, 96, 96)) and stitch predictions at inference — a standard technique with gotchas at patch boundaries; a sliding-window sketch follows below.
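A sketch of that patch-and-stitch inference, assuming a segmentation model that returns per-voxel logits. Overlapping windows with averaging is the simplest mitigation for boundary artifacts; production toolkits often add Gaussian weighting on top:

```python
import numpy as np
import torch

def sliding_window_predict(model, volume, patch=(96, 96, 96), stride=(48, 48, 48)):
    """Patch-based 3D inference: slide a window over the volume, accumulate
    logits and a hit count, then average. 50% overlap softens artifacts at
    patch boundaries. volume: (D, H, W) array not smaller than `patch`;
    model is assumed to map (1, 1, d, h, w) -> (1, C, d, h, w) logits."""
    D, H, W = volume.shape
    num_classes = 2  # assumption for this sketch
    logits = np.zeros((num_classes, D, H, W), dtype=np.float32)
    counts = np.zeros((D, H, W), dtype=np.float32)

    # Window start positions, clamped so the last patch ends at the volume edge
    def starts(size, p, s):
        return sorted({min(i, size - p) for i in range(0, size, s)})

    with torch.no_grad():
        for z in starts(D, patch[0], stride[0]):
            for y in starts(H, patch[1], stride[1]):
                for x in starts(W, patch[2], stride[2]):
                    crop = volume[z:z+patch[0], y:y+patch[1], x:x+patch[2]]
                    inp = torch.from_numpy(crop)[None, None].float()
                    out = model(inp)[0].numpy()  # (C, d, h, w)
                    logits[:, z:z+patch[0], y:y+patch[1], x:x+patch[2]] += out
                    counts[z:z+patch[0], y:y+patch[1], x:x+patch[2]] += 1
    return logits / counts  # averaged logits; argmax over axis 0 gives the mask
```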