Pixels Are Just Numbers
A grayscale image is a grid of brightness values. A 28×28 MNIST digit is literally a 2D matrix of 784 numbers, each between 0 and 255 (or scaled to [0, 1] before feeding a network). A color image adds a channel axis for red, green, and blue, giving shape (H, W, 3) — or (3, H, W), depending on framework convention. Remember this, because it is the single most common source of “my tensor has the wrong shape” errors.
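A minimal sketch of these shapes in NumPy (the random arrays are synthetic stand-ins for real images):

```python
import numpy as np

# Synthetic stand-in for a 28x28 MNIST digit: brightness values in 0..255.
gray = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)
scaled = gray.astype(np.float32) / 255.0   # scale to [0, 1] before feeding a network

# Synthetic color image: height x width x a channel axis for R, G, B.
rgb = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)

print(gray.shape, gray.size)   # (28, 28) 784
print(rgb.shape)               # (480, 640, 3)
```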
| Framework | Preferred layout | Mnemonic |
|---|---|---|
| PyTorch, Caffe, many vision libs | (C, H, W) | “channels first” |
| TensorFlow / Keras, NumPy image I/O, Pillow | (H, W, C) | “channels last” |
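Converting between the two layouts is a single transpose; a minimal NumPy sketch:

```python
import numpy as np

hwc = np.zeros((480, 640, 3), dtype=np.float32)   # channels last: (H, W, C)
chw = np.transpose(hwc, (2, 0, 1))                # -> channels first: (C, H, W)
back = np.transpose(chw, (1, 2, 0))               # -> back to (H, W, C)

print(chw.shape, back.shape)   # (3, 480, 640) (480, 640, 3)
```

In PyTorch the same move is `tensor.permute(2, 0, 1)`; note that `permute` returns a non-contiguous view, so some downstream ops may need `.contiguous()` afterwards.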
Videos Add Time
A video is a sequence of frames, so it adds a temporal axis. The most common layout is (T, C, H, W) in channels-first frameworks — or (T, H, W, C) channels-last — where T is the number of frames. Early video models flattened time into the channel dimension (3 × T channels); modern ones treat time explicitly, either with 3D convolutions (Conv3D), temporal attention, or by splitting the clip into spatio-temporal tubelets of patches, like a ViT. The representation is the same kind of tensor you already know — just one dimension longer.
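A sketch of a video tensor and the permute that many Conv3D stacks expect (sizes are illustrative):

```python
import numpy as np

T, H, W, C = 16, 112, 112, 3                           # 16 RGB frames
video_thwc = np.zeros((T, H, W, C), dtype=np.float32)  # frames first, channels last

# Many Conv3D models instead expect (C, T, H, W) per clip:
clip_cthw = np.transpose(video_thwc, (3, 0, 1, 2))

print(video_thwc.shape, clip_cthw.shape)   # (16, 112, 112, 3) (3, 16, 112, 112)
```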
Volumetric Studies: CT, MRI, and Friends
Medical imaging is where many newcomers trip, because “a scan” is not a single image. A CT or MRI study is a stack of 2D cross-sections through the body, and each one is a separate grayscale image; stacked together they form a volume of shape (D, H, W).

- D (“depth” or “slices”) is the number of cross-sections, typically 30–500 for CT and 100–300 per sequence for MRI.
- MRI is inherently multi-channel: the same anatomy is imaged with different pulse sequences (T1, T2, FLAIR, DWI), each highlighting different tissue properties. You stack them as channels, just like RGB.
- Spacing matters. Unlike natural images, medical volumes carry physical units: each voxel is, say, 0.8 × 0.8 × 3.0 mm. Networks are not scale-aware — if you train on 1 mm-spaced scans and run on 3 mm-spaced scans, you will silently degrade. Resampling to a canonical spacing is a standard preprocessing step.
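Stacking slices into a volume, and sequences into channels, is plain array stacking; a sketch with tiny synthetic sizes (real CT slices are typically 512×512):

```python
import numpy as np

# 40 axial slices, each a small grayscale cross-section (synthetic placeholders).
slices = [np.zeros((64, 64), dtype=np.int16) for _ in range(40)]
volume = np.stack(slices, axis=0)            # (D, H, W)

# Stack MRI sequences as channels, just like RGB (arrays here are placeholders):
t1 = np.zeros(volume.shape, dtype=np.float32)
t2 = np.zeros(volume.shape, dtype=np.float32)
flair = np.zeros(volume.shape, dtype=np.float32)
study = np.stack([t1, t2, flair], axis=0)    # (C, D, H, W)

print(volume.shape, study.shape)   # (40, 64, 64) (3, 40, 64, 64)
```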
A few other spatial data types follow the same shape logic:

- Hyperspectral imagery: (H, W, C) where C can be 100+ wavelength bands (satellite remote sensing).
- Point clouds from LiDAR: (N, 3) or (N, 4) for (x, y, z, intensity) — not a grid, handled by PointNet-style architectures.
- Depth maps: (H, W) with floats — distance in meters per pixel.
The Preprocessing Pipeline
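A minimal sketch of such a pipeline in NumPy, assuming a decoded (H, W, C) uint8 input; the mean/std values are illustrative, and resizing is omitted since it needs an image library:

```python
import numpy as np

def preprocess(img_u8, mean=0.5, std=0.5):
    """uint8 (H, W, C) -> normalized float32 batch of shape (1, C, H, W)."""
    x = img_u8.astype(np.float32) / 255.0   # convert to float, scale to [0, 1]
    x = (x - mean) / std                    # normalize
    x = np.transpose(x, (2, 0, 1))          # (H, W, C) -> (C, H, W)
    return x[None]                          # add a batch axis

batch = preprocess(np.zeros((224, 224, 3), dtype=np.uint8))
print(batch.shape)   # (1, 3, 224, 224)
```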
Almost every image you feed a network goes through the same stages: decode, resize or crop, convert to float, normalize, permute to the framework's layout, and batch.
Inspecting Tensors In Practice
The worked example loads a JPEG, a short video clip, and a NIfTI volume and prints their shapes side by side so you can feel the difference.
Practical Implication
Many “my model outputs nonsense” bugs come from shape mistakes that the framework happily accepts because the numbers still multiply. Always log .shape (and .dtype) at the boundaries of your pipeline — after decode, after normalization, right before the forward pass — rather than trusting the documentation of the library you called three lines above.
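One lightweight way to do this is a pass-through logger; a sketch in NumPy:

```python
import numpy as np

def log_shape(name, x):
    # Print shape and dtype, then return x unchanged, so the call can be
    # dropped into an existing pipeline without restructuring it.
    print(f"{name}: shape={tuple(x.shape)} dtype={x.dtype}")
    return x

img = log_shape("after decode", np.zeros((480, 640, 3), dtype=np.uint8))
img = log_shape("before forward", np.transpose(img, (2, 0, 1)).astype(np.float32))
```

Because the helper returns its input, it works equally well wrapped around intermediate expressions or as standalone lines between pipeline stages.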