The Two Ideas
Local connectivity. Instead of connecting every input pixel to every output neuron, a convolutional layer looks at small patches, typically 3×3 or 5×5. It slides the same small filter across the image and computes a dot product at every position. Why it matters: edges, corners, and textures are local phenomena; modeling them locally is both more natural and more parameter-efficient.

Weight sharing. The filter is the same at every position. If it detects a horizontal edge at the top of the image, it detects the same horizontal edge at the bottom. Translation shifts the output but doesn't require new parameters; this is the single biggest missing piece in the linear baseline.

What a Convolutional Layer Computes
A single filter of size k×k with C_in input channels produces one feature map, a 2D grid of "how strongly this filter fired at every location". A layer typically has many filters (say, 32 or 64), so the output is a stack of feature maps: a tensor of shape C_out × H_out × W_out.
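The sliding dot product can be written out directly. A minimal NumPy sketch for one filter and one input channel (stride 1, no padding; all names are illustrative):

```python
import numpy as np

def conv2d_single(image, kernel):
    """Valid-mode 2D convolution (really cross-correlation, as in deep
    learning): slide one k×k kernel over the image and take a dot
    product at every position. The same weights are reused everywhere."""
    k = kernel.shape[0]
    H, W = image.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            patch = image[i:i + k, j:j + k]     # local k×k patch
            out[i, j] = np.sum(patch * kernel)  # shared weights
    return out

# A vertical-edge kernel fires on the boundary of a bright region.
img = np.zeros((6, 6))
img[:, 3:] = 1.0
edge = np.array([[-1.0, 0.0, 1.0]] * 3)  # 3×3 vertical-edge detector
fmap = conv2d_single(img, edge)
print(fmap.shape)  # (4, 4)
```

The feature map is nonzero exactly where the dark-to-bright transition sits, and zero in the flat regions, which is the "how strongly this filter fired at every location" grid described above.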
Pooling and Strides
Between convolutions, networks typically downsample to shrink spatial resolution and enlarge the receptive field:

- Max pooling: take the largest value in each 2×2 window. Cheap, effective, discards precise location.
- Strided convolutions: run the filter every 2 pixels instead of every 1. Learnable downsampling.
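Max pooling is simple enough to write in a few lines. A NumPy sketch of a 2×2 pool with stride 2 (assumes even spatial dimensions; names are illustrative):

```python
import numpy as np

def max_pool_2x2(x):
    """2×2 max pooling with stride 2: keep the largest value in each
    non-overlapping 2×2 window, halving height and width."""
    H, W = x.shape
    # Reshape so each 2×2 window gets its own pair of axes,
    # then take the max over those axes.
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(x))
# [[ 5.  7.]
#  [13. 15.]]
```

Note that the exact position of the max inside each window is thrown away; only "something fired strongly around here" survives, which is the point.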
The standard pattern stacks conv → non-linearity → (conv → non-linearity) → pool blocks, halving the spatial size and doubling the channel count each stage, trading spatial resolution for semantic richness.
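The resolution-for-channels trade can be tracked with simple shape arithmetic. A sketch (the 28×28 input and starting channel count are illustrative, not from the text):

```python
# Track (channels, height, width) through three conv→pool stages that
# each halve the spatial size and double the channel count.
shape = (8, 28, 28)  # illustrative starting point
for stage in range(3):
    c, h, w = shape
    shape = (c * 2, h // 2, w // 2)
    print(f"after stage {stage + 1}: {shape}")
# after stage 1: (16, 14, 14)
# after stage 2: (32, 7, 7)
# after stage 3: (64, 3, 3)
```

The tensor ends up small in space but deep in channels: each of the 3×3 remaining positions summarizes a large patch of the original image.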
A Small CNN for MNIST
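A textbook small CNN for MNIST usually means something like: two 3×3 convolutions (1→32→64 channels), a 2×2 max pool, then two fully connected layers. The exact layer sizes here are my assumption, mirroring common PyTorch MNIST examples. A sketch that just counts the parameters of that layout:

```python
# Plausible small MNIST CNN (layer sizes are assumptions):
#   conv 3×3, 1→32 ch → conv 3×3, 32→64 ch → 2×2 max pool
#   → FC 9216→128 → FC 128→10
def conv_params(c_in, c_out, k):
    """Weights (k*k*c_in per filter) plus one bias per filter."""
    return (k * k * c_in + 1) * c_out

def fc_params(n_in, n_out):
    """Weight matrix plus one bias per output unit."""
    return (n_in + 1) * n_out

layers = [
    ("conv1", conv_params(1, 32, 3)),       # 28×28×1 → 26×26×32
    ("conv2", conv_params(32, 64, 3)),      # 26×26×32 → 24×24×64
    ("fc1", fc_params(12 * 12 * 64, 128)),  # after 2×2 pool: 12×12×64
    ("fc2", fc_params(128, 10)),            # 10 digit classes
]
total = sum(p for _, p in layers)
for name, p in layers:
    print(f"{name}: {p:,}")
print(f"total: {total:,}")
```

Notice where the parameters live: the two conv layers together hold under 19k parameters thanks to weight sharing, while the first fully connected layer holds over a million.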
A textbook architecture comfortably gets >99% test accuracy on MNIST.

How to Read a CNN Accuracy Report
On MNIST the numbers are close to saturated, so differences look small but are meaningful:

| Model | Test error | Reduction vs previous |
|---|---|---|
| Logistic regression (linear) | ~8.0% | — |
| 2-layer MLP | ~2.0% | 4× fewer errors |
| Small CNN (above) | ~0.8% | 2.5× fewer errors |
| Modern CNN with augmentation | ~0.2% | 4× fewer errors |
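The "reduction" column is a ratio of error rates, which is a more honest reading than accuracy deltas near saturation. Checking the table's own numbers:

```python
# Error-rate ratios from the table above: going from 8% to 2% error
# removes 4× of the mistakes, far more than the raw 6-point
# accuracy gain suggests.
errors = [8.0, 2.0, 0.8, 0.2]  # % test error, from the table
for prev, cur in zip(errors, errors[1:]):
    print(f"{prev}% → {cur}%: {prev / cur:.1f}× fewer errors")
# 8.0% → 2.0%: 4.0× fewer errors
# 2.0% → 0.8%: 2.5× fewer errors
# 0.8% → 0.2%: 4.0× fewer errors
```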
Beyond MNIST: What CNNs Look Like at Scale
The same ideas scaled up give you the architectures behind every classical vision benchmark:

- LeNet-5 (1998): the original digit CNN — 2 conv + 2 FC layers.
- AlexNet (2012): the ImageNet-winning network that kicked off the deep-learning era.
- VGG / Inception / ResNet (2014-15): progressively deeper nets, with ResNet introducing skip connections to train 100+ layer networks stably.
- EfficientNet / ConvNeXt: modern CNNs tuned to match or beat transformers with better compute efficiency.