The Problem: 10-Way Digit Classification
MNIST is 70,000 grayscale images of handwritten digits 0–9, each 28×28 pixels, split into 60K training and 10K test. It’s small enough to train in seconds and famous enough that every competing model has been measured on it — the perfect first dataset. The task is to learn a function f(x) = ŷ that maps the 784 pixel intensities to 10 class probabilities.
The Linear Classifier
The simplest choice is a single linear layer followed by softmax:

ŷ = softmax(Wx + b)

Each row of W is a template for one digit class: the pixels where that template is bright tell the model “this is what a 3 looks like on average”. Training adjusts those templates to maximize agreement with the labels.
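A minimal NumPy sketch of this forward pass (the variable names and the tiny random batch are illustrative, not real MNIST data):

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def linear_classifier(x, W, b):
    # x: (batch, 784) pixel intensities, W: (10, 784), b: (10,)
    return softmax(x @ W.T + b)

# Random weights just to check shapes: 2 images in, 2 probability vectors out.
rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, size=(10, 784))
b = np.zeros(10)
x = rng.random((2, 784))
probs = linear_classifier(x, W, b)
print(probs.shape)        # (2, 10)
print(probs.sum(axis=1))  # each row sums to 1
```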
Loss: cross-entropy between ŷ and the one-hot true label y:

L(ŷ, y) = −Σᵢ yᵢ log ŷᵢ
Minimizing cross-entropy is equivalent to maximizing the probability the model assigns to the correct class.
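In code, the one-hot vector zeroes out every term except the true class, so the loss collapses to the negative log of the probability assigned to that class (the 3-class numbers below are made up for illustration):

```python
import numpy as np

def cross_entropy(y_hat, y_onehot):
    # Only the true-class term survives the sum when y is one-hot.
    return -np.sum(y_onehot * np.log(y_hat))

y_hat = np.array([0.1, 0.7, 0.2])  # model's predicted class probabilities
y = np.array([0.0, 1.0, 0.0])      # true class is index 1
loss = cross_entropy(y_hat, y)
print(loss)  # -log(0.7) ≈ 0.357
```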
Training, Step by Step
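A hedged sketch of what one training step might look like in NumPy: plain mini-batch SGD on the softmax cross-entropy loss, using the standard identity that the gradient of that loss with respect to the logits is ŷ − y. The learning rate, step count, and random toy batch here are illustrative, not tuned values from the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sgd_step(W, b, x, y_onehot, lr=0.005):
    """One mini-batch SGD step. x: (batch, 784), y_onehot: (batch, 10)."""
    y_hat = softmax(x @ W.T + b)
    err = y_hat - y_onehot           # dLoss/dlogits for softmax + cross-entropy
    grad_W = err.T @ x / len(x)
    grad_b = err.mean(axis=0)
    return W - lr * grad_W, b - lr * grad_b

# Toy check on random data: repeated steps on one fixed batch should
# drive the average cross-entropy below its starting value of ln(10).
rng = np.random.default_rng(0)
x = rng.random((32, 784))
labels = rng.integers(0, 10, size=32)
y = np.eye(10)[labels]
W, b = np.zeros((10, 784)), np.zeros(10)
for _ in range(100):
    W, b = sgd_step(W, b, x, y)
loss = -np.mean(np.log(softmax(x @ W.T + b)[np.arange(32), labels]))
```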
What the Templates Look Like
A nice side-effect of the linear model is that each row of W can be reshaped back to 28×28 and visualized. You see ghostly averages of each digit — a round blob for 0, a vertical stroke for 1, a loopy S for 8. The classifier is literally computing “how much does this input look like the average 3?” for every class in parallel, then picking the winner.
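A minimal sketch of that reshape, assuming a trained weight matrix `W` of shape (10, 784). The random stand-in weights would need to be replaced by real trained ones, and the matplotlib rendering is left commented out so the snippet stays self-contained:

```python
import numpy as np

# Stand-in for trained weights; each row is one digit class's template.
W = np.random.default_rng(0).normal(size=(10, 784))

templates = W.reshape(10, 28, 28)
print(templates.shape)  # (10, 28, 28): one 28x28 image per class

# With matplotlib available, each template can be shown as an image:
# import matplotlib.pyplot as plt
# fig, axes = plt.subplots(1, 10, figsize=(15, 2))
# for digit, ax in enumerate(axes):
#     ax.imshow(templates[digit], cmap="gray")
#     ax.set_title(str(digit)); ax.axis("off")
# plt.show()
```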
Where It Breaks
92% is an honest ceiling for something this simple. The failure modes are instructive because every image task has a more sophisticated version of them:

- Translation sensitivity. Slide a 7 two pixels to the right, and every entry in x changes. The model has to re-learn each digit at every possible position, with no parameter sharing. Tiny shifts at test time break it.
- No notion of locality. Pixel (3, 3) and pixel (3, 4) are neighbors on the image, but to the classifier they are just two of 784 independent inputs. It cannot exploit the fact that nearby pixels form edges and strokes.
- One template per class. A 4 can be written “closed” or “open”, a 7 with or without a cross-bar. A single weight vector per class cannot represent multi-modal style.
- Parameters grow with resolution. Doubling the image size quadruples the input dimension. A 224×224 RGB ImageNet image has 150,528 inputs, so a linear classifier over ImageNet’s 1,000 classes needs over 150 million parameters. The linear model doesn’t scale.
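The translation-sensitivity point can be seen concretely: shift a synthetic stroke sideways with `np.roll` and count how many flattened input entries change (a minimal sketch using a fake image, not MNIST data):

```python
import numpy as np

img = np.zeros((28, 28))
img[8:20, 10:14] = 1.0             # a crude vertical stroke, like a "1"

shifted = np.roll(img, 2, axis=1)  # slide the stroke two pixels right

# Flattened, the two inputs differ in many coordinates even though a
# human sees "the same digit, slightly moved".
changed = int(np.sum(img.flatten() != shifted.flatten()))
print(changed)  # 48 of the 784 entries changed
```

To the linear classifier those 48 coordinates are 48 unrelated features, so the dot product with every class template shifts arbitrarily.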
Practical Implication
The linear classifier is a useful smoke test, not a production model. If your deep network can’t beat a well-tuned linear baseline, something is wrong with your data pipeline — not your architecture. Always run the baseline first.

❌ Antipattern
Jumping straight to a 50M-parameter CNN for a 10-class problem with 500 training images. You’ll overfit instantly and have no reference for what “reasonable” accuracy looks like.

✅ Best Practice
Train a linear baseline, then a small CNN, then scale. Each step should measurably improve the test metric — if it doesn’t, investigate before making the model bigger.

A Note on Regression
Classification isn’t the only supervised vision task. Sometimes the target is a continuous number rather than a category — and the pipeline barely changes. Replace the softmax layer with a single linear output, swap cross-entropy for mean squared error, and the same model becomes a regressor that predicts a scalar from an image. Classic examples:

- Bone age estimation from a hand X-ray. A pediatric radiologist normally estimates skeletal maturity in months by comparing the scan to a reference atlas. The model takes the image in and outputs one number. Same CNN, same training loop — only the loss and the final layer change.
- Image-quality score prediction. Given a photo, predict a 0–10 human-rated quality score.
- Keypoint / pose regression. Predict (x, y) coordinates of landmarks (face keypoints, joints, anatomical markers) instead of a class label.
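Sketching that swap on the same flattened 784-pixel input: a single weight vector replaces the 10×784 matrix, the softmax disappears, and the loss becomes mean squared error. The weights and targets below are made up for illustration.

```python
import numpy as np

def regressor(x, w, b):
    # Same linear map as the classifier, but one scalar output, no softmax.
    return x @ w + b

def mse(pred, target):
    return np.mean((pred - target) ** 2)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.01, size=784)   # one weight vector instead of ten
b = 0.0
x = rng.random((4, 784))                         # four "images"
target = np.array([120.0, 96.0, 150.0, 132.0])   # e.g. bone age in months
pred = regressor(x, w, b)
loss = mse(pred, target)
print(pred.shape)  # (4,): one scalar per image
```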