What a Neural Network Actually Is
A neural network is a stack of simple mathematical operations (mostly matrix multiplications and non-linear squashes) that together define a single, very long function f(x; θ). You feed in x (an image, a sentence, a CT slice), out pops a prediction, and θ is the millions (or billions) of numbers that control how the transformation happens. Each layer repeats the same three steps:
- Take the previous layer’s output.
- Multiply by a learned weight matrix and add a learned bias.
- Apply a non-linear function (ReLU, GELU, sigmoid, …).
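To make that concrete, here is a minimal sketch of one such layer, assuming PyTorch; the batch size and dimensions are made up for illustration:

```python
import torch

x = torch.randn(32, 784)           # previous layer's output: a batch of 32 vectors
W = torch.randn(784, 256) * 0.01   # learned weight matrix (random here, learned in practice)
b = torch.zeros(256)               # learned bias

h = torch.relu(x @ W + b)          # multiply, add bias, apply the non-linearity
print(h.shape)                     # torch.Size([32, 256])
```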
How Learning Happens: Loss and Gradient Descent
Training is a feedback loop:
- Forward pass: run x through the network to produce a prediction ŷ.
- Loss: measure how wrong ŷ is compared to the ground truth y using a task-specific function (cross-entropy for classification, MSE for regression, IoU-based losses for detection, pixel-wise cross-entropy for segmentation).
- Backward pass (backpropagation): compute how each parameter should change to reduce the loss, i.e. the gradient ∂L/∂θ.
- Update: nudge every parameter slightly against its gradient. Repeat on another batch.
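The whole loop fits in a few lines of code. This is only a sketch, assuming PyTorch, with a placeholder model, random stand-in data, and arbitrary hyperparameters:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()           # task-specific loss (classification here)

for step in range(100):                   # each iteration sees a fresh batch
    x = torch.randn(32, 784)              # stand-in for a real data batch
    y = torch.randint(0, 10, (32,))       # stand-in for ground-truth labels

    y_hat = model(x)                      # forward pass
    loss = loss_fn(y_hat, y)              # how wrong is ŷ compared to y?

    optimizer.zero_grad()
    loss.backward()                       # backward pass: compute ∂L/∂θ
    optimizer.step()                      # nudge every parameter against its gradient
```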
Why “deep”?
Empirically, depth is how networks learn hierarchical representations. Early layers learn low-level patterns (edges, gradients), middle layers learn parts (eyes, wheels), late layers learn objects and scenes. In practice you don't get this hierarchy from a shallow, wide network, even one with the same number of parameters. Depth + non-linearity + enough data + gradient descent is the whole recipe.
Connection to the LLMs You Already Know
This is where the prompting and RAG modules come back. The models behind GPT, Claude, and Gemini are transformers: a specific deep-learning architecture where the main operation is self-attention instead of convolution. But the plumbing is identical:

| Piece | LLM (transformer on text) | Vision network |
|---|---|---|
| Input tensor | (N, T, d) — batch × tokens × embedding | (N, C, H, W) — batch × channels × height × width |
| Core operation | Self-attention over tokens | Convolution over pixels, or attention over patches (ViT) |
| Non-linearity | GELU | ReLU / GELU |
| Training signal | Predict the next token → cross-entropy | Predict the label / mask / box → task-specific loss |
| Optimization | Adam/AdamW + gradient descent | Same |
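A quick sketch of those two input layouts and core operations, assuming PyTorch; the shapes (8 examples, 128 tokens, 512-dim embeddings, 224×224 images) are arbitrary:

```python
import torch
import torch.nn as nn

# Text: (N, T, d) = batch × tokens × embedding
tokens = torch.randn(8, 128, 512)
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
text_out, _ = attn(tokens, tokens, tokens)     # self-attention over tokens
print(text_out.shape)                          # torch.Size([8, 128, 512])

# Vision: (N, C, H, W) = batch × channels × height × width
images = torch.randn(8, 3, 224, 224)
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
vision_out = torch.relu(conv(images))          # convolution over pixels + ReLU
print(vision_out.shape)                        # torch.Size([8, 64, 224, 224])
```

The same shared machinery shows up all over modern vision: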
- ViT (Vision Transformer) chops an image into 16×16 patches, flattens each into a token, and runs the same transformer you'd use for text (see the sketch after this list). Pretrained on a large enough dataset, it matches or outperforms comparable CNNs.
- Multimodal LLMs (GPT-4o, Claude, Gemini) use a vision encoder — often a ViT — to turn images into tokens that sit alongside text tokens in the context window.
- SAM is a transformer that takes an image + a prompt (point, box, or mask) and outputs a segmentation mask — basically “prompting for pixels”.
- Stable Diffusion generates images conditioned on a text embedding computed by a CLIP text encoder — which is itself a transformer.
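Here is a rough sketch of the ViT patching step from the first bullet above, assuming PyTorch; a real ViT also adds a class token, positional embeddings, and a full stack of transformer blocks:

```python
import torch
import torch.nn as nn

patch, dim = 16, 768
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # one embedding per 16×16 tile

images = torch.randn(2, 3, 224, 224)          # (N, C, H, W)
x = to_patches(images)                        # (2, 768, 14, 14)
tokens = x.flatten(2).transpose(1, 2)         # (2, 196, 768): batch × tokens × embedding

encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
out = encoder_layer(tokens)                   # the same transformer machinery used for text
print(out.shape)                              # torch.Size([2, 196, 768])
```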
Why Vision Is Usually Harder Than Text
Sharing the machinery doesn't mean the problems are equally easy. A mid-size LLM can write a coherent essay; a similarly sized CNN can still struggle to reliably tell one dog breed from another. The difference isn't the optimizer or the layer count; it's that vision problems carry more ambiguity at every level, from the raw input up to the label. Four reasons this shows up in practice:
1. Text is symbols; images are raw signal. The word cat is always the word cat: a discrete token with a pre-compiled meaning the model can look up. A cat in an image is a pattern of light and color that could be any size, pose, color, breed, half-covered by a couch, or photographed in dim light. The model has to recover the concept of "cat" from pixels rather than index into a dictionary.
2. Text has almost no invariances; vision has many. A sentence doesn’t get rotated, partially occluded, lit from a weird angle, or zoomed in. An image of a car must be recognized whether it’s big or small, upside down, half-hidden behind a pole, or in deep shadow — and the network has to learn that none of those transformations change the answer. Every one of these invariances is another axis of variation the model has to generalize across.
3. Ground truth is fuzzier. “Translate this sentence” has roughly one right answer; “segment the tumor” has two expert radiologists drawing meaningfully different boundaries on the same scan. Labels in vision are intrinsically noisier, and the model trains against that noise. This is also why evaluation is harder: you can’t always compute a clean “is this correct?” — sometimes you can only measure agreement with imperfect human labels.
4. The input is massive. A short paragraph is a few hundred tokens at modest embedding dimension. A single 224×224 RGB image is already over 150,000 raw numbers, and a CT volume is tens of millions. The model has to extract semantic structure from a much bigger, less-curated signal — and most of those numbers are redundant or uninformative.
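A quick back-of-the-envelope check of those sizes; the CT dimensions below (512×512, 200 slices) are an illustrative assumption:

```python
tokens_in_paragraph = 300            # a short paragraph: a few hundred discrete tokens
image_numbers = 224 * 224 * 3        # one small RGB image -> 150,528 raw numbers
ct_numbers = 512 * 512 * 200         # one CT volume       -> 52,428,800 raw numbers

print(tokens_in_paragraph, image_numbers, ct_numbers)
```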
None of this means vision is impossible; it means good data curation, clear task definition, and the architectural priors we cover on the next pages (locality, weight sharing, translation equivariance) matter even more here than they do in text.