Module Overview
You’ve probably noticed: vision tasks look deceptively similar (“classify this image”, “find that object”, “generate a cat”), but the models, data pipelines, and failure modes behind them are wildly different. A network that excels at classification is useless for detection, and a detector tells you nothing about pixel-perfect boundaries. Here’s why: each vision problem imposes a different structure on the output. Pixels go in the same way, but what comes out (a label, a box, a mask, or a new image) reshapes the entire architecture. Underneath, though, every modern vision system relies on the same idea: images are tensors, and a deep neural network learns to transform them.

In this module you’ll learn how digital images, videos, and volumetric scans (CT, MRI) are represented as tensors, build intuition for the deep-learning machinery that also powers the LLMs from the earlier modules, and work through the canonical vision problems (classification, regression, detection, segmentation, generation) with hands-on examples on MNIST and pointers to production-scale systems.

Learning Objectives
By the end of this module, you will be able to:

- ✅ Represent images, videos, and volumetric studies (CT, MRI) as tensors with the correct dimensionality
- ✅ Explain deep learning in plain terms and connect it back to the transformer models from earlier modules
- ✅ Train a linear classifier on MNIST and diagnose where it breaks
- ✅ Explain why convolutions beat fully-connected layers on images and train a CNN on MNIST
- ✅ Reason about object detection building blocks: anchors, IoU, NMS, and common benchmarks
- ✅ Use Segment Anything (SAM) for interactive and automatic segmentation
- ✅ Compare GANs and diffusion models for image generation and their production trade-offs
- ✅ Extend 2D vision techniques to video and 3D volumetric data
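The first objective is concrete enough to sketch right away. The shapes below follow one common convention (height-by-width spatial axes, frames or slices as the leading axis); the specific sizes are illustrative, not taken from any particular dataset:

```python
import numpy as np

# A grayscale image: height x width
img = np.zeros((28, 28), dtype=np.uint8)

# An RGB image: height x width x channels
rgb = np.zeros((480, 640, 3), dtype=np.uint8)

# A video clip: frames x height x width x channels
clip = np.zeros((16, 480, 640, 3), dtype=np.uint8)

# A CT/MRI volume: slices x height x width (one intensity channel implied)
volume = np.zeros((120, 512, 512), dtype=np.int16)

for name, arr in [("image", img), ("rgb", rgb), ("clip", clip), ("volume", volume)]:
    print(f"{name}: shape={arr.shape}, ndim={arr.ndim}, dtype={arr.dtype}")
```

Note how video and volumetric data differ only in interpretation: both add one axis to the image tensor, but a video axis is time while a volume axis is space.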
Why This Matters
Computer vision is the other half of the production AI stack. Even if your product is text-first, vision is creeping in through OCR, screenshot understanding, multimodal agents, and document pipelines.

- Vision is production-critical: radiology, autonomous driving, retail analytics, manufacturing QA, and content moderation all run on the stack covered in this module
- The same architecture under many hoods: transformers now dominate vision too — ViT, DETR, SAM, Stable Diffusion cross-attention — so everything you learned about attention in the LLM modules applies here
- Data representation determines everything downstream: a CT volume loaded as `(H, W, slices)` vs `(slices, H, W)` silently breaks networks; choosing the right tensor layout is half the job
- Benchmark literacy separates practitioners from demo-builders: knowing what “42 mAP on COCO” or “94% top-1 on ImageNet” actually means tells you which model to pick, and which numbers to distrust
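The layout bug is easy to demonstrate. A minimal sketch, assuming a loader that returns `(H, W, slices)` while the network expects `(slices, H, W)`; the `looks_depth_first` heuristic is a hypothetical sanity check, not a library function:

```python
import numpy as np

# Suppose a loader returned a CT volume as (H, W, slices), a channels-last
# style convention some medical imaging libraries use.
vol_hws = np.random.rand(512, 512, 120).astype(np.float32)

# Depth-first networks expect (slices, H, W); np.transpose reorders axes.
vol_shw = np.transpose(vol_hws, (2, 0, 1))
assert vol_shw.shape == (120, 512, 512)

# A cheap sanity check before training: slice counts are usually much
# smaller than in-plane resolution, so the smallest axis should come first.
def looks_depth_first(shape):
    return shape[0] == min(shape)

print(looks_depth_first(vol_hws.shape))  # False: layout is wrong
print(looks_depth_first(vol_shw.shape))  # True
```

The failure is silent because both layouts have the same number of elements; nothing crashes, the network just learns from scrambled anatomy.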
What You’ll Build
- Tensor explorer — load images, videos, and volumetric studies and inspect their shapes end-to-end
- MNIST linear classifier — softmax regression trained from scratch, with baseline metrics
- MNIST CNN — a small convolutional network that shows the accuracy jump over the linear baseline
- Detection walkthrough — annotated IoU / NMS examples and a tour of one-stage vs two-stage detectors
- SAM playground — click-to-segment and automatic mask generation
- Diffusion vs GAN demo — generate images with each approach and compare fidelity and controllability
- 3D viewer — open a CT/MRI volume, scroll through slices, and extract a 2D window the network can consume
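As a preview of the detection walkthrough, here is a minimal NumPy sketch of the two building blocks it covers, IoU and greedy NMS (the box format `(x1, y1, x2, y2)` and the 0.5 threshold are illustrative choices, not fixed by the module):

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < iou_thresh])
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # the two overlapping boxes collapse to one survivor
```

The first two boxes overlap heavily (IoU ≈ 0.68), so NMS keeps only the higher-scoring one; the distant third box survives untouched.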
Code examples in this module are placeholders while the TypeScript/Python companion repository catches up. Each `CodeEditor` block marks the file path it will point to once the implementation lands.
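In the meantime, a minimal sketch of the linear-classifier baseline: softmax regression trained with full-batch gradient descent. Synthetic data stands in for the real MNIST loader here so the sketch runs anywhere; the shapes (784 = 28×28 flattened pixels, 10 digit classes) match MNIST:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: the real notebook would load MNIST (60k 28x28 images).
X = rng.standard_normal((256, 784)).astype(np.float32)  # flattened 28x28
y = rng.integers(0, 10, size=256)                       # digit labels 0-9

W = np.zeros((784, 10), dtype=np.float32)
b = np.zeros(10, dtype=np.float32)
lr = 0.1

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for step in range(200):
    probs = softmax(X @ W + b)
    # Cross-entropy gradient for softmax regression: (probs - one_hot) / N
    grad = probs.copy()
    grad[np.arange(len(y)), y] -= 1.0
    grad /= len(y)
    W -= lr * (X.T @ grad)
    b -= lr * grad.sum(axis=0)

acc = (softmax(X @ W + b).argmax(axis=1) == y).mean()
print(f"train accuracy: {acc:.2f}")  # well above the 10% chance level
```

On real MNIST this baseline reaches roughly 92% test accuracy, which is the number the CNN section then improves on.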