(class, box, score) triples. The machinery needed to make that work introduces a handful of building blocks you will see in every detector ever shipped.
The Core Building Blocks
Bounding Boxes
A detection is a rectangle plus a class. The two common formats:
- Corner format (x1, y1, x2, y2): top-left and bottom-right corners.
- Center format (cx, cy, w, h): box center, width, and height.
Coordinates can be normalized (scaled to [0, 1]) or absolute pixel coordinates — always double-check which one your framework returns.
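Two tiny helpers make the distinction concrete (the function names are illustrative, not from any particular library):

```python
def xyxy_to_cxcywh(box):
    """Convert a corner-format box (x1, y1, x2, y2) to center format (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

def to_normalized(box, img_w, img_h):
    """Scale absolute pixel coordinates (x1, y1, x2, y2) into the [0, 1] range."""
    x1, y1, x2, y2 = box
    return (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)

print(xyxy_to_cxcywh((10, 20, 50, 80)))           # (30.0, 50.0, 40, 60)
print(to_normalized((10, 20, 50, 80), 100, 100))  # (0.1, 0.2, 0.5, 0.8)
```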
Intersection over Union (IoU)
IoU measures how much two boxes overlap: the area of their intersection divided by the area of their union, ranging from 0 (no overlap) to 1 (identical boxes). It shows up in two places:
- Matching predictions to ground truth during evaluation: a prediction “counts” as correct if it has IoU ≥ some threshold (0.5 is classic, 0.5:0.95 is the COCO standard) with a ground-truth box of the right class.
- Suppressing duplicates (NMS, below).
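A minimal reference implementation, assuming corner-format (x1, y1, x2, y2) boxes:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```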
Non-Maximum Suppression (NMS)
Detectors produce hundreds or thousands of candidate boxes per image — many overlapping on the same object. NMS keeps only the best one per cluster: sort the boxes by confidence, keep the highest-scoring one, discard every remaining box whose IoU with it exceeds a threshold, and repeat until no boxes are left.
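A greedy sketch of that loop, reusing the `iou()` helper from above (pure Python/NumPy, not tuned for speed):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: boxes is an (N, 4) array of (x1, y1, x2, y2), scores is (N,)."""
    order = np.argsort(scores)[::-1]      # indices sorted by confidence, best first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop every remaining box that overlaps the kept box too much
        overlaps = np.array([iou(boxes[i], boxes[best]) for i in rest])
        order = rest[overlaps < iou_threshold]
    return keep
```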
Anchors
Where does the model “propose” boxes? Early detectors used sliding-window classifiers. Modern detectors pre-define a grid of anchor boxes — reference shapes at every spatial location of the feature map, typically a few scales and aspect ratios per cell. The network then predicts, for each anchor, a class score plus small offsets that shift and resize the anchor onto the object.
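A sketch of how such a grid is generated; the stride, scales, and aspect ratios below are illustrative defaults, not values from any specific paper:

```python
import itertools
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return an (N, 4) array of (cx, cy, w, h) anchors covering a feature map."""
    anchors = []
    for y, x in itertools.product(range(feat_h), range(feat_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # cell centre in image pixels
        for scale, ratio in itertools.product(scales, ratios):
            w, h = scale / np.sqrt(ratio), scale * np.sqrt(ratio)
            anchors.append((cx, cy, w, h))
    return np.array(anchors)

# A 5x5 feature map with stride 32 and 3 scales x 3 ratios -> 225 candidate boxes
print(make_anchors(5, 5, 32).shape)  # (225, 4)
```

During training, each anchor is labelled positive or negative by its IoU with the ground-truth boxes, and the network learns to predict offsets for the positives.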
Benchmarks and Metrics
The main public benchmarks you will see cited:

| Dataset | What it is | Classes | Typical metric |
|---|---|---|---|
| PASCAL VOC 2007/2012 | Classic small-scale detection benchmark | 20 | mAP @ IoU=0.5 |
| COCO | De facto standard; 80 “common objects” | 80 | mAP averaged over IoU 0.5:0.95 |
| LVIS | Long-tail version of COCO with 1,200+ classes | 1,203 | Masks + boxes, long-tail aware mAP |
| Open Images | Google’s large-scale dataset, hierarchical labels | 600 | mAP |
| BDD100K / nuScenes / Waymo Open | Autonomous-driving detection + tracking | varies | mAP, tracking metrics |
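In practice, the COCO-style mAP in the table is almost always computed with `pycocotools`; a hedged sketch, assuming your ground truth and detections are already in COCO JSON format (file names are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations.json")            # ground-truth annotations (COCO format)
coco_dt = coco_gt.loadRes("detections.json")  # your model's detections
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP @ IoU 0.50:0.95, AP50, AP75, AP by object size, ...
```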
The Main Architectures at a Glance
Two-Stage Detectors: “Propose, Then Classify”
- Faster R-CNN (2015): a Region Proposal Network (RPN) suggests candidate boxes, then a second head classifies and refines each one. Accurate but slower (a torchvision inference sketch follows this list).
- Mask R-CNN (2017): Faster R-CNN with an added mask head for instance segmentation (covered in the next page).
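The torchvision sketch referenced above, assuming a reasonably recent torchvision (the weights argument and image path are illustrative):

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Load a COCO-pretrained Faster R-CNN; older torchvision releases use pretrained=True instead.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("street.jpg").convert("RGB"))  # placeholder image path
with torch.no_grad():
    pred = model([image])[0]          # dict with "boxes", "labels", "scores"

keep = pred["scores"] > 0.5           # simple confidence cut-off
print(pred["boxes"][keep], pred["labels"][keep])
```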
One-Stage Detectors: “Predict Everything At Once”
- YOLO family (v1 → v11+): predicts class + box offsets at every anchor location in a single pass. Fastest mainstream detector, ubiquitous in production, steadily improved accuracy over the years.
- SSD (2016), RetinaNet (2017): early influential one-stage models. RetinaNet introduced focal loss to fix the foreground/background imbalance problem that hurt earlier one-stage detectors (a minimal focal-loss sketch follows this list).
- FCOS (2019): anchor-free — predicts boxes directly from every feature-map location without pre-defined anchor shapes.
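The focal-loss sketch mentioned above: a minimal binary version in PyTorch, with alpha and gamma defaults following the RetinaNet paper:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)             # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()       # easy examples are down-weighted

# Most anchors are background (target 0); focal loss shrinks their contribution.
logits = torch.randn(1000)
targets = torch.zeros(1000)
targets[:5] = 1.0                                           # 5 positives, 995 negatives
print(focal_loss(logits, targets))
```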
Transformer-Based Detectors
- DETR (2020): treats detection as set prediction. A transformer takes image features + a set of learned “object queries” and outputs exactly N predictions. No anchors, no NMS (a bipartite matching loss handles duplicates during training; see the toy matching sketch after this list). Elegant; slower to train than anchor-based peers.
- DINO, other DETR variants, Grounding DINO: modern successors with faster convergence; Grounding DINO adds open-vocabulary detection (detect objects specified by free-form text).
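The matching sketch referenced above: DETR pairs each ground-truth object with exactly one query via the Hungarian algorithm. The cost matrix here is random purely for illustration; in DETR it mixes classification and box costs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
cost = rng.random((100, 4))          # rows: 100 object queries, cols: 4 ground-truth objects
query_idx, gt_idx = linear_sum_assignment(cost)   # Hungarian (bipartite) matching
print(list(zip(query_idx, gt_idx)))  # each ground truth is matched to exactly one query
# Unmatched queries are trained to predict "no object", which is why no NMS is needed.
```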
Open-Vocabulary Detectors
- Grounding DINO, GLIP, OWL-ViT: accept a natural-language query and detect matching objects, even for categories unseen at training time. The vision encoder is paired with a text encoder (often CLIP) exactly like the LLM → embedding pipelines from earlier modules. This is the detection analog of “prompting” and it’s rapidly becoming the default for long-tail applications.
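A hedged sketch with OWL-ViT through Hugging Face `transformers` (the checkpoint name is the public one; the image path and text queries are placeholders, and post-processing helpers vary between library versions):

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg").convert("RGB")
queries = [["a traffic cone", "a delivery truck"]]   # free-form text, one list per image

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One score per (box, query) pair plus a box for every prediction; the processor's
# post-processing methods turn these into thresholded (class, box, score) detections.
print(outputs.logits.shape, outputs.pred_boxes.shape)
```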
Picking a Detector in Practice
| Constraint | Sensible default |
|---|---|
| Real-time (30+ FPS) on a single GPU | YOLOv8 / YOLOv11 |
| Best accuracy, latency not critical | DINO / Co-DETR / Cascade Mask R-CNN |
| Edge device (CPU, phone, embedded) | YOLO-Nano, EfficientDet-Lite, MobileDet |
| Open-vocabulary (“detect anything I type”) | Grounding DINO, OWL-ViT |
| Small dataset, quick prototyping | Fine-tune a pretrained YOLO or DETR |
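For the last row, a minimal Ultralytics fine-tuning sketch; the checkpoint name is a real published one, while `my_dataset.yaml` is a placeholder for your own dataset config (image paths plus class names in the Ultralytics YAML format):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                                 # smallest pretrained YOLOv8 checkpoint
model.train(data="my_dataset.yaml", epochs=50, imgsz=640)  # fine-tune on your data

results = model("test_image.jpg")                          # inference after training
print(results[0].boxes.xyxy, results[0].boxes.conf)        # (x1, y1, x2, y2) boxes + scores
```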