The Scenario
You’re building a photo triage tool for a hobbyist photographer’s backup drive. For each photo we want to:
- Classify whether it contains people, pets, or neither.
- Detect and count the subjects if it does.
- Segment the main subject with SAM so the UI can cut it out as a thumbnail.
- Generate a stylized preview (diffusion model) for the hero shot of each album.
Pipeline
Stage 1 — Classifier
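If you end up on the few-shot path (frozen CLIP embeddings plus a linear probe, described below), the trainable part is tiny. Here is a minimal numpy sketch; a least-squares fit stands in for the usual logistic-regression probe, and `embeddings` are assumed to come from a frozen image encoder:

```python
import numpy as np

def fit_linear_probe(embeddings, labels, num_classes=3):
    # Least-squares fit of a linear head over frozen image embeddings
    # (a stand-in for the usual logistic-regression linear probe).
    X = np.asarray(embeddings, dtype=np.float64)
    Y = np.eye(num_classes)[np.asarray(labels)]  # one-hot targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W  # shape: (embedding_dim, num_classes)

def predict(W, embeddings):
    # Predicted class is the argmax over the three linear scores.
    return np.argmax(np.asarray(embeddings) @ W, axis=1)
```

With only a handful of labeled images per class this fits in milliseconds on CPU, which is the whole appeal of the frozen-encoder route.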
Start with a pretrained ResNet-18 or small ViT, replace the final layer with three outputs, and fine-tune on a handful of labeled images from the user’s own library. If you have very few examples, run a frozen CLIP image encoder and a linear probe over its embeddings — no GPU training needed.
Stage 2 — Detection
Use Grounding DINO (open-vocabulary) with the prompt "person . dog . cat". You get a list of (label, box, score) triples with no fine-tuning required. Skip the image if no box scores above a confidence threshold you pick on a small validation set.
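The post-processing for this stage is just confidence filtering and counting; a minimal sketch (the Grounding DINO call itself, e.g. via Hugging Face transformers, is omitted, and the threshold default is illustrative):

```python
from collections import Counter

def filter_detections(detections, threshold=0.35):
    """Keep (label, box, score) triples at or above the threshold, best first.
    `threshold` is the value you tuned on the small validation set."""
    kept = [d for d in detections if d[2] >= threshold]
    return sorted(kept, key=lambda d: d[2], reverse=True)

def count_subjects(detections):
    # Per-label counts, e.g. Counter({'person': 2, 'dog': 1}).
    return Counter(label for label, _box, _score in detections)
```

An empty result from `filter_detections` is the "skip this image" signal for the rest of the pipeline.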
Stage 3 — Segmentation
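In code, the mask-selection and compositing parts of this stage are two small array operations; the SAM predictor call itself (e.g. via the segment-anything package) is assumed and omitted here:

```python
import numpy as np

def best_mask(masks, scores):
    # SAM returns a few candidate masks with predicted-IoU scores; keep the best.
    return masks[int(np.argmax(scores))]

def cut_out(image, mask):
    # Zero the background: broadcast the boolean mask over the RGB channels.
    return image * mask[..., None].astype(image.dtype)
```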
Feed the highest-scoring detection box into SAM as a box prompt. Pick the best of SAM’s three candidate masks by score. Multiply the mask into the image to produce a cut-out with the background removed.
Stage 4 — Generation
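A full diffusion pipeline needs a model download, so here is only the parameter side: a small helper that clamps img2img settings toward composition-preserving values. The function name, defaults, and clamp range are illustrative choices, not a diffusers API:

```python
def preview_params(style_prompt, strength=0.3, steps=4):
    # Low strength keeps the input composition; an LCM-distilled model
    # needs only a few steps and little or no classifier-free guidance.
    strength = min(max(float(strength), 0.05), 0.5)
    return {
        "prompt": style_prompt,
        "strength": strength,
        "num_inference_steps": max(int(steps), 1),
        "guidance_scale": 1.0,  # guidance effectively off for a distilled model
    }
```

Pass these as keyword arguments to your img2img pipeline call alongside the cut-out image.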
Run Stable Diffusion in img2img mode on the cut-out thumbnail with a short style prompt and a low strength value (so composition is preserved). For speed, use a distilled / LCM variant that runs in 4 steps.
Putting It Together
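A sketch of the driver, with the four stages injected as plain callables (all names here are hypothetical) so the loop itself stays testable without any models loaded:

```python
from pathlib import Path

def run_pipeline(in_dir, out_dir, classify, detect, segment, stylize):
    # Each stage is a plain callable:
    #   classify(path) -> "people" | "pets" | "neither"
    #   detect(path)   -> list of (label, box, score); empty list means skip
    #   segment(path, detection) -> cut-out thumbnail
    #   stylize(thumbnail)       -> stylized preview
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for path in sorted(Path(in_dir).glob("*.jpg")):
        label = classify(path)
        if label == "neither":
            continue
        detections = detect(path)
        if not detections:
            continue
        thumb = segment(path, detections[0])  # cut-out of the top detection
        stylize(thumb)                        # stylized hero preview
        rows.append(f"<tr><td>{path.name}</td><td>{label}</td>"
                    f"<td>{len(detections)}</td></tr>")
    html = "<table>\n" + "\n".join(rows) + "\n</table>"
    (out_dir / "index.html").write_text(html)
    return html
```

Swapping a stage (say, a fine-tuned YOLO for Grounding DINO) then touches one callable, not the loop.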
The driver script loops over the input directory, runs all four stages, and writes the results plus a small HTML index.
Extensions
- Video: swap SAM for SAM 2 and propagate the cut-out across all frames of a short clip.
- 3D: load a DICOM series, window the volume to the soft-tissue range, and segment an organ with the pretrained TotalSegmentator model.
- Train instead of prompt: replace Grounding DINO with a YOLO fine-tuned on a small dataset of the user’s own subjects — measure whether accuracy and latency actually improve.