Beyond Text: Images and Long Documents
Once you can extract text from PDFs (covered in Working with PDFs), two harder problems remain: images that contain information your RAG system needs, and documents too long to process in a single pass. This page covers both.
Image Integration in RAG
Not all information lives in text. Product photos, architecture diagrams, charts, and scanned receipts all carry data your RAG system might need. The question is: how do you make image content searchable?
When to Extract Text from Images
When to Caption/Describe Images
Decision Framework: Text vs. Caption
| Image Type | Approach | Reason |
|---|---|---|
| Screenshots of code/logs | OCR text extraction | Text is primary content |
| Charts/graphs | Caption with data | Visual info + specific values |
| Diagrams with labels | Both (OCR + caption) | Labels + structural understanding |
| Photos of products | Caption | Visual features matter |
| Scanned text documents | OCR | Text is the content |
| Infographics | Caption | Mix of visual + text |
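The table above can be expressed as a simple lookup. The following is a minimal sketch, not a prescribed implementation: the `Approach` enum, the image-type labels, and the `choose_approach` function are hypothetical names introduced here for illustration, and falling back to both OCR and captioning for unrecognized image types is an assumption rather than something the framework specifies.

```python
from enum import Enum


class Approach(Enum):
    OCR = "ocr"          # extract the text verbatim
    CAPTION = "caption"  # describe the image in words
    BOTH = "both"        # run OCR and captioning, index both results


# Mirrors the decision table; the string keys are hypothetical labels
# that an upstream image classifier (or a human) would assign.
IMAGE_TYPE_TO_APPROACH = {
    "code_screenshot": Approach.OCR,
    "chart": Approach.CAPTION,
    "labeled_diagram": Approach.BOTH,
    "product_photo": Approach.CAPTION,
    "scanned_document": Approach.OCR,
    "infographic": Approach.CAPTION,
}


def choose_approach(image_type: str) -> Approach:
    """Return the processing approach for an image type.

    Assumption: unknown types fall back to BOTH, trading extra
    processing cost for a lower chance of losing information.
    """
    return IMAGE_TYPE_TO_APPROACH.get(image_type, Approach.BOTH)
```

For example, `choose_approach("chart")` returns `Approach.CAPTION`, while an unlisted label such as `"whiteboard_photo"` falls through to `Approach.BOTH`.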
Handling Long Documents
Long documents (100+ pages) present unique challenges:
- Can’t fit entire document in LLM context window
- Need to aggregate information across sections
- Must maintain document-level context
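One common way to address these three challenges is to split the document into overlapping chunks, prefix each chunk with document-level metadata, and combine per-chunk results in a second pass (a map-reduce pattern). The sketch below illustrates that idea under several assumptions: `Chunk`, `split_into_chunks`, and `map_reduce` are hypothetical names, word counts stand in for real token counts, and `summarize` is a placeholder for whatever LLM call you use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    doc_title: str
    index: int
    text: str

    def with_context(self) -> str:
        # Prepend document-level context so each chunk can stand alone.
        return f"[Document: {self.doc_title} | part {self.index + 1}]\n{self.text}"


def split_into_chunks(doc_title: str, text: str,
                      max_words: int = 200, overlap: int = 20) -> List[Chunk]:
    """Split text into overlapping word windows (words approximate tokens)."""
    if max_words <= overlap:
        raise ValueError("max_words must be greater than overlap")
    words = text.split()
    step = max_words - overlap
    chunks: List[Chunk] = []
    for index, start in enumerate(range(0, len(words), step)):
        chunks.append(Chunk(doc_title, index, " ".join(words[start:start + max_words])))
        if start + max_words >= len(words):
            break  # the final window already reached the end of the text
    return chunks


def map_reduce(chunks: List[Chunk], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the summaries (reduce)."""
    partials = [summarize(chunk.with_context()) for chunk in chunks]
    return summarize("\n".join(partials))
```

The overlap keeps sentences that straddle a chunk boundary visible in both neighbors, and the context prefix gives the model the document title and position even when it sees only one chunk. For very long documents, the reduce step may itself need to run in several rounds so that the combined partial summaries fit in the context window.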