Images & Long Documents

Not all information lives in text. This page covers image processing (OCR, captioning) and strategies for documents too long to process in a single pass.

Beyond Text: Images and Long Documents

Once you can extract text from PDFs (covered in Working with PDFs), two harder problems remain: images that contain information your RAG system needs, and documents too long to process in a single pass. This page covers both.

Image Integration in RAG

Product photos, architecture diagrams, charts, and scanned receipts all carry data your RAG system might need. The question is how to make that image content searchable.

When to Extract Text from Images

When to Caption/Describe Images

Decision Framework: Text vs. Caption

Image Type	Approach	Reason
Screenshots of code/logs	OCR text extraction	Text is primary content
Charts/graphs	Caption with data	Visual info + specific values
Diagrams with labels	Both (OCR + caption)	Labels + structural understanding
Photos of products	Caption	Visual features matter
Scanned text documents	OCR	Text is the content
Infographics	Caption	Mix of visual + text
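
The table above can be encoded as a simple lookup so an ingestion step can route each image to the right treatment. This is a minimal sketch, not a prescribed implementation: the image-type labels, the Approach enum, and the fallback to BOTH for unknown types are all assumptions made for illustration.

```python
from enum import Enum


class Approach(Enum):
    OCR = "ocr"          # extract the text that appears in the image
    CAPTION = "caption"  # describe the image in natural language
    BOTH = "both"        # run OCR and captioning, index both results


# Mirrors the decision table; the string keys are hypothetical labels
# that an upstream classifier or a human tagger would assign.
APPROACH_BY_IMAGE_TYPE = {
    "code_or_log_screenshot": Approach.OCR,
    "chart_or_graph": Approach.CAPTION,
    "labeled_diagram": Approach.BOTH,
    "product_photo": Approach.CAPTION,
    "scanned_text_document": Approach.OCR,
    "infographic": Approach.CAPTION,
}


def choose_approach(image_type: str) -> Approach:
    """Return the indexing approach for an image type.

    Unknown types fall back to BOTH. That default is an assumption:
    it trades extra processing cost for not losing information.
    """
    return APPROACH_BY_IMAGE_TYPE.get(image_type, Approach.BOTH)
```

For example, choose_approach("labeled_diagram") returns Approach.BOTH, so both the labels and a structural description would be indexed.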

Handling Long Documents

Long documents (100+ pages) present unique challenges:

Can’t fit entire document in LLM context window
Need to aggregate information across sections
Must maintain document-level context
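
One common way to work within these constraints is to split the document into overlapping windows of pages and prefix each window with a short document-level header. The sketch below assumes the pages are already extracted as strings; the function names, the window and overlap sizes, and the header format are illustrative choices, not something this page prescribes.

```python
from typing import List


def chunk_pages(pages: List[str], window: int = 10, overlap: int = 2) -> List[List[str]]:
    """Split pages into windows of `window` pages that share `overlap` pages."""
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    step = window - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(pages), step):
        chunks.append(pages[start:start + window])
        # Stop once a window has reached the last page, so we do not
        # emit a trailing chunk made only of already-seen pages.
        if start + window >= len(pages):
            break
    return chunks


def with_document_context(title: str, chunk: List[str]) -> str:
    """Prefix a chunk with a document-level header (hypothetical format)."""
    return f"Document: {title}\n\n" + "\n".join(chunk)
```

The overlap keeps sentences that straddle a window boundary visible in at least one chunk, and the header gives each chunk a minimal reminder of which document it came from.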

Pattern 1: Constant-Output Tasks

Use case: Finding specific information (needle in haystack)
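
A minimal sketch of this pattern, assuming the document is already split into chunks: scan the chunks in order and stop at the first one that yields an answer, so the output stays the same size no matter how long the document is. The ask callable is hypothetical; it stands in for whatever model call or extractor you use and is expected to return None when a chunk does not contain the answer.

```python
from typing import Callable, Iterable, Optional


def find_in_document(
    chunks: Iterable[str],
    question: str,
    ask: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    """Return the first answer found, or None if no chunk contains one."""
    for chunk in chunks:
        answer = ask(chunk, question)  # hypothetical per-chunk lookup
        if answer is not None:
            return answer  # early exit: later chunks are never read
    return None


# Stand-in for a model call, used only to make the example runnable.
def keyword_ask(chunk: str, question: str) -> Optional[str]:
    marker = "Invoice total:"
    if marker in chunk:
        return chunk.split(marker, 1)[1].strip()
    return None
```

Because the search stops at the first hit, cost depends on where the answer sits in the document, while the result is always a single short value.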

Pattern 2: Variable-Output Tasks

Use case: Summarization, analysis (output scales with document length)
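
For this pattern, one simple approach is to summarize each chunk independently and concatenate the results, so the output grows with the number of chunks. The sketch below is one interpretation of the pattern, not the page's own implementation; the summarize callable is hypothetical and stands in for a model call, and the "Section N" labels are an illustrative choice.

```python
from typing import Callable, List


def summarize_document(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk and join the per-section summaries in order."""
    section_summaries = [summarize(chunk) for chunk in chunks]
    return "\n\n".join(
        f"Section {number}: {summary}"
        for number, summary in enumerate(section_summaries, start=1)
    )


# Stand-in for a model call: keep only the first sentence of the chunk.
def first_sentence(chunk: str) -> str:
    return chunk.split(".", 1)[0].strip() + "."
```

Unlike the needle-in-haystack case, every chunk has to be processed, and a document with twice as many chunks produces roughly twice as much output.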

A Complete Unstructured Data Pipeline

