
Beyond Text: Images and Long Documents

Once you can extract text from PDFs (covered in Working with PDFs), two harder problems remain: images that contain information your RAG system needs, and documents too long to process in a single pass. This page covers both.

Image Integration in RAG

Not all information lives in text. Product photos, architecture diagrams, charts, and scanned receipts all carry data your RAG system might need. The question is: how do you make image content searchable?

When to Extract Text from Images

When to Caption/Describe Images

Decision Framework: Text vs. Caption

Image Type | Approach | Reason
Screenshots of code/logs | OCR text extraction | Text is primary content
Charts/graphs | Caption with data | Visual info + specific values
Diagrams with labels | Both (OCR + caption) | Labels + structural understanding
Photos of products | Caption | Visual features matter
Scanned text documents | OCR | Text is the content
Infographics | Caption | Mix of visual + text
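
The framework above can be expressed as a small lookup, sketched here under a few assumptions. The image-type keys, the `Approach` enum, and the `ocr_fn` / `caption_fn` callables are hypothetical names chosen for illustration; they stand in for whatever OCR engine and captioning model you actually use. The fallback to "both" for unknown image types is also an assumption, not part of the table.

```python
from enum import Enum
from typing import Any, Callable


class Approach(Enum):
    OCR = "ocr"
    CAPTION = "caption"
    BOTH = "both"


# Mirrors the decision table; the string keys are hypothetical labels.
IMAGE_APPROACH = {
    "code_screenshot": Approach.OCR,
    "chart": Approach.CAPTION,
    "labeled_diagram": Approach.BOTH,
    "product_photo": Approach.CAPTION,
    "scanned_document": Approach.OCR,
    "infographic": Approach.CAPTION,
}


def choose_approach(image_type: str) -> Approach:
    # Assumption: unknown image types get both treatments to avoid losing content.
    return IMAGE_APPROACH.get(image_type, Approach.BOTH)


def build_index_text(
    image_type: str,
    image: Any,
    ocr_fn: Callable[[Any], str],
    caption_fn: Callable[[Any], str],
) -> str:
    """Return the text that would be embedded and indexed for this image."""
    approach = choose_approach(image_type)
    parts = []
    if approach in (Approach.OCR, Approach.BOTH):
        parts.append(ocr_fn(image))
    if approach in (Approach.CAPTION, Approach.BOTH):
        parts.append(caption_fn(image))
    # Skip empty results so a failed OCR pass does not leave blank lines.
    return "\n".join(p for p in parts if p)
```

Passing the OCR and captioning steps in as functions keeps the routing logic testable without any image libraries installed.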

Handling Long Documents

Long documents (100+ pages) present three challenges:
  • The entire document can’t fit in the LLM context window
  • Information must be aggregated across sections
  • Document-level context must be maintained
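
One way to address the first and third challenges together is to split the document into page windows and stamp each window with a document-level header. The sketch below assumes the pages are already extracted as strings. The `Chunk` class, the `chunk_pages` function, and the default window and overlap sizes are illustrative choices, not values from this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    doc_title: str
    start_page: int  # 1-based, inclusive
    end_page: int    # 1-based, inclusive
    text: str


def chunk_pages(
    doc_title: str,
    pages: List[str],
    pages_per_chunk: int = 10,
    overlap: int = 1,
) -> List[Chunk]:
    """Split pages into overlapping windows, each prefixed with document context."""
    if overlap < 0 or pages_per_chunk <= overlap:
        raise ValueError("pages_per_chunk must be greater than overlap")
    step = pages_per_chunk - overlap
    chunks = []
    for start in range(0, len(pages), step):
        window = pages[start:start + pages_per_chunk]
        first, last = start + 1, start + len(window)
        # The header carries document-level context into every chunk.
        header = f"[{doc_title}, pages {first}-{last}]"
        chunks.append(Chunk(doc_title, first, last, header + "\n" + "\n".join(window)))
        if start + pages_per_chunk >= len(pages):
            break  # the last window already reached the end of the document
    return chunks
```

The overlap means a sentence that straddles a window boundary appears whole in at least one chunk, at the cost of processing a few pages twice.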

Pattern 1: Constant-Output Tasks

Use case: Finding specific information (needle in haystack); the output stays the same size regardless of document length
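
A minimal sketch of one plausible reading of this pattern: scan the chunks in order, ask the same question of each, and stop at the first chunk that yields an answer. The `find_in_document` function and the `ask_chunk` callable are hypothetical; in practice `ask_chunk` would wrap an LLM call that returns `None` when the chunk does not contain the answer.

```python
from typing import Callable, Iterable, Optional, Tuple


def find_in_document(
    question: str,
    chunks: Iterable[str],
    ask_chunk: Callable[[str, str], Optional[str]],
) -> Optional[Tuple[int, str]]:
    """Return (chunk_index, answer) for the first chunk that answers the question.

    The result is a single answer no matter how many chunks the document has,
    which is what makes the task constant-output.
    """
    for index, chunk in enumerate(chunks):
        answer = ask_chunk(question, chunk)
        if answer is not None:
            return index, answer  # early exit: later chunks are never processed
    return None


# Stand-in for an LLM call, used only to show the control flow.
def keyword_ask(question: str, chunk: str) -> Optional[str]:
    return chunk if "warranty" in chunk.lower() else None


result = find_in_document(
    "How long is the warranty?",
    ["Introduction", "The warranty lasts 2 years.", "Appendix"],
    keyword_ask,
)
```

Returning the chunk index alongside the answer makes it easy to cite the page range the answer came from.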

Pattern 2: Variable-Output Tasks

Use case: Summarization, analysis (output scales with document length)
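
A minimal sketch of how such a task might be structured, assuming the document is already split into chunks: each chunk produces its own summary, so the output grows with the number of chunks. The previous summary is passed forward so each step keeps some document-level context. The `summarize_document` function and the `summarize_chunk` callable are hypothetical; `summarize_chunk` would normally wrap an LLM call.

```python
from typing import Callable, Iterable, List


def summarize_document(
    chunks: Iterable[str],
    summarize_chunk: Callable[[str, str], str],
) -> List[str]:
    """Summarize each chunk in order, passing the previous summary as context.

    One summary is produced per chunk, so the output length scales with
    the length of the document.
    """
    summaries: List[str] = []
    previous = ""  # the first chunk has no earlier context
    for chunk in chunks:
        current = summarize_chunk(chunk, previous)
        summaries.append(current)
        previous = current
    return summaries


# Stand-in for an LLM call: keep only the first sentence of each chunk.
def first_sentence(chunk: str, previous_summary: str) -> str:
    return chunk.split(".")[0].strip()


sections = summarize_document(
    ["Setup steps. Details follow.", "Usage notes. More text.", "Known limits. End."],
    first_sentence,
)
```

If a single overall summary is needed, the per-chunk summaries can be joined and summarized once more, at which point the task becomes closer to the constant-output pattern.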

A Complete Unstructured Data Pipeline