Working with PDF and Images
The PDF and Images Challenge
What makes PDFs and images “unstructured”:
- No consistent schema
- Mixed content types (text, images, tables)
- Varied formats (PDFs, Word docs, slides)
- Quality issues (scans, OCR errors, formatting)
Why this matters:
- By commonly cited industry estimates, about 80% of enterprise data is unstructured
- Most business value is locked in documents
- Poor processing = poor retrieval = poor answers
PDF Processing: Digital vs. Scanned
Not all PDFs are created equal.
Digital PDFs (Text-Based)
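A digital PDF carries a text layer that can be read directly, without OCR. The following is a minimal sketch, assuming the third-party `pypdf` package is installed; `normalize_page_text` and `extract_digital_pdf` are illustrative names, not part of any library.

```python
import re


def normalize_page_text(raw: str) -> str:
    """Tidy text taken from a PDF text layer."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)  # rejoin words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)       # keep at most one blank line
    return text.strip()


def extract_digital_pdf(path: str) -> list[dict]:
    """Return one record per page with its cleaned text (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party; imported lazily so the helper above works alone

    reader = PdfReader(path)
    pages = []
    for number, page in enumerate(reader.pages, start=1):
        raw = page.extract_text() or ""  # treat a missing text layer as empty text
        pages.append({"page": number, "text": normalize_page_text(raw)})
    return pages
```

Pages that come back empty or nearly empty are a hint that the file is scanned and needs OCR instead.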
Scanned PDFs (Image-Based)
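A scanned PDF has no text layer, so each page has to be rendered to an image and passed through OCR. The sketch below assumes the third-party packages `pdf2image` (which needs Poppler) and `pytesseract` (which needs the Tesseract binary); the function name and the `render`/`recognize` hooks are hypothetical and exist only so the steps can be swapped or tested.

```python
def ocr_scanned_pdf(path, dpi=300, lang="eng", render=None, recognize=None):
    """Render each page to an image, OCR it, and return one record per page."""
    if render is None:
        from pdf2image import convert_from_path  # third-party; requires Poppler

        def render(pdf_path):
            return convert_from_path(pdf_path, dpi=dpi)

    if recognize is None:
        import pytesseract  # third-party; requires the Tesseract binary

        def recognize(image):
            return pytesseract.image_to_string(image, lang=lang)

    pages = []
    for number, image in enumerate(render(path), start=1):
        text = recognize(image).strip()
        pages.append({"page": number, "text": text, "source": "ocr"})
    return pages
```

Higher `dpi` values usually help recognition on small print at the cost of speed; 300 is a common starting point, not a tuned value.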
Combined PDF Processing Pipeline
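Many real PDFs mix digital and scanned pages, so one plausible design is to decide per page: keep the text layer when it has enough content, otherwise fall back to OCR. This is a sketch under assumptions: the 50-character threshold is arbitrary, and `ocr_page` is a hypothetical callable that takes a zero-based page index and returns OCR text.

```python
def needs_ocr(text_layer: str, min_chars: int = 50) -> bool:
    """Treat a page as scanned when its text layer has too few letters or digits."""
    meaningful = sum(ch.isalnum() for ch in text_layer)
    return meaningful < min_chars


def process_pdf_pages(text_layers, ocr_page, min_chars=50):
    """Route each page to its text layer or to OCR and record which path was used."""
    results = []
    for index, layer in enumerate(text_layers):
        if needs_ocr(layer, min_chars):
            text, method = ocr_page(index), "ocr"
        else:
            text, method = layer, "text_layer"
        results.append({"page": index + 1, "text": text, "method": method})
    return results
```

Recording the method in each record makes it easier to trace retrieval problems back to OCR quality later.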
Table Extraction: The Hard Problem
Tables are notoriously difficult for OCR and even digital PDFs.
Why tables are hard:
- OCR sees tables as disconnected text blocks
- Column alignment information is lost
- Multi-line cells get fragmented
- Headers vs. data distinction unclear
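The loss of column alignment can be shown with a simplified simulation. The table below is made-up illustration data, and the "flattening" step only imitates what a plain text dump keeps; it is not the output of any real OCR engine.

```python
table = [
    ["Item", "Q1", "Q2"],
    ["Widget", "", "40"],    # Q1 is empty for this row
    ["Gadget", "15", "22"],
]

# A plain text dump keeps the words but not the cell boundaries.
flattened = "\n".join(" ".join(cell for cell in row if cell) for row in table)

# Naive recovery: split every line on whitespace.
recovered = [line.split() for line in flattened.splitlines()]

print(recovered[1])  # ['Widget', '40']
```

After the round trip, `40` sits in the second position, so a naive parser would read it as the Q1 value even though it belongs to Q2.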
1. Specialized Table Extractors
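As one example of a specialized extractor, the sketch below uses the third-party `camelot` package, assuming it is installed. The `extract_tables` wrapper, the `read_pdf` hook, and the accuracy threshold of 80 are assumptions made for illustration; `camelot.read_pdf`, `table.df`, and `table.parsing_report` come from the library.

```python
def extract_tables(path, pages="1", flavor="lattice", min_accuracy=80.0, read_pdf=None):
    """Extract tables and keep only those whose reported accuracy is high enough."""
    if read_pdf is None:
        import camelot  # third-party; imported lazily

        read_pdf = camelot.read_pdf

    kept = []
    for table in read_pdf(path, pages=pages, flavor=flavor):
        report = table.parsing_report  # includes "accuracy" and "page"
        if report["accuracy"] >= min_accuracy:
            kept.append(
                {
                    "page": report["page"],
                    "accuracy": report["accuracy"],
                    "frame": table.df,  # pandas DataFrame holding the cell values
                }
            )
    return kept
```

The `lattice` flavor targets tables with ruled lines; `stream` is the library's alternative for tables separated by whitespace.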
2. Vision-Language Models for Tables
Modern approach: use multimodal models to “see” the table.
Production recommendation:
- Try a specialized extractor (camelot) first: fast and accurate for clean tables
- Fall back to a vision model for complex or messy tables: slower but more robust
- Store both the raw table and the structured version in metadata
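The recommendation above can be sketched as a two-stage function. Everything here is an assumption for illustration: `fast_extractor` is a hypothetical callable returning `(rows, accuracy)`, `vision_model` is a hypothetical callable that returns the table as pipe-delimited markdown, and "raw" is interpreted as the row lists while "structured" means one dict per data row.

```python
def parse_markdown_table(reply: str) -> list[list[str]]:
    """Parse a pipe-delimited markdown table into rows of cell strings."""
    rows = []
    for line in reply.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore any prose around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue  # separator row such as |---|---|
        rows.append(cells)
    return rows


def extract_table(page, fast_extractor, vision_model, min_accuracy=80.0) -> dict:
    """Try the fast extractor first; fall back to the vision model when it is unsure."""
    rows, accuracy = fast_extractor(page)
    method = "specialized"
    if not rows or accuracy < min_accuracy:
        rows = parse_markdown_table(vision_model(page))
        method = "vision"

    structured = []
    if rows:
        header, *body = rows
        structured = [dict(zip(header, row)) for row in body]

    return {
        "metadata": {
            "raw_table": rows,
            "structured_table": structured,
            "extraction_method": method,
        }
    }
```

Keeping the extraction method in the metadata makes it possible to audit how often the slower fallback is actually needed.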
Image Integration in RAG
Should you include images in your RAG pipeline? Depends on the use case.
When to Extract Text from Images
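When the text is the primary content of an image (a screenshot of logs, a scanned page), OCR output can be indexed like any other text chunk. A minimal sketch, assuming the third-party `Pillow` and `pytesseract` packages; the function name, the `recognize` hook, and the five-word minimum are hypothetical choices.

```python
def image_to_text_chunk(image_path, recognize=None, min_words=5):
    """OCR an image and return a text chunk, or None when too little text is found."""
    if recognize is None:
        from PIL import Image  # third-party (Pillow)
        import pytesseract     # third-party; requires the Tesseract binary

        def recognize(path):
            return pytesseract.image_to_string(Image.open(path))

    text = " ".join(recognize(image_path).split())  # collapse whitespace and newlines
    if len(text.split()) < min_words:
        return None  # probably not a text-heavy image
    return {"text": text, "metadata": {"source_image": image_path, "modality": "ocr"}}
```

A `None` result is a reasonable signal to try captioning instead.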
When to Caption/Describe Images
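When the visual content matters more than any embedded text, a generated description can stand in for the image in the index. The sketch below is vendor-neutral: `caption_model` is a hypothetical callable that takes an image path and a prompt and returns a string, and the prompt wording is only an example.

```python
def image_to_caption_chunk(image_path, caption_model, surrounding_text=""):
    """Ask a captioning model to describe an image and wrap the result as a chunk."""
    prompt = (
        "Describe this image for a search index. "
        "Mention any visible numbers, labels, and trends."
    )
    if surrounding_text:
        # Nearby document text helps the model resolve what the image is about.
        prompt += f"\nNearby document text: {surrounding_text}"

    caption = caption_model(image_path, prompt).strip()
    return {
        "text": caption,
        "metadata": {"source_image": image_path, "modality": "caption"},
    }
```

Storing the image path in the metadata lets the application show the original image alongside a retrieved caption.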
Decision Framework: Text vs. Caption
| Image Type | Approach | Reason |
|---|---|---|
| Screenshots of code/logs | OCR text extraction | Text is primary content |
| Charts/graphs | Caption with data | Visual info + specific values |
| Diagrams with labels | Both (OCR + caption) | Labels + structural understanding |
| Photos of products | Caption | Visual features matter |
| Scanned text documents | OCR | Text is the content |
| Infographics | Caption | Mix of visual + text |
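The decision table can be encoded as a small lookup so the routing is explicit in code. The type keys are hypothetical labels chosen for this sketch, and processing unknown types with both approaches is an assumption rather than a rule from the table.

```python
APPROACH_BY_IMAGE_TYPE = {
    "code_or_log_screenshot": ("ocr",),
    "chart_or_graph": ("caption",),
    "labeled_diagram": ("ocr", "caption"),
    "product_photo": ("caption",),
    "scanned_text_document": ("ocr",),
    "infographic": ("caption",),
}


def choose_approach(image_type: str, default=("ocr", "caption")) -> tuple:
    """Return the processing steps for an image type; unknown types get the default."""
    return APPROACH_BY_IMAGE_TYPE.get(image_type, default)


print(choose_approach("labeled_diagram"))  # ('ocr', 'caption')
```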
Handling Long Documents
Long documents (100+ pages) present unique challenges:
- Can’t fit entire document in LLM context window
- Need to aggregate information across sections
- Must maintain document-level context
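One common way to address these constraints is to split the document into overlapping windows and attach document-level context to every chunk. This is a sketch under assumptions: windows are measured in words rather than tokens, the sizes are arbitrary defaults, and prefixing the title is just one simple way to carry context.

```python
def chunk_with_context(text, title, window=200, overlap=20):
    """Split text into overlapping word windows, each prefixed with the document title."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")

    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + window]
        chunks.append(
            {
                "text": f"[{title}] " + " ".join(piece),
                "metadata": {"title": title, "start_word": start},
            }
        )
        if start + window >= len(words):
            break  # the last window already reaches the end of the document
    return chunks
```

For example, a 10-word text with `window=4` and `overlap=1` yields three chunks starting at words 0, 3, and 6.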
Pattern 1: Constant-Output Tasks
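In a constant-output task the answer stays the same size no matter how long the document is, so one plausible approach is to scan the chunks and stop at the first one that yields an answer. In this sketch, `ask` is a hypothetical callable (for example, a wrapper around an LLM call) that returns an answer string, or `None` when the chunk does not contain one.

```python
def find_in_document(chunks, question, ask):
    """Scan chunks in order and return the first answer along with where it was found."""
    for index, chunk in enumerate(chunks):
        answer = ask(chunk, question)
        if answer:
            return {"answer": answer, "chunk_index": index}
    return None  # nothing in the document answered the question
```

Returning the chunk index keeps the answer traceable to its source passage.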
Use case: Finding specific information (needle in haystack)
Pattern 2: Variable-Output Tasks
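In a variable-output task the result grows with the document, so one possible approach is to process the document section by section and pass a short rolling context forward so that later sections stay consistent with earlier ones. This sketch is one interpretation of the pattern: `summarize` is a hypothetical callable taking `(section_text, context)` and returning a string, and `context_size` is an arbitrary default.

```python
def summarize_sections(sections, summarize, context_size=2):
    """Summarize each section in order, passing along the most recent summaries."""
    outputs = []
    for section in sections:
        recent = outputs[-context_size:] if context_size > 0 else []
        context = "\n".join(recent)  # rolling document-level context
        outputs.append(summarize(section, context))
    return outputs  # one summary per section, so output scales with document length
```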
Use case: Summarization, analysis (output scales with document length)
A Complete Unstructured Data Pipeline
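The pieces discussed so far can be tied together by treating a document as a sequence of typed elements and dispatching each one to a handler. This is a minimal sketch, not a definitive design: the `Chunk` class, the element kinds, and the handler callables are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def run_pipeline(elements, handlers, source):
    """Turn (kind, payload) elements into chunks using one handler per kind."""
    chunks = []
    for position, (kind, payload) in enumerate(elements):
        handler = handlers.get(kind)
        if handler is None:
            continue  # unknown element type: skip rather than guess
        text = handler(payload).strip()
        if text:  # drop elements that produced no usable text
            chunks.append(
                Chunk(text, {"source": source, "kind": kind, "position": position})
            )
    return chunks
```

A typical `handlers` mapping would route `"text"` to a cleaner, `"table"` to a table extractor, and `"image"` to OCR or captioning.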
Practical Exercise (20 min)
Process a mixed document corpus:
Pseudocode
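One possible starting skeleton for this exercise, written as runnable Python rather than pseudocode. The extension-to-route table and the function names are assumptions; the actual processing for each route is left for the exercise.

```python
from collections import Counter
from pathlib import Path

# Hypothetical routing table: file extension -> processing route.
ROUTES = {
    ".pdf": "pdf",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".docx": "word",
    ".pptx": "slides",
}


def plan_corpus(paths):
    """Decide which route each file should take, based on its extension."""
    plan = []
    for path in map(Path, paths):
        route = ROUTES.get(path.suffix.lower(), "skip")
        plan.append((str(path), route))
    return plan


def summarize_plan(plan):
    """Count how many files go to each route."""
    return Counter(route for _, route in plan)
```

Printing the route counts before processing is a quick check that nothing in the corpus is being skipped unintentionally.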