Working with PDF and Images

The PDF and Images Challenge

What makes PDF and images “unstructured”:
  • No consistent schema
  • Mixed content types (text, images, tables)
  • Varied formats (PDFs, Word docs, slides)
  • Quality issues (scans, OCR errors, formatting)
Why it matters:
  • An estimated 80% of enterprise data is unstructured
  • Most business value is locked in documents
  • Poor processing = poor retrieval = poor answers

PDF Processing: Digital vs. Scanned

Not all PDFs are created equal.

Digital PDFs (Text-Based)

Scanned PDFs (Image-Based)

Combined PDF Processing Pipeline
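The combined pipeline can be sketched as a per-page router: try direct text extraction first, and fall back to OCR when the PDF's text layer yields almost nothing. This is a minimal illustration, not a full implementation — `extract_text` and `ocr_page` are hypothetical stand-ins for real extractors (e.g. pdfplumber and pytesseract), and the 50-character threshold is an assumed heuristic you would tune per corpus.

```python
# Sketch of a combined pipeline: route each page to direct text
# extraction or OCR depending on how much text the PDF layer yields.
# The extractor callables are stand-ins; the threshold is a heuristic.

MIN_CHARS_FOR_DIGITAL = 50  # below this, assume the page is a scan

def process_pdf(pages, extract_text, ocr_page):
    """Return a (text, source) pair per page, choosing extraction vs. OCR."""
    results = []
    for page in pages:
        text = extract_text(page)
        if len(text.strip()) >= MIN_CHARS_FOR_DIGITAL:
            results.append((text, "digital"))
        else:
            results.append((ocr_page(page), "ocr"))
    return results
```

Tracking the `source` label per page is useful downstream: OCR-derived chunks can carry a lower confidence score in metadata.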

Table Extraction: The Hard Problem

Tables are notoriously difficult to extract — from scanned pages via OCR, and even from digital PDFs. Why tables are hard:
  • OCR sees tables as disconnected text blocks
  • Column alignment information is lost
  • Multi-line cells get fragmented
  • Headers vs. data distinction unclear
Solutions:

1. Specialized Table Extractors
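As a sketch, extraction with camelot might look like the following, assuming camelot is installed (its `read_pdf` returns tables carrying a pandas `df` and a `parsing_report` with an accuracy score). The `table_to_text` helper is illustrative — one simple way to serialize a table for embedding:

```python
def table_to_text(rows):
    """Serialize extracted table rows into pipe-delimited lines,
    a format that survives chunking better than raw cell soup."""
    return "\n".join(" | ".join(str(c).strip() for c in row) for row in rows)

def extract_tables(pdf_path, pages="all"):
    """Extract tables with camelot; returns (rows, accuracy) pairs.
    The lattice flavor works best on tables with ruled lines."""
    import camelot  # imported lazily so table_to_text stays dependency-free
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="lattice")
    return [(t.df.values.tolist(), t.parsing_report["accuracy"])
            for t in tables]
```

The accuracy score gives you a signal for when to escalate to a vision model instead of trusting the fast path.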

2. Vision-Language Models for Tables

Modern approach: use a multimodal model to "see" the table.

Production recommendation:
  1. Try specialized extractor (camelot) first - fast and accurate for clean tables
  2. Fall back to vision model for complex/messy tables - slower but more robust
  3. Store both raw table and structured version in metadata
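The three steps above can be sketched as a fallback chain. `run_extractor` and `run_vision_model` are hypothetical callables standing in for real integrations, and the 80% accuracy cutoff is an assumed heuristic:

```python
ACCURACY_THRESHOLD = 80.0  # heuristic cutoff for trusting the fast extractor

def extract_table(page_image, run_extractor, run_vision_model):
    """Try the specialized extractor first; fall back to a vision model.
    Returns a record keeping both the raw input and the structured
    version in metadata, per step 3 above."""
    try:
        structured, accuracy = run_extractor(page_image)
        if accuracy >= ACCURACY_THRESHOLD:
            return {"structured": structured, "raw": page_image,
                    "method": "extractor"}
    except Exception:
        pass  # messy table: the extractor failed outright
    return {"structured": run_vision_model(page_image), "raw": page_image,
            "method": "vision"}
```

Keeping the `method` field in metadata lets you audit later which tables came from the slower, less deterministic vision path.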

Image Integration in RAG

Should you include images in your RAG pipeline? It depends on the use case.

When to Extract Text from Images

When to Caption/Describe Images

Decision Framework: Text vs. Caption

Image Type                    Approach                Reason
Screenshots of code/logs      OCR text extraction     Text is primary content
Charts/graphs                 Caption with data       Visual info + specific values
Diagrams with labels          Both (OCR + caption)    Labels + structural understanding
Photos of products            Caption                 Visual features matter
Scanned text documents        OCR                     Text is the content
Infographics                  Caption                 Mix of visual + text
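The decision framework above can be encoded as a simple lookup. The category names here are illustrative — a real pipeline would need an upstream classifier to assign them:

```python
# The decision table as a lookup from image type to processing steps.
IMAGE_STRATEGY = {
    "screenshot": {"ocr"},
    "chart": {"caption"},
    "diagram": {"ocr", "caption"},  # labels + structural understanding
    "photo": {"caption"},
    "scanned_text": {"ocr"},
    "infographic": {"caption"},
}

def plan_image_processing(image_type):
    """Return the set of processing steps for an image type,
    defaulting to captioning for unrecognized types."""
    return IMAGE_STRATEGY.get(image_type, {"caption"})
```

Captioning as the default is a judgment call: it degrades more gracefully than OCR on images that turn out to have little text.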

Handling Long Documents

Long documents (100+ pages) present unique challenges:
  • Can’t fit entire document in LLM context window
  • Need to aggregate information across sections
  • Must maintain document-level context

Pattern 1: Constant-Output Tasks

Use case: Finding specific information (needle in haystack)
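A constant-output task can be handled by scanning fixed-size windows of pages and stopping at the first hit — the answer stays the same size no matter how long the document is. In this sketch, `ask_chunk` is a hypothetical stand-in for an LLM call that returns an answer string or `None`, and the 10-page window is an assumption:

```python
def find_in_document(pages, ask_chunk, window=10):
    """Needle-in-haystack search over a long document: query
    fixed-size windows of pages and return the first non-empty
    answer, with the page range it came from."""
    for start in range(0, len(pages), window):
        chunk = "\n".join(pages[start:start + window])
        answer = ask_chunk(chunk)
        if answer:
            return {"answer": answer, "pages": (start, start + window)}
    return None  # the needle was not found anywhere
```

Returning the page range alongside the answer keeps the result citable, which matters for trust in RAG answers.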

Pattern 2: Variable-Output Tasks

Use case: Summarization, analysis (output scales with document length)
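Variable-output tasks are commonly handled map-reduce style: summarize windows of pages, then recursively summarize the summaries until the result fits a budget. A minimal sketch — `summarize` stands in for an LLM summarization call, and the window and budget values are assumptions; note this only terminates if `summarize` actually shrinks its input:

```python
def summarize_document(pages, summarize, max_chars=4000, window=10):
    """Map-reduce summarization: summarize each window of pages,
    then treat the summaries as a new 'document' and repeat until
    the combined result fits the character budget."""
    chunks = ["\n".join(pages[i:i + window])
              for i in range(0, len(pages), window)]
    summaries = [summarize(c) for c in chunks]
    combined = "\n\n".join(summaries)
    if len(combined) <= max_chars or len(summaries) == 1:
        return combined  # fits the budget, or nothing left to fold
    return summarize_document(summaries, summarize, max_chars, window)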

A Complete Unstructured Data Pipeline
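At a high level, such a pipeline routes each file to a format-specific handler and normalizes the results into one record shape. A minimal sketch, with the entries in `handlers` as hypothetical stand-ins for the text, PDF, and vision paths described above:

```python
def build_pipeline(handlers):
    """Return a pipeline function that routes files to handlers by
    extension and tags each result with its source and method.
    `handlers` maps extensions to processing callables."""
    def run(filenames):
        docs = []
        for name in filenames:
            ext = name.rsplit(".", 1)[-1].lower()
            handler = handlers.get(ext)
            if handler is None:
                continue  # unsupported type: skip rather than fail the batch
            docs.append({"source": name, "method": ext,
                         "content": handler(name)})
        return docs
    return run
```

Normalizing every path to the same record shape is what lets chunking, metadata storage, and retrieval downstream stay format-agnostic.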

Practical Exercise (20 min)

Process a mixed document corpus:
Pseudocode
/**
 * Practical Exercise: Unstructured Data Pipeline
 */
async function processMixedCorpus(documents: File[]) {
  // 1. Process text document
  const textDoc = documents.find(d => d.name === "paul_graham_essay.txt");
  const textResult = await processText(textDoc);
  
  // 2. Apply chunking strategy (see Lesson 2.3)
  const chunks = chunkDocument(textResult.content, { strategy: "semantic" });
  storeMetadata(chunks);

  // 3. (Optional) Compare with PDF/image processing
  if (hasMultimedia(documents)) {
    const imageDoc = documents.find(d => /\.(png|jpe?g)$/i.test(d.name));
    const pdfDoc = documents.find(d => d.name.toLowerCase().endsWith(".pdf"));
    const visionResult = await processVision(imageDoc);
    const pdfResult = await processPDF(pdfDoc);
    
    compareLatencyAndQuality(textResult, visionResult, pdfResult);
  }

  // 4. Analyze results
  // - Measure processing time
  // - Check for quality risks (short chunks, OCR noise)
  // - Validate confidence scores
}

// Expected Insights:
// - Text path is high-confidence & fast
// - Vision/OCR adds latency & uncertainty