Working with PDF and Images

The PDF and Images Challenge

What makes PDF and images “unstructured”:
  • No consistent schema
  • Mixed content types (text, images, tables)
  • Varied formats (PDFs, Word docs, slides)
  • Quality issues (scans, OCR errors, formatting)
Why it matters:
  • An estimated 80% of enterprise data is unstructured
  • Most business value is locked in documents
  • Poor processing = poor retrieval = poor answers

PDF Processing: Digital vs. Scanned

Not all PDFs are created equal.

Digital PDFs (Text-Based)

Scanned PDFs (Image-Based)

Combined PDF Processing Pipeline
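The combined pipeline can be sketched as a per-page router: try direct text extraction first, and fall back to OCR when the PDF's text layer yields almost nothing. This is a minimal illustration, not a full implementation — `extract_text` and `ocr_page` are hypothetical stand-ins for real extractors (e.g. pdfplumber and pytesseract), and the 50-character threshold is an assumed heuristic you would tune per corpus.

```python
# Sketch of a combined pipeline: route each page to direct text
# extraction or OCR depending on how much text the PDF layer yields.
# The extractor callables are stand-ins; the threshold is a heuristic.

MIN_CHARS_FOR_DIGITAL = 50  # below this, assume the page is a scan

def process_pdf(pages, extract_text, ocr_page):
    """Return a (text, source) pair per page, choosing extraction vs. OCR."""
    results = []
    for page in pages:
        text = extract_text(page)
        if len(text.strip()) >= MIN_CHARS_FOR_DIGITAL:
            results.append((text, "digital"))
        else:
            results.append((ocr_page(page), "ocr"))
    return results
```

Tracking the `source` label per page is useful downstream: OCR-derived chunks can carry a lower confidence score in metadata.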

Table Extraction: The Hard Problem

Tables are notoriously difficult to extract — from scanned pages via OCR, and even from digital PDFs. Why tables are hard:
  • OCR sees tables as disconnected text blocks
  • Column alignment information is lost
  • Multi-line cells get fragmented
  • Headers vs. data distinction unclear
Solutions:

1. Specialized Table Extractors
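As a sketch, extraction with camelot might look like the following, assuming camelot is installed (its `read_pdf` returns tables carrying a pandas `df` and a `parsing_report` with an accuracy score). The `table_to_text` helper is illustrative — one simple way to serialize a table for embedding:

```python
def table_to_text(rows):
    """Serialize extracted table rows into pipe-delimited lines,
    a format that survives chunking better than raw cell soup."""
    return "\n".join(" | ".join(str(c).strip() for c in row) for row in rows)

def extract_tables(pdf_path, pages="all"):
    """Extract tables with camelot; returns (rows, accuracy) pairs.
    The lattice flavor works best on tables with ruled lines."""
    import camelot  # imported lazily so table_to_text stays dependency-free
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="lattice")
    return [(t.df.values.tolist(), t.parsing_report["accuracy"])
            for t in tables]
```

The accuracy score gives you a signal for when to escalate to a vision model instead of trusting the fast path.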

2. Vision-Language Models for Tables

Modern approach: use a multimodal model to "see" the table.

Production recommendation:
  1. Try specialized extractor (camelot) first - fast and accurate for clean tables
  2. Fall back to vision model for complex/messy tables - slower but more robust
  3. Store both raw table and structured version in metadata
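The three steps above can be sketched as a fallback chain. `run_extractor` and `run_vision_model` are hypothetical callables standing in for real integrations, and the 80% accuracy cutoff is an assumed heuristic:

```python
ACCURACY_THRESHOLD = 80.0  # heuristic cutoff for trusting the fast extractor

def extract_table(page_image, run_extractor, run_vision_model):
    """Try the specialized extractor first; fall back to a vision model.
    Returns a record keeping both the raw input and the structured
    version in metadata, per step 3 above."""
    try:
        structured, accuracy = run_extractor(page_image)
        if accuracy >= ACCURACY_THRESHOLD:
            return {"structured": structured, "raw": page_image,
                    "method": "extractor"}
    except Exception:
        pass  # messy table: the extractor failed outright
    return {"structured": run_vision_model(page_image), "raw": page_image,
            "method": "vision"}
```

Keeping the `method` field in metadata lets you audit later which tables came from the slower, less deterministic vision path.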

Image Integration in RAG

Should you include images in your RAG pipeline? It depends on the use case.

When to Extract Text from Images

When to Caption/Describe Images

Decision Framework: Text vs. Caption

Image Type                    Approach                Reason
Screenshots of code/logs      OCR text extraction     Text is primary content
Charts/graphs                 Caption with data       Visual info + specific values
Diagrams with labels          Both (OCR + caption)    Labels + structural understanding
Photos of products            Caption                 Visual features matter
Scanned text documents        OCR                     Text is the content
Infographics                  Caption                 Mix of visual + text
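The decision framework above can be encoded as a simple lookup. The category names here are illustrative — a real pipeline would need an upstream classifier to assign them:

```python
# The decision table as a lookup from image type to processing steps.
IMAGE_STRATEGY = {
    "screenshot": {"ocr"},
    "chart": {"caption"},
    "diagram": {"ocr", "caption"},  # labels + structural understanding
    "photo": {"caption"},
    "scanned_text": {"ocr"},
    "infographic": {"caption"},
}

def plan_image_processing(image_type):
    """Return the set of processing steps for an image type,
    defaulting to captioning for unrecognized types."""
    return IMAGE_STRATEGY.get(image_type, {"caption"})
```

Captioning as the default is a judgment call: it degrades more gracefully than OCR on images that turn out to have little text.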

Handling Long Documents

Long documents (100+ pages) present unique challenges:
  • Can’t fit entire document in LLM context window
  • Need to aggregate information across sections
  • Must maintain document-level context

Pattern 1: Constant-Output Tasks

Use case: Finding specific information (needle in haystack)
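A constant-output task can be handled by scanning fixed-size windows of pages and stopping at the first hit — the answer stays the same size no matter how long the document is. In this sketch, `ask_chunk` is a hypothetical stand-in for an LLM call that returns an answer string or `None`, and the 10-page window is an assumption:

```python
def find_in_document(pages, ask_chunk, window=10):
    """Needle-in-haystack search over a long document: query
    fixed-size windows of pages and return the first non-empty
    answer, with the page range it came from."""
    for start in range(0, len(pages), window):
        chunk = "\n".join(pages[start:start + window])
        answer = ask_chunk(chunk)
        if answer:
            return {"answer": answer, "pages": (start, start + window)}
    return None  # the needle was not found anywhere
```

Returning the page range alongside the answer keeps the result citable, which matters for trust in RAG answers.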

Pattern 2: Variable-Output Tasks

Use case: Summarization, analysis (output scales with document length)
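Variable-output tasks are commonly handled map-reduce style: summarize windows of pages, then recursively summarize the summaries until the result fits a budget. A minimal sketch — `summarize` stands in for an LLM summarization call, and the window and budget values are assumptions; note this only terminates if `summarize` actually shrinks its input:

```python
def summarize_document(pages, summarize, max_chars=4000, window=10):
    """Map-reduce summarization: summarize each window of pages,
    then treat the summaries as a new 'document' and repeat until
    the combined result fits the character budget."""
    chunks = ["\n".join(pages[i:i + window])
              for i in range(0, len(pages), window)]
    summaries = [summarize(c) for c in chunks]
    combined = "\n\n".join(summaries)
    if len(combined) <= max_chars or len(summaries) == 1:
        return combined  # fits the budget, or nothing left to fold
    return summarize_document(summaries, summarize, max_chars, window)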

A Complete Unstructured Data Pipeline
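At a high level, such a pipeline routes each file to a format-specific handler and normalizes the results into one record shape. A minimal sketch, with the entries in `handlers` as hypothetical stand-ins for the text, PDF, and vision paths described above:

```python
def build_pipeline(handlers):
    """Return a pipeline function that routes files to handlers by
    extension and tags each result with its source and method.
    `handlers` maps extensions to processing callables."""
    def run(filenames):
        docs = []
        for name in filenames:
            ext = name.rsplit(".", 1)[-1].lower()
            handler = handlers.get(ext)
            if handler is None:
                continue  # unsupported type: skip rather than fail the batch
            docs.append({"source": name, "method": ext,
                         "content": handler(name)})
        return docs
    return run
```

Normalizing every path to the same record shape is what lets chunking, metadata storage, and retrieval downstream stay format-agnostic.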

Practical Exercise (20 min)

Process a mixed document corpus:
Pseudocode
/**
 * Practical Exercise: Unstructured Data Pipeline
 */
async function processMixedCorpus(documents: File[]) {
  // 1. Process text document
  const textDoc = documents.find(d => d.name === "paul_graham_essay.txt");
  const textResult = await processText(textDoc);
  
  // 2. Apply chunking strategy (see Lesson 2.3)
  const chunks = chunkDocument(textResult.content, { strategy: "semantic" });
  storeMetadata(chunks);

  // 3. (Optional) Compare with PDF/image processing
  if (hasMultimedia(documents)) {
    const imageDoc = documents.find(d => /\.(png|jpe?g)$/i.test(d.name));
    const pdfDoc = documents.find(d => d.name.toLowerCase().endsWith(".pdf"));
    const visionResult = await processVision(imageDoc);
    const pdfResult = await processPDF(pdfDoc);
    
    compareLatencyAndQuality(textResult, visionResult, pdfResult);
  }

  // 4. Analyze results
  // - Measure processing time
  // - Check for quality risks (short chunks, OCR noise)
  // - Validate confidence scores
}

// Expected Insights:
// - Text path is high-confidence & fast
// - Vision/OCR adds latency & uncertainty