Before retrieval works, documents must be split into searchable chunks. This page covers fixed-size, semantic, and structure-aware chunking, plus metadata enrichment.

The Chunking Problem

Before retrieval works, data must be stored and indexed (vector, keyword, or both). You can’t index a 50-page PDF as a single blob: documents must be split into chunks that are specific enough to match against a query.
Context windows have grown (200K–1M+ tokens), but chunking is still essential: stuffing entire documents into the prompt wastes tokens, costs more, and dilutes retrieval relevance at scale.
How you chunk is a trade-off:
| | Small chunks (100-200 tokens) | Large chunks (1000+ tokens) |
|---|---|---|
| Retrieval precision | High: matches specific queries well | Low: relevant signal gets buried in noise |
| Context preserved | Low: meaning gets cut across boundaries | High: full paragraphs and reasoning intact |
| Index size | More chunks to store and search | Fewer chunks, faster search |
| Best for | Factual Q&A, lookups | Summaries, complex reasoning |
The sweet spot depends on your use case — and you can mix strategies.

Chunking Strategies

1. Fixed-Size Chunking (Simplest)

Split the text every N characters or tokens, with each chunk overlapping the previous one so that content near a boundary appears in both.
Pros:
  • Simple to implement
  • Predictable chunk sizes
  • Fast
Cons:
  • May split mid-sentence or mid-concept
  • Ignores document structure
  • Same strategy for all document types
Use when: Prototyping, homogeneous documents, speed matters most
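As a minimal sketch of the idea, the function below slices text into character windows that share an overlap. The name `chunk_fixed` and its default sizes are illustrative assumptions, not part of any library; a token-based variant would need a tokenizer in place of plain slicing.

```python
def chunk_fixed(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text,
        # otherwise the loop would emit a redundant trailing fragment.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_fixed("abcdefghij", chunk_size=4, overlap=1):
        print(repr(piece))
```

Note that the slicing is blind to content: a window can end in the middle of a word or sentence, which is the main drawback listed above.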

2. Semantic Chunking (Better)

Split at semantic boundaries (paragraphs, sections).
Pros:
  • Respects semantic boundaries
  • Keeps related content together
  • Better retrieval quality
Cons:
  • Variable chunk sizes
  • More complex to implement
  • May still split important concepts
Use when: General-purpose RAG, varied document types, quality > speed
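One simple way to approximate this, assuming paragraphs are separated by blank lines, is to split on those boundaries and pack whole paragraphs into chunks up to a size limit. The function name `chunk_by_paragraph` and the `max_chars` default are hypothetical choices for illustration.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters."""
    # Assumption: a blank line marks a paragraph boundary.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Either the paragraph fits, or it starts a new chunk.
            # An oversized paragraph is kept whole rather than cut mid-sentence.
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    for chunk in chunk_by_paragraph(sample, max_chars=40):
        print(repr(chunk))
```

Because paragraphs are never cut, chunk sizes vary, and a single very long paragraph can exceed `max_chars`.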

3. Document Structure-Aware (Production)

Use document structure (headers, sections, list items) to guide chunking.
Pros:
  • Preserves document hierarchy
  • Each chunk has full context path
  • Excellent for technical docs, APIs
  • Enables section-based filtering
Cons:
  • Requires structured input (HTML, Markdown)
  • Complex implementation
  • Overhead of metadata storage
Use when: Technical documentation, legal contracts, hierarchical content
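The sketch below shows one possible shape for this on Markdown input: it splits on `#` headings and attaches the heading trail to each chunk as a context path. The function name `chunk_markdown` and the `path`/`text` keys are assumptions made for this example; a production version would use a real Markdown parser, since this one would also treat `#` lines inside fenced code blocks as headings.

```python
import re

# ATX-style Markdown heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str) -> list[dict]:
    """Split Markdown at headings; each chunk records its heading path."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (heading level, title)
    body: list[str] = []

    def flush() -> None:
        # Emit the text collected under the current heading path, if any.
        content = "\n".join(body).strip()
        if content:
            trail = " > ".join(title for _, title in path)
            chunks.append({"path": trail, "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "# API\nIntro\n## Auth\nUse tokens.\n## Errors\nCodes."
    for chunk in chunk_markdown(doc):
        print(chunk["path"], "|", chunk["text"])
```

Storing the path alongside the text is what makes section-based filtering possible later, at the cost of the extra metadata noted above.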

Chunking Guidelines by Document Type

| Document Type | Strategy | Chunk Size | Overlap | Why |
|---|---|---|---|---|
| Blog posts | Semantic | 800-1000 chars | 200 | Respect paragraphs |
| Technical docs | Structure-aware | 600-800 chars | 150 | Maintain hierarchy |
| Legal contracts | Structure-aware | 1000-1500 chars | 300 | Keep clauses intact |
| Chat transcripts | Semantic | 500-700 chars | 100 | Conversation turns |
| Research papers | Structure-aware | 1000-1200 chars | 200 | Section context |
| Product manuals | Structure-aware | 600-800 chars | 150 | Step-by-step clarity |
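These guidelines could be captured as a lookup table in code. The sketch below is one way to do that; the `ChunkConfig` class, the dictionary keys, and the fallback default are hypothetical names, the upper bound of each size range is used as the limit, and the overlap is assumed to be in characters like the chunk size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkConfig:
    strategy: str   # "semantic" or "structure-aware"
    max_chars: int  # upper bound of the suggested chunk size range
    overlap: int    # assumed to be in characters


# Values taken from the guidelines for each document type.
CHUNK_CONFIGS = {
    "blog_post": ChunkConfig("semantic", 1000, 200),
    "technical_doc": ChunkConfig("structure-aware", 800, 150),
    "legal_contract": ChunkConfig("structure-aware", 1500, 300),
    "chat_transcript": ChunkConfig("semantic", 700, 100),
    "research_paper": ChunkConfig("structure-aware", 1200, 200),
    "product_manual": ChunkConfig("structure-aware", 800, 150),
}

# Assumption: unknown document types fall back to a general-purpose setting.
DEFAULT_CONFIG = ChunkConfig("semantic", 1000, 200)


def config_for(doc_type: str) -> ChunkConfig:
    """Return the chunking settings for a document type, or the default."""
    return CHUNK_CONFIGS.get(doc_type, DEFAULT_CONFIG)


if __name__ == "__main__":
    print(config_for("legal_contract"))
    print(config_for("spreadsheet"))  # not listed, so the default is used
```

Treat the numbers as starting points to tune against your own retrieval evaluations rather than fixed rules.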

4. Metadata Enrichment (The Secret Weapon)

Good metadata transforms retrieval quality. Don’t just store text; store context. Metadata lets you filter by source, type, section, or date before semantic matching, which narrows the candidate set and improves precision.
Always include:
  • Source identifier (file path, URL, database ID)
  • Timestamp (creation/modification)
  • Chunk position (index, total chunks)
Include when relevant:
  • Author/department (for access control)
  • Document type/category (for filtering)
  • Section/hierarchy (for context)
  • Language (for multilingual)
  • Quality scores (for ranking)
  • Version (for audit trails)
Use when: You have diverse document types and need precise, filterable retrieval
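A minimal sketch of what an enriched chunk record might look like, plus a metadata pre-filter that runs before any semantic matching. The `Chunk` class, `enrich`, and `prefilter` are hypothetical names for illustration; most vector stores expose their own metadata filter syntax, which this plain-Python version only imitates.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Chunk:
    text: str
    source: str            # file path, URL, or database ID
    modified_at: datetime  # creation or modification timestamp
    index: int             # position of this chunk within the document
    total: int             # total number of chunks in the document
    extra: dict = field(default_factory=dict)  # optional fields, e.g. doc_type, language


def enrich(texts: list[str], source: str, modified_at: datetime, **extra) -> list[Chunk]:
    """Attach the always-include fields, plus any optional ones, to each chunk."""
    total = len(texts)
    return [
        Chunk(text, source, modified_at, i, total, dict(extra))
        for i, text in enumerate(texts)
    ]


def prefilter(chunks: list[Chunk], **required) -> list[Chunk]:
    """Keep only chunks whose optional metadata matches every required value."""
    return [
        c for c in chunks
        if all(c.extra.get(key) == value for key, value in required.items())
    ]


if __name__ == "__main__":
    stamp = datetime(2024, 1, 1, tzinfo=timezone.utc)
    records = enrich(["Use tokens.", "Rotate keys."], "docs/api.md", stamp,
                     doc_type="technical", language="en")
    candidates = prefilter(records, doc_type="technical")
    print(len(candidates), "chunks left for semantic matching")
```

Filtering first means the similarity search only scores chunks that could be relevant, which is where the precision gain comes from.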