Before retrieval works, documents must be split into searchable chunks. This page covers fixed-size, semantic, and structure-aware chunking, plus metadata enrichment.
The Chunking Problem
Before retrieval can work, data must be stored and indexed (vector, keyword, or both). You can't index a 50-page PDF as a single blob; documents must be split into chunks focused enough to match against a query. Context windows have grown (200K–1M+ tokens), but chunking is still essential: stuffing entire documents into the prompt wastes tokens, costs more, and dilutes retrieval relevance at scale. How you chunk is a trade-off:
| | Small chunks (100-200 tokens) | Large chunks (1000+ tokens) |
|---|---|---|
| Retrieval precision | High — matches specific queries well | Low — relevant signal gets buried in noise |
| Context preserved | Low — meaning gets cut across boundaries | High — full paragraphs and reasoning intact |
| Index size | More chunks to store and search | Fewer chunks, faster search |
| Best for | Factual Q&A, lookups | Summaries, complex reasoning |
The sweet spot depends on your use case — and you can mix strategies.
Chunking Strategies
1. Fixed-Size Chunking (Simplest)
Split by character or token count with overlap.
Pros:
- Simple to implement
- Predictable chunk sizes
- Fast
Cons:
- May split mid-sentence or mid-concept
- Ignores document structure
- Same strategy for all document types
Use when: Prototyping, homogeneous documents, speed matters most
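A minimal fixed-size chunker fits in a few lines. This is an illustrative sketch; `chunk_fixed` and its `size`/`overlap` defaults are made-up names, not a library API:

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows with overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # each window starts `step` chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap means each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk. Token-based splitting works the same way, just counting tokenizer output instead of characters.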
2. Semantic Chunking (Better)
Split at semantic boundaries (paragraphs, sections).
Pros:
- Respects semantic boundaries
- Keeps related content together
- Better retrieval quality
Cons:
- Variable chunk sizes
- More complex to implement
- Still may split important concepts
Use when: General-purpose RAG, varied document types, quality > speed
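A simple semantic chunker can approximate this by splitting on paragraph breaks and packing consecutive paragraphs up to a target size. `chunk_semantic` and `max_chars` are illustrative names for this sketch:

```python
def chunk_semantic(text: str, max_chars: int = 1000) -> list[str]:
    """Split on blank-line paragraph boundaries, packing paragraphs
    into chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # +2 accounts for the "\n\n" separator re-inserted between paragraphs
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Chunk sizes vary, but no paragraph is ever cut mid-thought; a single paragraph longer than `max_chars` becomes its own oversized chunk, which you may want to sub-split with a fixed-size fallback.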
3. Document Structure-Aware (Production)
Use document structure (headers, sections, list items) to guide chunking.
Pros:
- Preserves document hierarchy
- Each chunk has full context path
- Excellent for technical docs, APIs
- Enables section-based filtering
Cons:
- Requires structured input (HTML, Markdown)
- Complex implementation
- Overhead of metadata storage
Use when: Technical documentation, legal contracts, hierarchical content
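For Markdown input, one way to sketch structure-aware chunking is to split at headings and record each chunk's heading path as metadata. `chunk_markdown` and the `section` key are assumptions for illustration, not a standard API:

```python
import re

def chunk_markdown(text: str) -> list[dict]:
    """Split Markdown at headings; each chunk carries its heading path."""
    chunks: list[dict] = []
    path: dict[int, str] = {}  # heading level -> current title at that level
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            section = " > ".join(title for _, title in sorted(path.items()))
            chunks.append({"text": content, "section": section})
        body.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            path[level] = m.group(2).strip()
            # a new H2 closes any open H3/H4 beneath the previous H2
            for deeper in [lvl for lvl in path if lvl > level]:
                del path[deeper]
        else:
            body.append(line)
    flush()
    return chunks
```

The `section` path (e.g. `"Guide > Setup"`) is what enables section-based filtering at query time, and it can be prepended to the chunk text before embedding so the context survives retrieval.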
Chunking Guidelines by Document Type
| Document Type | Strategy | Chunk Size | Overlap | Why |
|---|---|---|---|---|
| Blog posts | Semantic | 800-1000 char | 200 | Respect paragraphs |
| Technical docs | Structure-aware | 600-800 char | 150 | Maintain hierarchy |
| Legal contracts | Structure-aware | 1000-1500 char | 300 | Keep clauses intact |
| Chat transcripts | Semantic | 500-700 char | 100 | Conversation turns |
| Research papers | Structure-aware | 1000-1200 char | 200 | Section context |
| Product manuals | Structure-aware | 600-800 char | 150 | Step-by-step clarity |
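The table above can be captured as a lookup so an ingestion pipeline picks parameters per document type. `CHUNKING_PROFILES`, the type keys, and the fallback values here are hypothetical, chosen only to mirror the table:

```python
# Hypothetical config mirroring the guidelines table; all names are illustrative.
CHUNKING_PROFILES: dict[str, dict] = {
    "blog_post":      {"strategy": "semantic",        "size": 1000, "overlap": 200},
    "technical_doc":  {"strategy": "structure_aware", "size": 800,  "overlap": 150},
    "legal_contract": {"strategy": "structure_aware", "size": 1500, "overlap": 300},
    "chat":           {"strategy": "semantic",        "size": 700,  "overlap": 100},
    "research_paper": {"strategy": "structure_aware", "size": 1200, "overlap": 200},
    "product_manual": {"strategy": "structure_aware", "size": 800,  "overlap": 150},
}

def profile_for(doc_type: str) -> dict:
    """Look up chunking parameters, falling back to a general-purpose default."""
    return CHUNKING_PROFILES.get(
        doc_type, {"strategy": "semantic", "size": 1000, "overlap": 200}
    )
```

Keeping the mapping in one place makes it easy to tune sizes per corpus after measuring retrieval quality, rather than hard-coding one strategy everywhere.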
Metadata Enrichment
Good metadata transforms retrieval quality. Don't just store text; store context. Metadata lets you filter by source, type, section, or date before semantic matching, dramatically improving precision.
Always include:
- Source identifier (file path, URL, database ID)
- Timestamp (creation/modification)
- Chunk position (index, total chunks)
Include when relevant:
- Author/department (for access control)
- Document type/category (for filtering)
- Section/hierarchy (for context)
- Language (for multilingual)
- Quality scores (for ranking)
- Version (for audit trails)
Use when: You have diverse document types and need precise, filterable retrieval
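Putting it together, a chunk record might bundle the always-include fields with the optional ones, and a metadata pre-filter can narrow candidates before any semantic search runs. `make_chunk_record` and `prefilter` are illustrative sketches, not a particular vector store's API:

```python
from datetime import datetime, timezone

def make_chunk_record(text: str, source: str, index: int, total: int,
                      **extra) -> dict:
    """Attach required and optional metadata to a chunk before indexing."""
    return {
        "text": text,
        "metadata": {
            "source": source,            # file path, URL, or database ID
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "chunk_index": index,        # position within the document
            "chunk_total": total,
            **extra,                     # e.g. doc_type, section, language
        },
    }

def prefilter(records: list[dict], **criteria) -> list[dict]:
    """Keep only records whose metadata matches every given criterion."""
    return [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in criteria.items())
    ]
```

Real vector databases expose the same idea natively (metadata filters applied before or alongside the vector search); the point is that the filterable fields must be written at ingestion time.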