Before retrieval works, documents must be split into searchable chunks. This page covers fixed-size, semantic, and structure-aware chunking, plus metadata enrichment.
The Chunking Problem
Before retrieval works, data must be stored and indexed (vector, keyword, or both). You can’t index a 50-page PDF as a single blob: documents must be split into chunks that are specific enough to match against a query. Context windows have grown (200K to 1M+ tokens), but chunking is still essential, because stuffing entire documents into the prompt wastes tokens, costs more, and dilutes retrieval relevance at scale. How you chunk is a trade-off:
| | Small chunks (100-200 tokens) | Large chunks (1000+ tokens) |
|---|---|---|
| Retrieval precision | High — matches specific queries well | Low — relevant signal gets buried in noise |
| Context preserved | Low — meaning gets cut across boundaries | High — full paragraphs and reasoning intact |
| Index size | More chunks to store and search | Fewer chunks, faster search |
| Best for | Factual Q&A, lookups | Summaries, complex reasoning |
Chunking Strategies
1. Fixed-Size Chunking (Simplest)
Split by character or token count with overlap.
Pros:
- Simple to implement
- Predictable chunk sizes
- Fast
Cons:
- May split mid-sentence or mid-concept
- Ignores document structure
- Same strategy for all document types
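A minimal sketch of fixed-size chunking with overlap, assuming character-based windows. The function name `chunk_fixed` and its default sizes are illustrative choices, not part of any particular library.

```python
def chunk_fixed(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into character windows that share `overlap` characters.

    The defaults are illustrative assumptions; tune them per document type.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, so no redundant tail chunk is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_fixed("abcdefghij", chunk_size=4, overlap=1):
        print(chunk)
```

Because the window advances by `chunk_size - overlap`, the last character of one chunk is repeated at the start of the next. This softens the mid-sentence splits listed above but does not remove them.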
2. Semantic Chunking (Better)
Split at semantic boundaries (paragraphs, sections).
Pros:
- Respects semantic boundaries
- Keeps related content together
- Better retrieval quality
Cons:
- Variable chunk sizes
- More complex to implement
- Still may split important concepts
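One way to sketch this is to treat blank-line-separated paragraphs as the semantic boundary and pack whole paragraphs into a chunk until a size budget is reached. The name `chunk_by_paragraph` and the `max_chars` budget are assumptions for illustration. Embedding-based boundary detection is a different technique and is not shown here.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Group whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept intact as its own chunk
    rather than being cut mid-paragraph (an illustrative design choice).
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the open chunk
        else:
            chunks.append(current)  # close the chunk at a paragraph boundary
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Chunk sizes vary with paragraph length, which is the "variable chunk sizes" cost noted above.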
3. Document Structure-Aware (Production)
Use document structure (headers, sections, list items) to guide chunking.
Pros:
- Preserves document hierarchy
- Each chunk has full context path
- Excellent for technical docs, APIs
- Enables section-based filtering
Cons:
- Requires structured input (HTML, Markdown)
- Complex implementation
- Overhead of metadata storage
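As a rough illustration for Markdown input, the sketch below splits on ATX headings (`#` to `######`) and attaches the heading trail to each chunk as its context path. `SectionChunk` and `chunk_markdown` are hypothetical names. The sketch assumes no fenced code blocks in the input, since a `#` comment inside a fence would be misread as a heading.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class SectionChunk:
    path: list[str]  # heading trail, e.g. ["API", "Auth"]
    text: str        # body text directly under the last heading in the trail


def chunk_markdown(markdown: str) -> list[SectionChunk]:
    """Emit one chunk per section, tagged with its heading hierarchy."""
    chunks: list[SectionChunk] = []
    path: list[str] = []
    body: list[str] = []

    def flush() -> None:
        # Close the section collected so far, skipping empty bodies.
        text = "\n".join(body).strip()
        if text:
            chunks.append(SectionChunk(path=list(path), text=text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Storing `path` alongside the text is what enables section-based filtering later. Long sections could be split further with a size-based pass.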
Chunking Guidelines by Document Type
| Document Type | Strategy | Chunk Size | Overlap | Why |
|---|---|---|---|---|
| Blog posts | Semantic | 800-1000 char | 200 | Respect paragraphs |
| Technical docs | Structure-aware | 600-800 char | 150 | Maintain hierarchy |
| Legal contracts | Structure-aware | 1000-1500 char | 300 | Keep clauses intact |
| Chat transcripts | Semantic | 500-700 char | 100 | Conversation turns |
| Research papers | Structure-aware | 1000-1200 char | 200 | Section context |
| Product manuals | Structure-aware | 600-800 char | 150 | Step-by-step clarity |
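The table above can be carried into code as a small lookup, sketched here. The key names, the `ChunkingProfile` type, and the reading of the overlap column as characters are assumptions for illustration. The numbers are copied from the table.

```python
from typing import NamedTuple


class ChunkingProfile(NamedTuple):
    strategy: str
    min_chars: int
    max_chars: int
    overlap: int  # assumed to be in characters, like the chunk size


# Values taken from the guidelines table; the key names are hypothetical.
PROFILES = {
    "blog_post": ChunkingProfile("semantic", 800, 1000, 200),
    "technical_doc": ChunkingProfile("structure-aware", 600, 800, 150),
    "legal_contract": ChunkingProfile("structure-aware", 1000, 1500, 300),
    "chat_transcript": ChunkingProfile("semantic", 500, 700, 100),
    "research_paper": ChunkingProfile("structure-aware", 1000, 1200, 200),
    "product_manual": ChunkingProfile("structure-aware", 600, 800, 150),
}


def profile_for(doc_type: str) -> ChunkingProfile:
    """Look up chunking settings, failing loudly on unknown document types."""
    try:
        return PROFILES[doc_type]
    except KeyError:
        known = ", ".join(sorted(PROFILES))
        raise ValueError(f"unknown document type {doc_type!r}; known: {known}") from None
```

Raising on unknown types instead of falling back to a default keeps a mislabeled document from being silently chunked with the wrong settings.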
4. Metadata Enrichment (The Secret Weapon)
Good metadata transforms retrieval quality. Don’t just store text; store context. Metadata lets you filter by source, type, section, or date before semantic matching, which narrows the candidate set and improves precision.
Always include:
- Source identifier (file path, URL, database ID)
- Timestamp (creation/modification)
- Chunk position (index, total chunks)
- Author/department (for access control)
- Document type/category (for filtering)
- Section/hierarchy (for context)
- Language (for multilingual)
- Quality scores (for ranking)
- Version (for audit trails)
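A minimal sketch of attaching a few of these fields to each chunk and filtering on them before any semantic matching. The `Chunk` record, the field names, and the helper functions are hypothetical. A real vector store would typically offer its own metadata filter syntax.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Chunk:
    text: str
    metadata: dict


def enrich(chunks: list[str], source: str, doc_type: str, modified: datetime) -> list[Chunk]:
    """Wrap raw chunk texts with source, type, timestamp, and position metadata."""
    total = len(chunks)
    return [
        Chunk(
            text=text,
            metadata={
                "source": source,
                "doc_type": doc_type,
                "modified": modified.isoformat(),
                "chunk_index": index,
                "total_chunks": total,
            },
        )
        for index, text in enumerate(chunks)
    ]


def filter_chunks(chunks: list[Chunk], **required) -> list[Chunk]:
    """Keep only chunks whose metadata equals every required key/value pair."""
    return [
        chunk
        for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in required.items())
    ]
```

Calling `filter_chunks(records, doc_type="technical_doc")` before similarity search means only chunks of that type are scored, which is the pre-filtering step described above.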