Before retrieval works, documents must be split into searchable chunks. This page covers fixed-size, semantic, and structure-aware chunking, plus metadata enrichment.
The Chunking Problem
Before retrieval can work, data must be stored and indexed (vector, keyword, or both). You can't index a 50-page PDF as a single blob; documents must be split into chunks focused enough to match against a query. Context windows have grown (200K–1M+ tokens), but chunking is still essential: stuffing entire documents into the prompt wastes tokens, costs more, and dilutes retrieval relevance at scale. How you chunk is a trade-off:
| | Small chunks (100-200 tokens) | Large chunks (1000+ tokens) |
|---|---|---|
| Retrieval precision | High — matches specific queries well | Low — relevant signal gets buried in noise |
| Context preserved | Low — meaning gets cut across boundaries | High — full paragraphs and reasoning intact |
| Index size | More chunks to store and search | Fewer chunks, faster search |
| Best for | Factual Q&A, lookups | Summaries, complex reasoning |
The sweet spot depends on your use case — and you can mix strategies.
Chunking Strategies
1. Fixed-Size Chunking (Simplest)
Split by character or token count with overlap.
Pros:
- Simple to implement
- Predictable chunk sizes
- Fast
Cons:
- May split mid-sentence or mid-concept
- Ignores document structure
- Same strategy for all document types
Use when: Prototyping, homogeneous documents, speed matters most
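A minimal fixed-size chunker fits in a few lines. This is an illustrative sketch; `chunk_fixed` and its `size`/`overlap` defaults are made-up names, not a library API:

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows with overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # each window starts `step` chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap means each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk. Token-based splitting works the same way, just counting tokenizer output instead of characters.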
2. Semantic Chunking (Better)
Split at semantic boundaries (paragraphs, sections).
Pros:
- Respects semantic boundaries
- Keeps related content together
- Better retrieval quality
Cons:
- Variable chunk sizes
- More complex to implement
- Still may split important concepts
Use when: General-purpose RAG, varied document types, quality > speed
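A simple semantic chunker can approximate this by splitting on paragraph breaks and packing consecutive paragraphs up to a target size. `chunk_semantic` and `max_chars` are illustrative names for this sketch:

```python
def chunk_semantic(text: str, max_chars: int = 1000) -> list[str]:
    """Split on blank-line paragraph boundaries, packing paragraphs
    into chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # +2 accounts for the "\n\n" separator re-inserted between paragraphs
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Chunk sizes vary, but no paragraph is ever cut mid-thought; a single paragraph longer than `max_chars` becomes its own oversized chunk, which you may want to sub-split with a fixed-size fallback.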
3. Document Structure-Aware (Production)
Use document structure (headers, sections, list items) to guide chunking.
Pros:
- Preserves document hierarchy
- Each chunk has full context path
- Excellent for technical docs, APIs
- Enables section-based filtering
Cons:
- Requires structured input (HTML, Markdown)
- Complex implementation
- Overhead of metadata storage
Use when: Technical documentation, legal contracts, hierarchical content
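For Markdown input, one way to sketch structure-aware chunking is to split at headings and record each chunk's heading path as metadata. `chunk_markdown` and the `section` key are assumptions for illustration, not a standard API:

```python
import re

def chunk_markdown(text: str) -> list[dict]:
    """Split Markdown at headings; each chunk carries its heading path."""
    chunks: list[dict] = []
    path: dict[int, str] = {}  # heading level -> current title at that level
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            section = " > ".join(title for _, title in sorted(path.items()))
            chunks.append({"text": content, "section": section})
        body.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            path[level] = m.group(2).strip()
            # a new H2 closes any open H3/H4 beneath the previous H2
            for deeper in [lvl for lvl in path if lvl > level]:
                del path[deeper]
        else:
            body.append(line)
    flush()
    return chunks
```

The `section` path (e.g. `"Guide > Setup"`) is what enables section-based filtering at query time, and it can be prepended to the chunk text before embedding so the context survives retrieval.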
Chunking Guidelines by Document Type
| Document Type | Strategy | Chunk Size | Overlap | Why |
|---|---|---|---|---|
| Blog posts | Semantic | 800-1000 char | 200 | Respect paragraphs |
| Technical docs | Structure-aware | 600-800 char | 150 | Maintain hierarchy |
| Legal contracts | Structure-aware | 1000-1500 char | 300 | Keep clauses intact |
| Chat transcripts | Semantic | 500-700 char | 100 | Conversation turns |
| Research papers | Structure-aware | 1000-1200 char | 200 | Section context |
| Product manuals | Structure-aware | 600-800 char | 150 | Step-by-step clarity |
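The table above can be captured as a lookup so an ingestion pipeline picks parameters per document type. `CHUNKING_PROFILES`, the type keys, and the fallback values here are hypothetical, chosen only to mirror the table:

```python
# Hypothetical config mirroring the guidelines table; all names are illustrative.
CHUNKING_PROFILES: dict[str, dict] = {
    "blog_post":      {"strategy": "semantic",        "size": 1000, "overlap": 200},
    "technical_doc":  {"strategy": "structure_aware", "size": 800,  "overlap": 150},
    "legal_contract": {"strategy": "structure_aware", "size": 1500, "overlap": 300},
    "chat":           {"strategy": "semantic",        "size": 700,  "overlap": 100},
    "research_paper": {"strategy": "structure_aware", "size": 1200, "overlap": 200},
    "product_manual": {"strategy": "structure_aware", "size": 800,  "overlap": 150},
}

def profile_for(doc_type: str) -> dict:
    """Look up chunking parameters, falling back to a general-purpose default."""
    return CHUNKING_PROFILES.get(
        doc_type, {"strategy": "semantic", "size": 1000, "overlap": 200}
    )
```

Keeping the mapping in one place makes it easy to tune sizes per corpus after measuring retrieval quality, rather than hard-coding one strategy everywhere.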
Metadata Enrichment
Good metadata transforms retrieval quality. Don't just store text; store context. Metadata lets you filter by source, type, section, or date before semantic matching, dramatically improving precision.
Always include:
- Source identifier (file path, URL, database ID)
- Timestamp (creation/modification)
- Chunk position (index, total chunks)
Include when relevant:
- Author/department (for access control)
- Document type/category (for filtering)
- Section/hierarchy (for context)
- Language (for multilingual)
- Quality scores (for ranking)
- Version (for audit trails)
Use when: You have diverse document types and need precise, filterable retrieval
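Putting it together, a chunk record might bundle the always-include fields with the optional ones, and a metadata pre-filter can narrow candidates before any semantic search runs. `make_chunk_record` and `prefilter` are illustrative sketches, not a particular vector store's API:

```python
from datetime import datetime, timezone

def make_chunk_record(text: str, source: str, index: int, total: int,
                      **extra) -> dict:
    """Attach required and optional metadata to a chunk before indexing."""
    return {
        "text": text,
        "metadata": {
            "source": source,            # file path, URL, or database ID
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "chunk_index": index,        # position within the document
            "chunk_total": total,
            **extra,                     # e.g. doc_type, section, language
        },
    }

def prefilter(records: list[dict], **criteria) -> list[dict]:
    """Keep only records whose metadata matches every given criterion."""
    return [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in criteria.items())
    ]
```

Real vector databases expose the same idea natively (metadata filters applied before or alongside the vector search); the point is that the filterable fields must be written at ingestion time.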