Why Multi-Stage Retrieval?
A single retrieval pass — whether lexical or semantic — returns noisy results. Relevant documents get buried, irrelevant ones slip through, and the LLM generates worse answers as a result. Multi-stage retrieval fixes this by progressively filtering and re-scoring candidates before they reach the LLM.
When you need it:
- Your RAG answers are inconsistent — sometimes great, sometimes wrong
- Relevant documents exist but don’t appear in top results
- You’re using both keyword and semantic search and want the best of both
Stage 1 (Fast Retrieval) - Maximize Recall:
- Goal: Don’t miss relevant documents — cast a wide net
- Method: Retrieve many candidates (e.g., top-50) using fast methods like lexical and vector search in parallel
- Trade-off: Fast but includes some noise/irrelevant results
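Stage 1 can be sketched as two cheap retrievers run in parallel. The corpus, keyword-overlap scorer, and character-overlap "embedding" proxy below are toy stand-ins for a real BM25 index and vector store:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy corpus; in practice these live in a BM25 index and a vector DB.
DOCS = {
    "d1": "reciprocal rank fusion merges ranked lists",
    "d2": "cross encoders rerank candidate documents",
    "d3": "vector search finds semantically similar text",
}

def lexical_search(query, top_k=50):
    # Stand-in for BM25: score documents by keyword overlap with the query.
    terms = set(query.lower().split())
    scored = {d: len(terms & set(t.split())) for d, t in DOCS.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

def vector_search(query, top_k=50):
    # Stand-in for embedding similarity: crude character-overlap proxy.
    scored = {d: len(set(query.lower()) & set(t)) for d, t in DOCS.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

# Cast a wide net: run both retrievers concurrently, keep top-50 from each.
with ThreadPoolExecutor() as pool:
    lex_future = pool.submit(lexical_search, "rank fusion", 50)
    sem_future = pool.submit(vector_search, "rank fusion", 50)
candidates = lex_future.result(), sem_future.result()
```

The two candidate lists are intentionally kept separate here; merging them is the job of the next stage.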
Stage 2 (Rank Fusion) - Merge Results:
- Goal: Combine rankings from multiple retrievers into a single list
- Method: Reciprocal Rank Fusion (RRF) scores each document by its rank across retrievers:
  score(d) = Σ_retrievers 1/(k + rank(d))
- Trade-off: Near-instant; captures the complementary strengths of lexical and semantic search
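The RRF formula above translates directly into code; k=60 is a commonly used smoothing constant:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one fused ranking.

    rankings: list of lists, each ordered best-first (rank 1 = best).
    k: smoothing constant that dampens the influence of top ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # score(d) = Σ 1/(k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked well by both retrievers rises to the top.
lexical = ["d3", "d1", "d7"]
semantic = ["d1", "d5", "d3"]
fused = reciprocal_rank_fusion([lexical, semantic])
# fused -> ["d1", "d3", "d5", "d7"]
```

Note that RRF needs only ranks, not scores, so it merges lexical and semantic results without any score normalization.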
Stage 3 (Reranking) - Maximize Precision:
- Goal: Keep only the truly relevant documents — filter out the noise
- Method: Use a more accurate cross-encoder model to deeply analyze each candidate and rerank them
- Trade-off: Slower but much more accurate at identifying relevance
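A minimal reranking sketch: the helper below batches (query, document) pairs through a scoring function and keeps the top_k. The token-overlap scorer is a toy stand-in; in practice score_fn would be a real cross-encoder's batch predict method:

```python
def rerank(query, docs, score_fn, top_k=5):
    """Score each (query, doc) pair and keep the top_k documents.

    score_fn: callable taking a list of (query, doc) pairs and returning
    a list of relevance scores -- e.g. a cross-encoder model's predictor.
    """
    pairs = [(query, d) for d in docs]
    scores = score_fn(pairs)  # batch scoring for throughput
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [d for d, _ in ranked[:top_k]]

# Toy scorer: token overlap stands in for a real cross-encoder.
def overlap_scorer(pairs):
    return [len(set(q.split()) & set(d.split())) for q, d in pairs]

top = rerank(
    "rank fusion",
    ["rank fusion merges lists", "unrelated text"],
    overlap_scorer,
    top_k=1,
)
# top -> ["rank fusion merges lists"]
```

Because the cross-encoder sees the query and document together, it can judge relevance far better than the fast retrievers, which is why it runs only on the small fused candidate pool.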
Stage 4 (LLM Generation) - Generate Answer:
- Goal: Produce a grounded response using the top-ranked context
- Method: Feed the reranked documents into the LLM prompt
- Trade-off: Quality depends on all previous stages
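Feeding the reranked documents into the prompt can be as simple as numbering each passage so the model can cite its sources; the prompt wording below is one illustrative choice, not a fixed template:

```python
def build_prompt(question, docs):
    """Assemble a grounded prompt from the reranked context documents."""
    context = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return (
        "Answer using only the context below. "
        "Cite passages by their [number].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt("What does RRF do?", ["RRF merges ranked lists."])
```

Keeping only the top-ranked handful of documents keeps the prompt short, which cuts token cost and reduces the chance the model grounds its answer in an irrelevant passage.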
Why combining stages works:
- Fast retrievers are good at finding candidates but not great at ranking them
- Cross-encoders are excellent at ranking but too slow to run on thousands of documents
- Combining both gives you speed + accuracy
The 4-Stage Pipeline In Practice
The complete pipeline:
- First-pass retrieval (lexical + semantic in parallel) — cast a wide net (top-50)
- Reciprocal Rank Fusion (RRF) — Merge results into a single ranked list
- Cross-encoder reranking — Keep only the best (top-5)
- LLM generation — Generate a grounded answer from the top results
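The four stages wire together into a single function. All the callables here are hypothetical arguments supplied by the caller (the lambdas only fake their outputs for illustration):

```python
def multi_stage_retrieve(query, lexical_fn, semantic_fn, fuse_fn, rerank_fn,
                         pool_size=50, top_k=5):
    """Run the pipeline: wide retrieval -> rank fusion -> reranked cut.

    lexical_fn / semantic_fn: return ranked doc lists for the query.
    fuse_fn: merges the ranked lists (e.g. RRF).
    rerank_fn: scores and trims the fused pool (e.g. a cross-encoder).
    """
    lex = lexical_fn(query, pool_size)      # stage 1a: lexical candidates
    sem = semantic_fn(query, pool_size)     # stage 1b: semantic candidates
    fused = fuse_fn([lex, sem])             # stage 2: merge into one list
    return rerank_fn(query, fused, top_k)   # stage 3: precise final cut

# Toy wiring with faked stage outputs, just to show the data flow:
result = multi_stage_retrieve(
    "rank fusion",
    lexical_fn=lambda q, n: ["d1", "d2"],
    semantic_fn=lambda q, n: ["d2", "d3"],
    fuse_fn=lambda ranks: ["d2", "d1", "d3"],  # pretend RRF output
    rerank_fn=lambda q, docs, k: docs[:k],     # pretend cross-encoder cut
    top_k=2,
)
# result -> ["d2", "d1"]
```

The returned top_k documents are what stage 4 feeds into the LLM prompt.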
Practical tips:
- Keep the candidate pool small (e.g., 20-100) and the final top_k small (3-5).
- Run first-pass lexical + semantic in parallel; batch rerank scoring for throughput.
- Log latency and token/call costs during the lab.