What is Reranking?
Reranking is a two-stage retrieval approach for better accuracy:
Stage 1 (Fast Retrieval) - Maximize Recall:
- Goal: Don’t miss relevant documents - cast a wide net
- Method: Retrieve many candidates (e.g., top-50) using fast methods like lexical or vector search
- Trade-off: Fast but includes some noise/irrelevant results
Stage 2 (Reranking) - Maximize Precision:
- Goal: Keep only the truly relevant documents - filter out the noise
- Method: Use a more accurate cross-encoder model that scores each query-candidate pair jointly, then reorder the candidates by that score
- Trade-off: Slower but much more accurate at identifying relevance
Why two stages:
- Fast retrievers are good at finding candidates but not great at ranking them
- Cross-encoders are excellent at ranking but too slow to run on thousands of documents
- Combining both gives you speed + accuracy
Minimal rerank pipeline
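A minimal sketch of the rerank step, assuming the candidates have already been retrieved. The names `rerank`, `PairScorer`, and `overlap_scorer` are hypothetical and used only for illustration; `overlap_scorer` is a toy stand-in for a cross-encoder so the example runs without any model download. In practice a real pair scorer (for example the `predict` method of a sentence-transformers `CrossEncoder`) could probably be passed in its place, but that wiring is an assumption and is not shown here.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer type: takes (query, document) pairs, returns one score per pair.
PairScorer = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def rerank(
    query: str,
    candidates: List[str],
    score_pairs: PairScorer,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and return the top_k by descending score."""
    if not candidates:
        return []
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)  # one batched call instead of one call per document
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def overlap_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Toy stand-in for a cross-encoder: fraction of query terms found in the document."""
    scores = []
    for query, doc in pairs:
        q_terms = set(query.lower().split())
        d_terms = set(doc.lower().split())
        scores.append(len(q_terms & d_terms) / max(len(q_terms), 1))
    return scores


docs = [
    "cats sleep all day",
    "how to rerank search results",
    "rerank results with a cross-encoder",
]
top = rerank("rerank search results", docs, overlap_scorer, top_k=2)
```

Passing the scorer in as a callable keeps the rerank logic independent of any particular model, which makes it easy to swap scorers during the lab.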
Integrating with your retriever
The complete pipeline shows:
- First-pass retrieval (lexical + semantic) - Cast a wide net (top-50)
- Reciprocal Rank Fusion - Merge the results
- Cross-encoder reranking - Keep only the best (top-5)
Tips:
- Keep the candidate pool small (e.g., 20-100) and the final top_k small (3-5).
- Run first-pass lexical + semantic in parallel; batch rerank scoring for throughput.
- Log latency and token/call costs during the lab.
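The Reciprocal Rank Fusion step in the pipeline above can be sketched as follows. Each document's fused score is the sum of 1 / (k + rank) over every ranked list it appears in. The function name `reciprocal_rank_fusion` is hypothetical, and `k = 60` is a commonly used default rather than a value taken from this lab.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top hit in a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lexical_hits = ["a", "b", "c"]
semantic_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([lexical_hits, semantic_hits])
```

Because RRF uses only ranks, it needs no score normalization between the lexical and semantic retrievers, whose raw scores are typically on different scales.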
Practical Exercise (15 min)
Production checklist callout
- Retrieve top-50 in parallel (lexical + semantic) → fuse, then cross-encoder rerank the top-20 → keep top-5 for generation.
- Log per-stage latency and cost; adjust candidate pool and rerank depth to stay within budget.
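One way to follow the checklist is to wrap each stage in a timer and run the two first-pass searches concurrently. This is a sketch under stated assumptions: `timed`, `run_pipeline`, and the three callables passed in (`lexical_search`, `semantic_search`, `rerank_fn`) are hypothetical names, the interleaving merge is a simple stand-in for a fusion method such as RRF, and the number of reranked candidates is used as a rough proxy for rerank cost.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager
from itertools import zip_longest
from typing import Callable, Dict, List, Tuple


@contextmanager
def timed(stage: str, stats: Dict[str, float]):
    """Record wall-clock seconds spent in a stage under stats['<stage>_s']."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stats[f"{stage}_s"] = time.perf_counter() - start


def run_pipeline(
    query: str,
    lexical_search: Callable[[str, int], List[str]],   # hypothetical retriever
    semantic_search: Callable[[str, int], List[str]],  # hypothetical retriever
    rerank_fn: Callable[[str, List[str]], List[str]],  # hypothetical reranker
    pool_size: int = 50,
    rerank_depth: int = 20,
    top_k: int = 5,
) -> Tuple[List[str], Dict[str, float]]:
    stats: Dict[str, float] = {}
    with timed("retrieve", stats):
        # Run both first-pass searches concurrently.
        with ThreadPoolExecutor(max_workers=2) as pool:
            lex = pool.submit(lexical_search, query, pool_size)
            sem = pool.submit(semantic_search, query, pool_size)
            lexical_hits, semantic_hits = lex.result(), sem.result()
    with timed("merge", stats):
        # Interleave the two lists, drop duplicates, keep the first rerank_depth.
        interleaved = [
            doc
            for pair in zip_longest(lexical_hits, semantic_hits)
            for doc in pair
            if doc is not None
        ]
        merged = list(dict.fromkeys(interleaved))[:rerank_depth]
    with timed("rerank", stats):
        final = rerank_fn(query, merged)[:top_k]
    stats["rerank_pairs"] = len(merged)  # rough cost proxy: pairs scored
    return final, stats
```

Comparing `retrieve_s` with `rerank_s` while varying `pool_size` and `rerank_depth` shows where the latency budget is going and how far the rerank depth can be raised before it dominates.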