Why Evaluation is Non-Negotiable

You’ve built an incredible, multi-stage RAG pipeline. Now for the hard truth: it’s failing. In production, RAG systems usually fail not because the LLM is bad, but because the context it receives is subtly wrong. A poor retrieval result, even if ranked fifth, can derail the answer. Without a clear evaluation framework, you’re debugging blind, relying on “vibe checks” that fail in production.

What Problem It Solves: Systematic evaluation isolates failures. It answers: did the system hallucinate because the relevant document wasn’t found (Retrieval Failure), or because the LLM ignored the document that was found (Generation Failure)? This single distinction saves countless hours of wasted effort on prompt tuning when the real fix is adjusting your chunking strategy.

Real-World Example: Teams often spend weeks trying to fix LLM hallucination with better prompts, only to discover the root cause was a weak embedding model missing critical, domain-specific documents. We’ll learn to diagnose the real issue upfront.

The Two-Part Evaluation

The RAG pipeline has two core components that must be evaluated and debugged independently: Retrieval and Generation.
  1. Retrieval Quality: Did we find the right documents?
  2. Generation Quality: Did the LLM use them correctly?
Query: "What's our refund policy?"

                  v
          +---------------+
          │   Retrieval   │ ← Evaluate: Are docs relevant?
          +---------------+

        Retrieved: Doc A, Doc B, Doc C

                  v
          +---------------+
          │  Generation   │ ← Evaluate: Is answer faithful?
          +---------------+

                  v
          Answer: "30-day returns..."

The ARES Framework for Failure Attribution

We’ll use a simplified version of the ARES framework (Automated RAG Evaluation System) to diagnose every failure based on three component checks:
| Check | Component | Goal | Metric Type |
| --- | --- | --- | --- |
| Context Relevance | Retrieval | Did we retrieve the necessary ground-truth document(s)? | Recall@k, NDCG@k |
| Answer Faithfulness | Generation | Is the answer grounded only in the retrieved context? | LLM-as-Judge (Faithfulness) |
| Answer Relevance | Generation | Does the answer address the original query? | LLM-as-Judge (Relevance) |
If you get a wrong answer, run this checklist: if Context Relevance is low, fix the retriever; if Context Relevance is high but Faithfulness or Answer Relevance is low, fix the generator’s prompt or the reranker.
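To make the checklist concrete, here is a minimal TypeScript sketch of the attribution logic; the score names and thresholds are illustrative assumptions (taken from the targets later in this lesson), not something prescribed by ARES.

interface EvalScores {
  contextRelevance: number; // e.g. Recall@k against your golden dataset
  faithfulness: number;     // LLM-as-Judge score in [0, 1]
  answerRelevance: number;  // LLM-as-Judge score in [0, 1]
}

// Thresholds are illustrative; tune them to your own golden-dataset baselines.
function attributeFailure(s: EvalScores): "retrieval" | "generation" | "none" {
  if (s.contextRelevance < 0.7) return "retrieval";   // fix chunking, embeddings, or the retriever
  if (s.faithfulness < 0.9 || s.answerRelevance < 0.85) return "generation"; // fix the prompt or reranker
  return "none";
}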

Part 1: Retrieval Metrics (Did we find it?)

Recall@k: Are the relevant docs in the top-k?
  • Definition: What fraction of relevant documents did we retrieve?
  • When to use: Measuring if all important docs are findable
  • Target: Recall@10 > 0.90 (catch 90%+ relevant docs in first 10)
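A minimal sketch of Recall@k in TypeScript, assuming each test case carries a set of ground-truth relevant document IDs (the signature is an assumption for illustration, not a specific library API):

// Fraction of the ground-truth relevant docs that appear in the top-k results.
function recallAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  if (relevantIds.size === 0) return 0;
  const topK = new Set(retrievedIds.slice(0, k));
  let hits = 0;
  for (const id of relevantIds) if (topK.has(id)) hits++;
  return hits / relevantIds.size;
}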
Precision@k: Are the retrieved docs relevant?
  • Definition: What fraction of retrieved documents are actually relevant?
  • When to use: Measuring retrieval noise/irrelevance
  • Target: Precision@5 > 0.80 (80%+ of top-5 are useful)
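Precision@k is the mirror image, again sketched with binary relevance labels:

// Fraction of the top-k results that are actually relevant.
function precisionAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  if (k === 0) return 0;
  const hits = retrievedIds.slice(0, k).filter((id) => relevantIds.has(id)).length;
  return hits / k;
}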
NDCG@k: Are relevant docs ranked higher?
  • Definition: Normalized Discounted Cumulative Gain - considers ranking order.
  • Why it matters: Getting a relevant doc at position 1 is better than at position 10.
  • When to use: Measuring overall retrieval quality (position matters)
  • Target: NDCG@10 > 0.80 (strong ranking)
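A sketch of NDCG@k with binary relevance labels (graded labels would use a gain of 2^rel - 1 instead of 1):

// DCG of the actual ranking divided by the DCG of an ideal ranking.
function ndcgAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  const dcg = retrievedIds
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevantIds.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(relevantIds.size, k); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}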
Mean Reciprocal Rank (MRR): How quickly do we find relevance?
  • Definition: Average over queries of 1/rank of the first relevant doc.
  • When to use: User-focused metric (fast discovery matters)
  • Target: MRR > 0.80 (relevant doc in top ~1-2 results)
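Reciprocal rank is computed per query and then averaged across the test set:

// 1/rank of the first relevant doc for a single query (0 if none retrieved).
function reciprocalRank(retrievedIds: string[], relevantIds: Set<string>): number {
  const idx = retrievedIds.findIndex((id) => relevantIds.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// MRR: the mean of reciprocal ranks over all queries in the test set.
function meanReciprocalRank(queries: { retrievedIds: string[]; relevantIds: Set<string> }[]): number {
  if (queries.length === 0) return 0;
  return queries.reduce((s, q) => s + reciprocalRank(q.retrievedIds, q.relevantIds), 0) / queries.length;
}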

Setup and Metric Calculation

We’ll use hit_rate (a simple form of recall), MRR, Precision, and NDCG.
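Here is a sketch of how the pieces fit together, reusing the metric functions above. The Retriever interface and test-case shape are assumptions for illustration; LlamaIndex ships its own retriever evaluators that you can substitute.

interface RetrievalTestCase { query: string; relevantIds: Set<string>; }
interface Retriever { retrieve(query: string, k: number): Promise<string[]>; } // returns ranked doc IDs

async function evaluateRetrieval(retriever: Retriever, cases: RetrievalTestCase[]) {
  if (cases.length === 0) throw new Error("no test cases");
  let hit = 0, recall = 0, precision = 0, ndcg = 0, mrr = 0;
  for (const c of cases) {
    const ids = await retriever.retrieve(c.query, 10);
    const rr = reciprocalRank(ids, c.relevantIds);
    hit += rr > 0 ? 1 : 0;                              // hit_rate: any relevant doc in the top 10
    recall += recallAtK(ids, c.relevantIds, 10);
    precision += precisionAtK(ids, c.relevantIds, 5);
    ndcg += ndcgAtK(ids, c.relevantIds, 10);
    mrr += rr;
  }
  const n = cases.length;
  return { hitRate: hit / n, recall10: recall / n, precision5: precision / n, ndcg10: ndcg / n, mrr: mrr / n };
}

Compare each average against the targets above (Recall@10 > 0.90, Precision@5 > 0.80, NDCG@10 > 0.80, MRR > 0.80) to decide whether the retriever needs work.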

Part 2: Generation Quality

Now evaluate if the LLM used retrieved context correctly.

Faithfulness: Is the answer grounded in context?

  • Definition: Does the answer only contain info from retrieved docs?
  • Target: Faithfulness > 0.90 (less than 10% hallucination)

Relevance: Does the answer address the query?

  • Definition: Is the answer on-topic and helpful for the question?
  • Target: Relevance > 0.85 (answers actually helpful)

LLM-as-Judge

Traditional evaluation metrics require labeled ground truth, but for generation quality (faithfulness and relevance) we can leverage LLMs themselves as evaluators. The LLM-as-Judge approach uses a separate, more capable LLM (like GPT-4) as an impartial judge that scores whether a response is faithful to the retrieved context and relevant to the original query.

Main advantage: The primary benefit of LLM-as-Judge is that it works in production environments where there is no ground truth. Unlike retrieval metrics, which require manually labeled relevant documents, or traditional answer evaluation, which needs expected answers, LLM-as-Judge can evaluate any query-response pair in real time by comparing the response against the retrieved context. This makes it ideal for continuous monitoring of production RAG systems.

Why it works: Modern LLMs are remarkably good at understanding semantic relationships and detecting inconsistencies. Given a query, a response, and the source context, a judge LLM can reliably determine whether the answer is grounded in the provided documents (faithfulness) and whether it actually addresses what was asked (relevance).

Critical requirement: Before deploying LLM-as-Judge in production, you must evaluate and calibrate the judge itself using ground truth data. Create a validation set with human-annotated examples (e.g., 50-100 cases where you know the correct faithfulness/relevance scores), run the judge on these cases, and measure its agreement with the human labels. If agreement falls below a threshold you trust (e.g., 85%), adjust the judge model, prompts, or temperature settings. This calibration step ensures your judge is actually judging correctly before you rely on it for production monitoring.

Trade-offs: While faster and more scalable than human evaluation, LLM-as-Judge has costs (API calls to the judge model) and can occasionally disagree with human annotators on edge cases. Use it for continuous monitoring and initial evaluation, but validate critical failures with human review.
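A minimal judge sketch using the OpenAI Node SDK; the model name, prompt wording, and 0-1 scoring scale are illustrative assumptions, and any sufficiently strong chat model can stand in as the judge.

import OpenAI from "openai";

const judge = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Ask the judge model to score how well the answer is grounded in the retrieved context.
async function judgeFaithfulness(context: string, answer: string): Promise<number> {
  const prompt =
    "You are grading a RAG answer. On a scale from 0 to 1, how fully is the ANSWER " +
    "supported by the CONTEXT alone? 1 = every claim is grounded, 0 = unsupported. " +
    "Reply with only the number.\n\n" +
    `CONTEXT:\n${context}\n\nANSWER:\n${answer}`;
  const res = await judge.chat.completions.create({
    model: "gpt-4o",   // example judge model; use whatever strong model you trust
    temperature: 0,
    messages: [{ role: "user", content: prompt }],
  });
  return parseFloat(res.choices[0].message.content ?? "0");
}

// judgeRelevance(query, answer) follows the same pattern, asking whether the answer
// addresses the QUERY rather than whether it is grounded in the CONTEXT.

For the calibration step described above, run this judge over your human-annotated validation set and report the fraction of cases where the judge and the annotator land on the same side of your pass/fail threshold.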

Building a Golden Dataset

The most important step: You need ground truth test cases.

How to Create a Golden Dataset

Golden dataset best practices:
  • Size: 50-100 queries minimum, 500+ ideal
  • Diversity: Cover all query types (factual, conceptual, multi-hop)
  • Difficulty: Include easy and hard cases
  • Updates: Refresh quarterly as corpus changes
  • Multiple annotators: Inter-annotator agreement > 0.80
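One possible shape for a golden-dataset entry; the field names are assumptions, so adapt them to whatever your pipeline already stores.

interface GoldenCase {
  id: string;
  query: string;
  relevantDocIds: string[];   // ground-truth chunks the retriever should surface
  expectedAnswer?: string;    // optional reference answer for generation checks
  queryType: "factual" | "conceptual" | "multi-hop";
  difficulty: "easy" | "hard";
}

const example: GoldenCase = {
  id: "refund-001",
  query: "What's our refund policy?",
  relevantDocIds: ["policies/refunds.md#chunk-2"],
  expectedAnswer: "Customers can return items within 30 days for a full refund.",
  queryType: "factual",
  difficulty: "easy",
};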

Debugging RAG Failures

When evaluation reveals problems, debug systematically.

Continuous Evaluation

Monitoring frequency:
  • Log 100% of queries (for debugging)
  • Evaluate 10% of queries (cost management)
  • Full golden dataset evaluation weekly
  • Alert on >10% metric drops
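A sketch of the sampling and alerting logic; the helper functions, including the judge, are declared as hypothetical placeholders rather than real APIs.

const EVAL_SAMPLE_RATE = 0.1;  // evaluate 10% of production queries
const ALERT_DROP = 0.1;        // alert on a >10% relative drop vs. the rolling baseline

declare function logQuery(entry: { query: string; answer: string; context: string }): Promise<void>;
declare function recordMetric(name: string, value: number): Promise<void>;
declare function alertOnCall(message: string): void;
declare function judgeFaithfulness(context: string, answer: string): Promise<number>;

async function onQueryServed(query: string, answer: string, context: string) {
  await logQuery({ query, answer, context });           // log 100% of queries for debugging
  if (Math.random() < EVAL_SAMPLE_RATE) {               // evaluate only a sample to manage cost
    await recordMetric("faithfulness", await judgeFaithfulness(context, answer));
  }
}

function checkForRegression(metric: string, current: number, baseline: number) {
  if (current < baseline * (1 - ALERT_DROP)) {
    alertOnCall(`${metric} dropped from ${baseline.toFixed(2)} to ${current.toFixed(2)}`);
  }
}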

Practical Exercise (20 min)

Evaluate your RAG system:
// Provided: rag_system.ts (working RAG)
// Provided: test_cases.json (20 test cases)

// Your task:
// 0. Corpus: use chunks from 'Assets/paul_graham_essay.txt' for retrieval
// 1. Load test cases (or create 20 Q/A pairs about the essay if missing)
// 2. Run RAG system on all queries
// 3. Use LlamaIndex evaluations to compute retrieval metrics (Recall@5, Precision@5, NDCG@10)
// 4. Use LlamaIndex evaluations for generation (Faithfulness, Relevance)
// 5. Identify 3 worst-performing queries
// 6. Debug: Is failure in retrieval or generation?

// Expected insights:
// - Retrieval usually bottleneck (70% of failures)
// - Multi-hop queries hardest
// - Ambiguous queries need query refinement