Lexical Search (Keyword)

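How it works: Ranks documents by statistical matching on exact query terms, weighing how often a term appears in a document against how rare it is across the corpus (for example, BM25).
Strengths: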
  • Low-latency on well-indexed corpora (actual latency depends on engine, index, and hardware)
  • Excellent for exact matches and rare terms
  • No model training required
  • Interpretable results
Weaknesses:
  • Misses synonyms (“car” ≠ “automobile”)
  • Struggles with conceptual queries
  • Language-specific (needs per-language tokenization and stemming or lemmatization to match word variants)
Best for:
  • Legal document search (exact statute numbers)
  • Code search (function names, error codes)
  • Product SKU lookup
  • Any domain with precise terminology
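To make the scoring idea concrete, here is a minimal BM25 sketch. It is an illustration under stated assumptions, not a production implementation: the tokenizer, the sample documents, the function names, and the defaults k1 = 1.2 and b = 0.75 are all choices made for this example.

```javascript
// Minimal BM25 sketch. The tokenizer, sample documents, and the parameter
// defaults (k1 = 1.2, b = 0.75) are illustrative assumptions.
function tokenize(text) {
  // Lowercase and split on non-alphanumerics; no stemming, so "cars" != "car".
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function buildIndex(docs) {
  const tokenized = docs.map(tokenize);
  const docFreq = new Map(); // term -> number of documents containing it
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) {
      docFreq.set(term, (docFreq.get(term) ?? 0) + 1);
    }
  }
  const avgLen = tokenized.reduce((sum, t) => sum + t.length, 0) / docs.length;
  return { tokenized, docFreq, avgLen };
}

function bm25Search(index, query, k1 = 1.2, b = 0.75) {
  const { tokenized, docFreq, avgLen } = index;
  const n = tokenized.length;
  const queryTerms = new Set(tokenize(query));
  const scores = tokenized.map((tokens, docId) => {
    let score = 0;
    for (const term of queryTerms) {
      const df = docFreq.get(term) ?? 0;
      if (df === 0) continue; // term appears nowhere in the corpus
      const tf = tokens.filter((t) => t === term).length;
      // Smoothed IDF; the "1 +" keeps the value non-negative.
      const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
      // Longer-than-average documents are penalized, shorter ones boosted.
      const lengthNorm = 1 - b + b * (tokens.length / avgLen);
      score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    }
    return { docId, score };
  });
  // Keep only documents that matched at least one term, best first.
  return scores.filter((s) => s.score > 0).sort((x, y) => y.score - x.score);
}

const sampleDocs = [
  'The car dealership sells used cars',
  'Automobile repair and maintenance guide',
  'Error E4012 occurs when the car battery is low',
];
const sampleIndex = buildIndex(sampleDocs);
console.log(bm25Search(sampleIndex, 'car'));
console.log(bm25Search(sampleIndex, 'E4012'));
```

With these sample documents, the query "car" matches documents 0 and 2 but never the "Automobile" document, which is the synonym weakness listed above. The query "E4012" matches only document 2, which is the exact-match strength.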

Semantic Search (Vector)

How it works: Converts text into embedding vectors so that texts with similar meaning land near each other in vector space; search returns the documents whose vectors are closest to the query's vector.
Strengths:
  • Handles synonyms and paraphrasing
  • Works across languages (multilingual models)
  • Captures conceptual similarity
  • No query engineering needed
Weaknesses:
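  • Usually slower than BM25: each query must be embedded and then matched by nearest-neighbor search (roughly 100-500ms for large collections, depending on model, index type, and hardware)
  • Can miss exact matches such as IDs or codes, which carry little semantic signal
  • Black box (hard to debug why something matched)
  • Embedding a large corpus usually calls for GPUs or a hosted embedding API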
Best for:
  • Customer support (intent-based)
  • Research papers (conceptual queries)
  • Multilingual search
  • FAQ matching
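The "nearby vectors" idea can be sketched with cosine similarity. The 3-dimensional vectors below are hand-made stand-ins chosen for this example; real embeddings would come from an embedding model and typically have hundreds of dimensions. The function names are also hypothetical.

```javascript
// Toy semantic search: rank documents by cosine similarity to a query vector.
// The vectors are hand-made stand-ins for real embeddings (an assumption).
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function semanticSearch(docVectors, queryVector, topK = 3) {
  return docVectors
    .map((doc) => ({ id: doc.id, score: cosineSimilarity(doc.vector, queryVector) }))
    .sort((x, y) => y.score - x.score) // most similar first
    .slice(0, topK);
}

const docVectors = [
  { id: 'car', vector: [0.9, 0.1, 0.0] },
  { id: 'automobile', vector: [0.85, 0.15, 0.05] },
  { id: 'banana', vector: [0.0, 0.2, 0.95] },
];
const queryVector = [0.88, 0.12, 0.02]; // pretend embedding of "vehicle"
console.log(semanticSearch(docVectors, queryVector, 2));
```

Here "car" and "automobile" both rank above "banana" even though the query shares no words with either, which is what lexical search cannot do. In a real system, a brute-force scan like this is usually replaced by an approximate nearest-neighbor index.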

Hybrid Search (Lexical + Semantic)

The Production Standard: Combine lexical (keyword) and semantic (vector) search.
Why hybrid wins:
  • Catches exact matches lexical search excels at
  • Catches semantic matches vector search excels at
  • Often improves retrieval quality across diverse corpora (magnitude varies by dataset and metric)
  • Commonly used in production systems
Implementation approaches:
  1. Weighted fusion: Combine scores with learned weights
  2. Rank fusion: Merge ranked lists (Reciprocal Rank Fusion - RRF)
  3. Two-stage: Lexical first pass → semantic reranking
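Approaches 1 and 2 can be sketched in a few lines each. This is a minimal illustration, not a reference implementation: the function names and sample data are hypothetical, alpha is hand-picked here rather than learned, min-max normalization is one of several ways to make scores comparable, and k = 60 is a commonly used RRF constant assumed for this example.

```javascript
// Approach 1: weighted fusion of min-max normalized scores.
// alpha is hand-picked here; in practice it would be tuned or learned.
function minMaxNormalize(scoresById) {
  const values = [...scoresById.values()];
  const min = Math.min(...values);
  const max = Math.max(...values);
  const normalized = new Map();
  for (const [id, s] of scoresById) {
    normalized.set(id, max === min ? 1 : (s - min) / (max - min));
  }
  return normalized;
}

function weightedFusion(lexicalScores, semanticScores, alpha = 0.5) {
  const lex = minMaxNormalize(lexicalScores);
  const sem = minMaxNormalize(semanticScores);
  const ids = new Set([...lex.keys(), ...sem.keys()]);
  return [...ids]
    .map((docId) => ({
      docId,
      // A document missing from one list contributes 0 for that list.
      score: alpha * (lex.get(docId) ?? 0) + (1 - alpha) * (sem.get(docId) ?? 0),
    }))
    .sort((x, y) => y.score - x.score);
}

// Approach 2: Reciprocal Rank Fusion.
// score(d) = sum over lists of 1 / (k + rank of d), with ranks starting at 1.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((docId, position) => {
      const rank = position + 1;
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .map(([docId, score]) => ({ docId, score }))
    .sort((x, y) => y.score - x.score);
}

const lexicalRanking = ['doc-sku', 'doc-a', 'doc-b'];
const semanticRanking = ['doc-a', 'doc-c', 'doc-sku'];
console.log(reciprocalRankFusion([lexicalRanking, semanticRanking]));
```

RRF uses only rank positions, so it avoids having to reconcile BM25 scores with cosine similarities, which live on different scales. Weighted fusion keeps score magnitudes but depends on the normalization and on a good choice of alpha.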

When to Use Each Strategy

| Query Type | Best Strategy | Example |
|---|---|---|
| Exact terminology | Lexical | “ICD-10 code M54.5” (medical) |
| Product codes/IDs | Lexical | “SKU-2847-B” |
| Conceptual question | Semantic | “How do I improve sleep?” |
| Paraphrased intent | Semantic | “Can’t sign in” → password reset |
| Mixed (most production) | Hybrid | “Latest Python security updates” |

In Production

Cost Impact (rough estimates; actual cost depends on provider, model, and scale):
  • Lexical: ~$0.0001 per query (compute only, no API calls)
  • Semantic: ~$0.001-0.01 per query (embedding API + vector DB)
  • Hybrid: ~$0.002-0.015 per query (both methods)
Performance:
  • Lexical: typically lower latency at moderate scales
  • Semantic: generally higher latency than lexical; depends on index type and hardware
  • Hybrid: adds overhead; parallel execution helps
Accuracy (illustrative ranges; actual numbers vary by dataset, query mix, and relevance metric):
  • Lexical alone: 70-75% relevant results
  • Semantic alone: 72-78% relevant results
  • Hybrid: 85-92% relevant results
Recommendation: Start with hybrid unless you have strict latency requirements (<50ms) or a very clear use case for lexical-only search.
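The point about parallel execution can be sketched as follows. The two backends are simulated with timers, and their delays and results are made-up values for illustration; the function names are hypothetical as well.

```javascript
// Simulated backends: delays and results are made-up values (an assumption).
const delay = (ms, value) => new Promise((resolve) => setTimeout(() => resolve(value), ms));
const fakeLexicalSearch = (query) => delay(10, ['doc-1', 'doc-2']);
const fakeSemanticSearch = (query) => delay(40, ['doc-2', 'doc-3']);

async function hybridSearchParallel(query) {
  const started = Date.now();
  // Both searches start immediately, so the total wait is roughly the
  // slower of the two rather than their sum.
  const [lexical, semantic] = await Promise.all([
    fakeLexicalSearch(query),
    fakeSemanticSearch(query),
  ]);
  // Naive merge for illustration: de-duplicated union, lexical results first.
  const merged = [...new Set([...lexical, ...semantic])];
  return { merged, elapsedMs: Date.now() - started };
}

hybridSearchParallel('reset password').then((result) => console.log(result));
```

Awaiting the two searches one after the other would instead add their latencies together, which matters most when the latency budget is tight.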

Practical Exercise

Implement and compare all three approaches on the same dataset:
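// Pseudocode: comparison task logic.
// chunkDocument, searchLexical, searchSemantic, searchHybrid, and
// logStrategyWins are placeholders for you to implement.
import { readFile } from 'node:fs/promises';

async function compareSearchStrategies() {
  // 1. Data preparation: load the essay and split it into chunks
  const essay = await readFile('Assets/paul_graham_essay.txt', 'utf8');
  const corpus = chunkDocument(essay);

  // 2. Define benchmark: replace each placeholder with real queries of that type
  const queries = [
    "exact phrases",        // Targets BM25
    "thematic concepts",    // Targets Semantic
    "mixed queries"         // Targets Hybrid
  ];

  // 3. Execution & comparison: run all three strategies on every query
  for (const query of queries) {
    // Promise.all also accepts plain values, so this works whether or not
    // a given search function is async
    const [lexical, semantic, hybrid] = await Promise.all([
      searchLexical(corpus, query),
      searchSemantic(corpus, query),
      searchHybrid(corpus, query),
    ]);

    // 4. Analysis: record which strategy did best for this query
    logStrategyWins({ query, lexical, semantic, hybrid });
  }
}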
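The exercise does not say how to decide which strategy "wins". One possible approach, assuming you hand-label the relevant chunk IDs for each query, is to compare precision@k. The function names, chunk IDs, and labels below are hypothetical.

```javascript
// Precision@k: fraction of the top-k returned chunk IDs that are labeled relevant.
// Assumes k > 0 and that relevantIds is a Set of hand-labeled chunk IDs.
function precisionAtK(rankedIds, relevantIds, k) {
  const hits = rankedIds.slice(0, k).filter((id) => relevantIds.has(id)).length;
  return hits / k;
}

// Scores every strategy on one query and returns them best first.
function rankStrategies(resultsByStrategy, relevantIds, k = 3) {
  return Object.entries(resultsByStrategy)
    .map(([strategy, rankedIds]) => ({
      strategy,
      precision: precisionAtK(rankedIds, relevantIds, k),
    }))
    .sort((x, y) => y.precision - x.precision);
}

// Hypothetical results for a single query.
const relevantChunks = new Set(['c1', 'c4']);
const ranking = rankStrategies(
  {
    lexical: ['c1', 'c2', 'c3'],
    semantic: ['c4', 'c5', 'c6'],
    hybrid: ['c1', 'c4', 'c2'],
  },
  relevantChunks
);
console.log(ranking);
```

In this made-up example the hybrid list contains both relevant chunks in its top 3, while each single strategy finds only one. Averaging precision@k over many queries gives a more stable comparison than any single query.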