You can’t improve what you don’t measure. This page covers evaluation datasets, A/B testing, and systematic prompt optimization.

You Can’t Improve What You Don’t Measure

Many teams iterate on prompts by “vibes” (does the output look good?) or by fixing one scenario at a time, without regression-testing earlier cases. That doesn’t scale.

The Production Process:
1. Define success criteria

2. Create evaluation dataset

3. Test current prompt

4. Analyze failures

5. Modify prompt

6. Re-test → repeat
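The test-and-analyze steps of this loop can be sketched as a small harness. This is a minimal illustration, not a specific tool: `classify_sentiment` is a hypothetical stand-in for the prompt under test, and the dataset shape is an assumption.

```python
# Minimal evaluation harness: run the prompt over a labeled dataset,
# compute a success rate, and collect failures for the "analyze" step.

def classify_sentiment(message: str) -> str:
    # Stand-in for a real LLM call; replace with your model client.
    return "positive" if "!" in message else "negative"

eval_set = [
    {"customer_message": "Best purchase ever!", "sentiment": "positive"},
    {"customer_message": "Broke after one week.", "sentiment": "negative"},
]

def run_eval(predict, dataset):
    failures = []
    for case in dataset:
        got = predict(case["customer_message"])
        if got != case["sentiment"]:
            failures.append({**case, "predicted": got})
    success_rate = 1 - len(failures) / len(dataset)
    return success_rate, failures  # failures feed the "analyze failures" step

success_rate, failures = run_eval(classify_sentiment, eval_set)
```

After each prompt modification, re-run `run_eval` on the same dataset so earlier wins don't silently regress.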
Example: Customer Sentiment Classification

AI Evaluation Tools

Several tools can help you evaluate your prompts:
  • Open Source: LangFuse, Inspect AI, Phoenix, Opik
  • Commercial: Braintrust, Langsmith, Arize, AgentOps
Programmatic Prompt Optimization: The manual cycle above (test → analyze → modify → re-test) can be automated. Frameworks like DSPy replace hand-written prompts with code — you define what you want, and an optimizer finds the best prompt wording for you.
import dspy

# Step 1: Define what you want (not how to prompt)
classify = dspy.ChainOfThought("customer_message -> sentiment")

# Step 2: Provide training examples and a metric
trainset = [
    dspy.Example(customer_message="Best purchase ever!", sentiment="positive").with_inputs("customer_message"),
    dspy.Example(customer_message="Broke after one week.", sentiment="negative").with_inputs("customer_message"),
]

def exact_match(example, prediction, trace=None):
    return example.sentiment == prediction.sentiment

# Step 3: Let the optimizer find the best prompt
optimizer = dspy.MIPROv2(metric=exact_match, auto="light")
optimized_classify = optimizer.compile(classify, trainset=trainset)

# DSPy automatically generates instructions, selects few-shot examples,
# and tunes the prompt — no manual tweaking required
When to Use DSPy:
  • You have an evaluation metric and training examples
  • You’re tired of manually tweaking prompt wording
  • You need to re-optimize when switching models
  • Your pipeline has multiple chained LLM calls
DSPy is particularly valuable when the manual test-modify-retest cycle becomes a bottleneck. Learn more in the DSPy introduction guide.

A/B Testing Prompts

Production Pattern: Gradual Rollout

Don’t deploy a new prompt to 100% of users immediately.

Metrics to Track:
  • Task success rate
  • User satisfaction (thumbs up/down)
  • Response time
  • Cost per request
  • Error rate
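These metrics can be aggregated from per-request logs. A minimal sketch, assuming a hypothetical record shape (`success`, `rated`, `thumbs_up`, `latency_s`, `cost_usd`, `error` fields — your logging schema will differ):

```python
# Roll up the tracked metrics from a list of per-request log records.
def summarize(requests):
    n = len(requests)
    rated = sum(r["rated"] for r in requests)
    return {
        "success_rate": sum(r["success"] for r in requests) / n,
        # Satisfaction is computed only over requests that received a rating.
        "satisfaction": sum(r["thumbs_up"] for r in requests if r["rated"]) / max(1, rated),
        "avg_latency": sum(r["latency_s"] for r in requests) / n,
        "cost_per_request": sum(r["cost_usd"] for r in requests) / n,
        "error_rate": sum(r["error"] for r in requests) / n,
    }
```

Computing one summary per prompt variant gives you exactly the comparison table shown below.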
Analysis After 1000 Requests:
const results = {
    v2_prompt: {
        success_rate: 0.87,
        avg_latency: 1.2,
        cost_per_request: 0.05,
        satisfaction: 0.82
    },
    v3_prompt: {
        success_rate: 0.93,  // Better!
        avg_latency: 1.4,    // Slightly slower
        cost_per_request: 0.07,  // Slightly more expensive
        satisfaction: 0.89   // Much better!
    }
};

// Decision: v3 wins → roll out to 50%, then 100%
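One common way to implement the gradual rollout (a sketch of the general pattern, not this page's specific setup): hash each user ID into a stable bucket so a given user always sees the same variant, then raise the rollout percentage over time.

```python
import hashlib

def assign_variant(user_id: str, rollout_pct: int) -> str:
    """Deterministically bucket a user into 0-99; same user, same variant."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v3_prompt" if bucket < rollout_pct else "v2_prompt"

# Raising rollout_pct from 10 -> 50 -> 100 moves more users onto v3
# without reshuffling anyone who already received it.
```

Deterministic bucketing matters for the metrics above: if users flip between variants mid-session, satisfaction and success-rate comparisons get contaminated.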