You can’t improve what you don’t measure. This page covers evaluation datasets, A/B testing, and systematic prompt optimization.
You Can’t Improve What You Don’t Measure
Many teams iterate on prompts by “vibes” (does the output look good?) or by fixing one scenario at a time without checking whether the fix caused regressions elsewhere. That doesn’t scale.
The Production Process:
1. Define success criteria
↓
2. Create evaluation dataset
↓
3. Test current prompt
↓
4. Analyze failures
↓
5. Modify prompt
↓
6. Re-test → repeat
Example: Customer Sentiment Classification:
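A minimal sketch of steps 2 to 4 for sentiment classification might look like the following. The dataset rows, the label set, and `classify_sentiment` are all hypothetical; the keyword-based function only stands in for a real LLM call so the loop can run on its own.

```python
# Hypothetical labeled evaluation set: (customer message, expected sentiment).
EVAL_SET = [
    ("Best purchase ever!", "positive"),
    ("Broke after one week.", "negative"),
    ("It arrived on Tuesday.", "neutral"),
    ("Great product, but shipping took a month.", "mixed"),
]


def classify_sentiment(message: str) -> str:
    """Stand-in for the real LLM call; replace with your prompt and model."""
    text = message.lower()
    if "best" in text or "great" in text:
        return "positive"
    if "broke" in text:
        return "negative"
    return "neutral"


def evaluate(classifier, dataset):
    """Run the classifier over the dataset; return accuracy and the failures."""
    failures = []
    for message, expected in dataset:
        predicted = classifier(message)
        if predicted != expected:
            failures.append(
                {"message": message, "expected": expected, "predicted": predicted}
            )
    accuracy = 1 - len(failures) / len(dataset)
    return accuracy, failures


accuracy, failures = evaluate(classify_sentiment, EVAL_SET)
print(f"accuracy: {accuracy:.2f}")
for failure in failures:
    print(failure)  # inspect these before modifying the prompt
```

Re-running the same `evaluate` call after every prompt change is what turns a one-off fix into a regression check.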
AI Evaluation Tools:
Several tools can help you evaluate your prompts:
- Open Source: Langfuse, Inspect AI, Phoenix, Opik
- Commercial: Braintrust, LangSmith, Arize, AgentOps
Programmatic Prompt Optimization:
The manual cycle above (test → analyze → modify → re-test) can be automated. Frameworks like DSPy replace hand-written prompts with code: you declare the inputs and outputs you want, supply examples and a metric, and an optimizer searches for the prompt wording that scores best on that metric.
import dspy
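# Assumption: the model name is illustrative; configure any LM that DSPy supports
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
# Step 1: Define what you want (not how to prompt)
classify = dspy.ChainOfThought("customer_message -> sentiment")
# Step 2: Provide training examples and a metric
trainset = [
    dspy.Example(customer_message="Best purchase ever!", sentiment="positive").with_inputs("customer_message"),
    dspy.Example(customer_message="Broke after one week.", sentiment="negative").with_inputs("customer_message"),
    # In practice, provide many more labeled examples than these two
]
def exact_match(example, prediction, trace=None):
    # True when the predicted label equals the expected label
    return example.sentiment == prediction.sentiment
# Step 3: Let the optimizer search for the best prompt
optimizer = dspy.MIPROv2(metric=exact_match, auto="light")
optimized_classify = optimizer.compile(classify, trainset=trainset)
# DSPy generates candidate instructions, selects few-shot examples,
# and keeps the combination that scores best on the metric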
When to Use DSPy:
- You have an evaluation metric and training examples
- You’re tired of manually tweaking prompt wording
- You need to re-optimize when switching models
- Your pipeline has multiple chained LLM calls
DSPy is particularly valuable when the manual test-modify-retest cycle becomes a bottleneck. Learn more in the DSPy introduction guide.
A/B Testing Prompts
Production Pattern: Gradual Rollout:
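Don’t deploy a new prompt to 100% of users immediately. Route a small share of traffic to it first, compare its metrics against the current prompt, and expand the share only if the new prompt holds up.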
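One common way to split traffic is to hash a stable user identifier into a bucket, so each user always sees the same prompt version. The sketch below assumes string user IDs and the version names `v2_prompt` (current) and `v3_prompt` (new); `assign_prompt_version` is a hypothetical helper, not part of any library.

```python
import hashlib


def assign_prompt_version(user_id: str, rollout_percent: int) -> str:
    """Deterministically route a user to the new or the current prompt."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "v3_prompt" if bucket < rollout_percent else "v2_prompt"


# Start small, then raise rollout_percent as the metrics hold up.
for user_id in ["user-1", "user-2", "user-3"]:
    print(user_id, assign_prompt_version(user_id, rollout_percent=10))
```

Because the bucket depends only on the user ID, raising `rollout_percent` adds users to the new prompt without moving anyone who already had it back to the old one.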
Metrics to Track:
- Task success rate
- User satisfaction (thumbs up/down)
- Response time
- Cost per request
- Error rate
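These metrics can be computed per prompt version from request logs. The following is a rough sketch; the `RequestLog` fields and the sample values are assumptions made for illustration, not a schema defined on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestLog:
    prompt_version: str
    succeeded: bool              # did the task succeed?
    thumbs_up: Optional[bool]    # None when the user gave no feedback
    latency_s: float
    cost_usd: float
    errored: bool


def summarize(logs):
    """Aggregate one prompt version's logs into the tracked metrics."""
    n = len(logs)
    rated = [log for log in logs if log.thumbs_up is not None]
    return {
        "success_rate": sum(log.succeeded for log in logs) / n,
        # Satisfaction only counts requests where the user actually voted.
        "satisfaction": (
            sum(log.thumbs_up for log in rated) / len(rated) if rated else None
        ),
        "avg_latency": sum(log.latency_s for log in logs) / n,
        "cost_per_request": sum(log.cost_usd for log in logs) / n,
        "error_rate": sum(log.errored for log in logs) / n,
    }


# Made-up sample logs for a single prompt version.
sample_logs = [
    RequestLog("v3_prompt", True, True, 1.2, 0.06, False),
    RequestLog("v3_prompt", True, None, 1.6, 0.08, False),
    RequestLog("v3_prompt", False, False, 1.4, 0.07, True),
    RequestLog("v3_prompt", True, True, 1.4, 0.07, False),
]
summary = summarize(sample_logs)
print(summary)
```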
Analysis After 1000 Requests:
const results = {
  v2_prompt: {
    success_rate: 0.87,
    avg_latency: 1.2,
    cost_per_request: 0.05,
    satisfaction: 0.82
  },
  v3_prompt: {
    success_rate: 0.93,      // Better!
    avg_latency: 1.4,        // Slightly slower
    cost_per_request: 0.07,  // Slightly more expensive
    satisfaction: 0.89       // Much better!
  }
};
// Decision: v3 wins → roll out to 50%, then 100%
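Before acting on a difference like 0.87 versus 0.93, it can help to check that it is unlikely to be noise. The sketch below applies a standard two-proportion z-test to the success rates. It assumes the 1000 requests were split evenly, 500 per prompt, which the numbers above do not state; `two_proportion_z_test` is a hypothetical helper written for this example.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two success rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value


# Assumption: 500 requests per prompt, so 0.87 -> 435 and 0.93 -> 465 successes.
z, p_value = two_proportion_z_test(435, 500, 465, 500)
print(f"z = {z:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference in success rate is unlikely to be chance alone.")
```

A significant gain in success rate still has to be weighed against the higher latency and cost, which this test does not consider.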