Versioning and Change Management
The Problem: You improve your prompt. It works great in testing. You deploy it. Customer complaints spike.

Why It Happens:
- Your test set doesn't cover the real input distribution
- Edge cases appear only in production
- Model updates can silently break prompts
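One mitigation is to version prompts explicitly, so every request logs which prompt version produced its output and a bad deploy can be rolled back instantly. A minimal sketch of the idea, assuming a simple in-memory registry (the `PromptRegistry` name and its methods are illustrative, not a real library):

```python
import hashlib

class PromptRegistry:
    """Illustrative in-memory prompt version store with instant rollback."""

    def __init__(self):
        self._versions = {}   # name -> {version_id: template}
        self._active = {}     # name -> currently active version_id

    def register(self, name: str, template: str) -> str:
        # Content-hash the template so identical prompts get identical IDs.
        version = hashlib.sha256(template.encode()).hexdigest()[:8]
        self._versions.setdefault(name, {})[version] = template
        return version

    def activate(self, name: str, version: str) -> None:
        if version not in self._versions.get(name, {}):
            raise KeyError(f"unknown version {version} for prompt {name!r}")
        self._active[name] = version

    def get(self, name: str) -> tuple[str, str]:
        # Return (version, template) so the version can be logged per request.
        version = self._active[name]
        return version, self._versions[name][version]

registry = PromptRegistry()
v1 = registry.register("summarize", "Summarize the text:\n{text}")
registry.activate("summarize", v1)
v2 = registry.register("summarize", "Summarize in one sentence:\n{text}")
registry.activate("summarize", v2)   # deploy the "improved" prompt
registry.activate("summarize", v1)   # complaints spike -> roll back
```

In production you would back this with a database or config service rather than memory, but the key property is the same: rollback is a pointer flip, not a redeploy.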
Monitoring and Observability
What to Track:
- Requests per minute
- Average latency (p50, p95, p99)
- Cost per request
- Error rate
- User satisfaction (thumbs up/down)
- Cache hit rate
- Model distribution (if cascading)
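The metrics above can be accumulated with a small in-process collector before wiring up a full observability stack. A sketch under simple assumptions (nearest-rank percentiles, in-memory lists; the `RequestMetrics` name is illustrative):

```python
import statistics
from dataclasses import dataclass, field

@dataclass
class RequestMetrics:
    """Accumulates per-request metrics; percentiles computed on demand."""
    latencies_ms: list = field(default_factory=list)
    costs_usd: list = field(default_factory=list)
    errors: int = 0
    cache_hits: int = 0

    def record(self, latency_ms: float, cost_usd: float,
               error: bool = False, cache_hit: bool = False) -> None:
        self.latencies_ms.append(latency_ms)
        self.costs_usd.append(cost_usd)
        self.errors += error        # bool counts as 0/1
        self.cache_hits += cache_hit

    def summary(self) -> dict:
        lat = sorted(self.latencies_ms)
        n = len(lat)
        # Nearest-rank percentile: fine for a dashboard, not for SLO math.
        pct = lambda p: lat[min(n - 1, int(p / 100 * n))]
        return {
            "requests": n,
            "p50_ms": pct(50), "p95_ms": pct(95), "p99_ms": pct(99),
            "avg_cost_usd": statistics.mean(self.costs_usd),
            "error_rate": self.errors / n,
            "cache_hit_rate": self.cache_hits / n,
        }
```

The tools listed below give you the same numbers with tracing, dashboards, and retention; this sketch is just the shape of what they track.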
Tools:
- Open Source: Phoenix, Langfuse, Opik
- Commercial: Arize, AgentOps
- Product-Specific: LangSmith (for LangChain and LangGraph applications)
The Production Checklist
Before deploying any LLM feature:

Testing:
- Evaluation dataset created (100+ examples)
- Accuracy meets requirements (>90%)
- Edge cases tested
- Failure modes documented

Cost:
- Cost per request measured
- Caching implemented where possible
- Model selection optimized
- Budget alerts configured

Security:
- Input validation in place
- Output validation in place
- Prompt injection defenses tested
- Fallback behavior defined

Monitoring:
- Metrics logging configured
- Alerts set up
- Dashboard created
- On-call runbook written

Rollout:
- A/B testing framework ready
- Rollout plan defined (10% → 50% → 100%)
- Rollback procedure documented
- Customer communication prepared
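The staged rollout (10% → 50% → 100%) is usually implemented by hashing a stable user identifier into a bucket, so the same user sees the same variant across requests and only gains the feature as the percentage grows. A minimal sketch (function and feature names are illustrative):

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministic percentage rollout.

    Hashing (feature, user_id) gives each user a stable bucket in 0-99.
    Because the cutoff only ever increases (10 -> 50 -> 100), users who
    were enrolled at a lower percentage stay enrolled at a higher one.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Gate the new prompt behind the rollout flag:
# prompt = NEW_PROMPT if in_rollout(user.id, "prompt-v2", 10) else OLD_PROMPT
```

Per-feature hashing matters: it decorrelates rollouts, so the 10% cohort for one experiment is not the same users as the 10% cohort for every other experiment.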