
Versioning and Change Management

The Problem: You improve your prompt. It works great in testing. You deploy it. Customer complaints spike.
Why It Happens:
  • Test set doesn’t cover real distribution
  • Edge cases appear in production
  • Model updates can break prompts
Production Pattern: Prompt Registry. Keep every prompt version in a registry and resolve the version at request time, so you can A/B test a new version and roll back without redeploying code.
/**
 * Load a prompt by name, with A/B testing support.
 * loadPrompts() (the registry, e.g. parsed from a config file) and
 * shouldTest() (deterministic user bucketing) are assumed to be defined elsewhere.
 */
function getPrompt(name: string, userId?: string): string {
    const config = loadPrompts()[name];

    // Route a slice of users to the candidate version (v3);
    // everyone else stays on the current stable version (v2).
    let version: string;
    if (userId && shouldTest(userId, config.v3.rollout)) {
        version = "v3";
    } else {
        version = "v2";
    }

    return config[version].prompt;
}

Monitoring and Observability

What to Track:
async function generateResponse(prompt: string, userId: string): Promise<string> {
    // Check the incoming prompt for injection attempts before spending tokens on it.
    // (alert() here stands in for whatever alerting/paging hook you use.)
    if (detectPromptInjection(prompt)) {
        alert("Prompt injection attempt");
    }

    const startTime = Date.now();

    const response = await llm.generate(prompt);

    // Log per-request metrics. llm, logMetrics, countTokens, calculateCost,
    // getVersion, and the detect* helpers are assumed to be defined elsewhere.
    logMetrics({
        user_id: userId,
        prompt_version: getVersion(prompt),
        latency: Date.now() - startTime,
        input_tokens: countTokens(prompt),
        output_tokens: countTokens(response),
        cost: calculateCost(prompt, response),
        timestamp: new Date()
    });

    // Flag suspect outputs for human review.
    if (detectHallucination(response)) {
        alert("Possible hallucination detected");
    }

    return response;
}
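Of the helpers referenced above, calculateCost is the one teams most often hand-roll. A minimal sketch, assuming flat per-million-token pricing (the rates below are placeholders, not any provider's actual prices) and a rough character-based token estimate in place of a real tokenizer:

// Placeholder rates in USD per million tokens; substitute your provider's published pricing.
const INPUT_USD_PER_MTOK = 3.0;
const OUTPUT_USD_PER_MTOK = 15.0;

// Rough heuristic (~4 characters per token); swap in your model's tokenizer for real numbers.
function countTokens(text: string): number {
    return Math.ceil(text.length / 4);
}

function calculateCost(prompt: string, response: string): number {
    const inputCost = countTokens(prompt) * INPUT_USD_PER_MTOK / 1_000_000;
    const outputCost = countTokens(response) * OUTPUT_USD_PER_MTOK / 1_000_000;
    return inputCost + outputCost;
}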
Dashboard Metrics:
  • Requests per minute
  • Latency percentiles (p50, p95, p99; see the sketch after this list)
  • Cost per request
  • Error rate
  • User satisfaction (thumbs up/down)
  • Cache hit rate
  • Model distribution (if cascading)
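The latency percentiles in that list can be computed straight from the logged metrics. A minimal sketch using the nearest-rank method, assuming you can pull a recent window of latency values (in ms) out of whatever store logMetrics writes to:

// Compute a percentile (e.g. 0.5, 0.95, 0.99) from a window of latencies in ms.
function percentile(latencies: number[], p: number): number {
    if (latencies.length === 0) return 0;
    const sorted = [...latencies].sort((a, b) => a - b);
    const index = Math.ceil(p * sorted.length) - 1;
    return sorted[Math.min(sorted.length - 1, Math.max(0, index))];
}

// Example: summarize the last window of requests for the dashboard.
const latencies = [820, 950, 1100, 2400, 760, 1300, 980];
console.log({
    p50: percentile(latencies, 0.5),
    p95: percentile(latencies, 0.95),
    p99: percentile(latencies, 0.99),
});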
AI Observability Tools: Several tools can help you implement comprehensive monitoring:
  • Open Source: Phoenix, LangFuse, Opik
  • Commercial: Arize, AgentOps
  • Product-Specific: LangSmith (for LangChain and LangGraph applications)
These tools provide features like prompt versioning, cost tracking, latency monitoring, and quality metrics out of the box.

The Production Checklist

Before deploying any LLM feature:
Testing:
  • Evaluation dataset created (100+ examples)
  • Accuracy meets requirements (>90%)
  • Edge cases tested
  • Failure modes documented
Cost:
  • Cost per request measured
  • Caching implemented where possible
  • Model selection optimized
  • Budget alerts configured (see the sketch after this checklist)
Safety:
  • Input validation in place
  • Output validation in place
  • Prompt injection defenses tested
  • Fallback behavior defined
Observability:
  • Metrics logging configured
  • Alerts set up
  • Dashboard created
  • On-call runbook written
Deployment:
  • A/B testing framework ready
  • Rollout plan defined (10% → 50% → 100%)
  • Rollback procedure documented
  • Customer communication prepared
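For the budget alerts item under Cost, a minimal sketch of a request-level spend guard, assuming a single process and the same alert() placeholder used in the monitoring code above (a real deployment would persist the counter and reset it on a schedule):

// Hypothetical daily budget in USD; tune to your actual spend tolerance.
const DAILY_BUDGET_USD = 50;

let spentTodayUsd = 0;
let alerted = false;

// Call with the cost of each request (e.g. the value passed to logMetrics above).
function recordSpend(costUsd: number): void {
    spentTodayUsd += costUsd;
    if (!alerted && spentTodayUsd > DAILY_BUDGET_USD) {
        alerted = true; // alert once per day rather than on every subsequent request
        alert(`Daily LLM budget exceeded: $${spentTodayUsd.toFixed(2)} of $${DAILY_BUDGET_USD}`);
    }
}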