
Model Selection Decision Framework

The Model Landscape (January 2025):
Model | Context | Cost (input / output per 1M tokens) | Best For
GPT-4 Turbo | 128K | $10 / $30 | Complex reasoning, structured output
GPT-3.5 Turbo | 16K | $0.50 / $1.50 | Simple tasks, high volume
Claude Sonnet 4.5 | 200K | $3 / $15 | Long documents, nuanced analysis
Claude Haiku | 200K | $0.25 / $1.25 | Fast classification, simple extraction
Gemini Pro 1.5 | 2M | $1.25 / $5 | Massive context, multimodal
Llama 3 70B | Varies | Self-hosted | On-premise requirements
Decision Tree:
Is the task simple (classification, basic extraction)?
├─ Yes → Use fastest/cheapest (Haiku, GPT-3.5)
└─ No → Continue

Do you need >100K tokens of context?
├─ Yes → Claude Sonnet or Gemini Pro
└─ No → Continue

Do you need structured JSON output?
├─ Yes → GPT-4
└─ No → Claude Sonnet (better prose)

Is cost critical (high volume)?
├─ Yes → Consider model cascading (section 5.3)
└─ No → Use best model for quality
As of publication. Verify latest pricing/context on vendor pages: OpenAI pricing, Anthropic pricing, and Google Gemini pricing/models.
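The tree above can be encoded as a small routing helper. This is a minimal sketch; the Task fields and the returned model names are illustrative placeholders, not vendor identifiers to ship unverified.

// Minimal sketch: the decision tree as a routing function.
// Task shape and model names are illustrative assumptions.
interface Task {
  isSimple: boolean;            // classification, basic extraction
  contextTokens: number;        // estimated prompt size
  needsStructuredJson: boolean;
  highVolume: boolean;          // cost-critical workload
}

function selectModel(task: Task): string {
  if (task.isSimple) return "claude-haiku";                 // or gpt-3.5-turbo
  if (task.contextTokens > 100_000) return "claude-sonnet"; // or gemini-1.5-pro
  if (task.needsStructuredJson) return "gpt-4-turbo";
  if (task.highVolume) return "cascade";                    // see section 5.3
  return "claude-sonnet";                                   // better prose
}

// Example: 120K-token contract review, no strict JSON requirement
console.log(selectModel({ isSimple: false, contextTokens: 120_000, needsStructuredJson: false, highVolume: false }));
// -> "claude-sonnet"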

Prompt Caching: 50-90% Cost Reduction

The Problem: You’re sending the same 50K token knowledge base with EVERY request.
// Without caching
for (const query of queries) {  // 10,000 queries
    const prompt = `${knowledgeBase}\n\nQuery: ${query}`;  // 50K + 100 tokens
    const response = await llm.generate(prompt);
}

// Cost: 10,000 * 50,100 tokens = 501M input tokens
// At $3/1M: $1,503
The Solution: Prompt Caching
Mark reusable parts of your prompt for caching; a provider-level sketch follows the economics below.
Production Economics:
// Scenario: Customer support chatbot
// - Knowledge base: 50K tokens
// - Avg conversation: 5 queries
// - Cache hit rate: 80% (users ask follow-ups quickly)

without_caching = 10_000 queries * 50K tokens * $3/1M = $1,500
with_caching = (
    (10_000 * 0.2) * 50K   // Cache misses (20%): billed at full price
    + (10_000 * 0.8) * 5K  // Cache hits (80%): reads billed at ~10%, so 50K counts as ~5K
) * $3/1M = $420

savings = $1,080 (72% reduction)
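To make "mark reusable parts for caching" concrete, here is a minimal sketch using Anthropic's Messages API, where a cache_control block flags the static knowledge base as cacheable while the user query stays uncached. The model id and the knowledgeBase/query variables are placeholders; OpenAI and Gemini offer their own (partly automatic) caching, so verify against each vendor's docs.

// Minimal sketch (Anthropic Messages API): the knowledge base is marked
// cacheable with cache_control; the user query is the dynamic, uncached part.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function answer(knowledgeBase: string, query: string) {
  return client.messages.create({
    model: "claude-sonnet-4-5",               // illustrative model id; check vendor docs
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: knowledgeBase,                   // static 50K-token block, reused across requests
        cache_control: { type: "ephemeral" },  // flag this block for caching
      },
    ],
    messages: [{ role: "user", content: query }], // changes every request: not cached
  });
}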
Best Practices:
  1. Cache static content (knowledge bases, system prompts)
  2. Don’t cache user input (changes every request)
  3. Structure prompts with cacheable parts first
  4. Monitor cache hit rates (see the monitoring sketch after this list)
  5. Adjust query patterns to maximize cache hits
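For best practice 4, a minimal monitoring sketch, assuming Anthropic-style usage metadata (input_tokens, cache_read_input_tokens); other providers report analogous counters under different names.

// Minimal sketch: track cache hit rate from per-response usage metadata.
// Field names follow Anthropic's reported usage; adapt for other providers.
let cachedTokens = 0;
let freshTokens = 0;

function recordUsage(usage: { input_tokens: number; cache_read_input_tokens?: number }) {
  cachedTokens += usage.cache_read_input_tokens ?? 0;
  freshTokens += usage.input_tokens;
}

function cacheHitRate(): number {
  const total = cachedTokens + freshTokens;
  return total === 0 ? 0 : cachedTokens / total; // fraction of prompt tokens served from cache
}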

Model Cascading: Using Cheap Models First

The Strategy:
  • Try cheap/fast model first
  • If uncertain, escalate to expensive/smart model
  • Can reduce cost while maintaining quality when confidence gating is reliable
Implementation: a confidence-gated sketch appears at the end of this section, after the comparison table.
When Cascading Works:
  • High-volume, similar tasks
  • Clear confidence signals (some models provide log probabilities)
  • Cost pressure but quality requirements
When to Avoid:
  • Low latency requirements (cascading adds delay)
  • Tasks where confidence is hard to measure
  • Low volume (not worth complexity)
In Production:
Factor | Single model | Cascading
Accuracy | Stable, predictable | Depends on routing quality
Latency | Lower (one call) | Higher (fallback adds calls)
Cost | Higher per call | Lower on average if many low-cost wins
Complexity | Lower | Higher (routing, monitoring)
Choose cascading when cost pressure is high and confidence signals are trustworthy; otherwise prefer simplicity.
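Implementation sketch referenced above: a minimal confidence-gated cascade. The Classifier signature, the two model roles, and the 0.9 threshold are illustrative assumptions; in practice the confidence signal might come from log probabilities or a calibrated self-rating.

// Minimal sketch of confidence-gated cascading. The classifier interface and
// threshold are illustrative assumptions, not a specific vendor API.
interface ModelResult {
  label: string;
  confidence: number; // e.g. derived from log probabilities or a self-rating
}

type Classifier = (input: string) => Promise<ModelResult>;

async function cascade(
  input: string,
  cheapModel: Classifier,   // e.g. Haiku / GPT-3.5
  strongModel: Classifier,  // e.g. Sonnet / GPT-4
  threshold = 0.9,
): Promise<ModelResult> {
  const first = await cheapModel(input);
  if (first.confidence >= threshold) {
    return first;                // cheap win: no escalation
  }
  return strongModel(input);     // uncertain: escalate (adds latency and cost)
}

// Track the escalation rate in production: if most requests escalate,
// cascading adds latency and complexity without saving money.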

TOON for Token-Efficient Context

Why: For uniform arrays of objects with primitive fields, TOON reduces token usage (often 30-60% vs JSON) and is easy for LLMs to parse.
When TOON excels:
  • Uniform tabular arrays (same keys, primitive values).
  • Large lists where repeated JSON keys dominate cost.
When to prefer JSON:
  • Mixed/nested structures, varying field sets, or complex types.
Example (input as TOON); see also the full runnable TOON notebook:
items[2]{sku,name,qty,price}:
  A1,Widget,2,9.99
  B2,Gadget,1,14.5
Prompting the model to output TOON:
Data is in TOON (2-space indent; arrays show [N]{fields}).
Return ONLY TOON with the same header; set [N] to match rows.
Practical exercise: Convert your JSON examples to TOON and compare input token counts and task accuracy. See: TOON repository and official site/spec
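For the exercise, a minimal conversion sketch, assuming a uniform array of flat objects with primitive values; it ignores quoting, escaping, and nesting, so prefer the official TOON tooling linked above for real data.

// Minimal sketch: serialize a uniform array of flat objects to TOON-style text.
// No escaping or nested values; use the official tooling for production data.
function toToon(name: string, rows: Record<string, string | number>[]): string {
  if (rows.length === 0) return `${name}[0]{}:`;
  const fields = Object.keys(rows[0]);
  const header = `${name}[${rows.length}]{${fields.join(",")}}:`;
  const lines = rows.map((row) => "  " + fields.map((f) => String(row[f])).join(","));
  return [header, ...lines].join("\n");
}

const items = [
  { sku: "A1", name: "Widget", qty: 2, price: 9.99 },
  { sku: "B2", name: "Gadget", qty: 1, price: 14.5 },
];
console.log(toToon("items", items));
// items[2]{sku,name,qty,price}:
//   A1,Widget,2,9.99
//   B2,Gadget,1,14.5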