Model Selection Decision Framework
The Model Landscape (January 2025):

| Model | Context | Cost (USD per 1M output tokens) | Best For |
|---|---|---|---|
| GPT-4 Turbo | 128K | $30 | Complex reasoning, structured output |
| GPT-3.5 Turbo | 16K | $1.50 | Simple tasks, high volume |
| Claude Sonnet 4.5 | 200K | $15 | Long documents, nuanced analysis |
| Claude Haiku | 200K | $1.25 | Fast classification, simple extraction |
| Gemini Pro 1.5 | 2M | $5 | Massive context, multimodal |
| Llama 3 70B | Varies | Self-hosted | On-premise requirements |
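In code, a table like this often collapses to a simple routing map. A minimal sketch, where the model IDs and task categories are illustrative placeholders rather than a fixed taxonomy:

```python
# Routing table derived from the comparison above; adjust IDs and
# categories to match your provider and workload.
MODEL_FOR_TASK = {
    "complex_reasoning":   "gpt-4-turbo",
    "high_volume_simple":  "gpt-3.5-turbo",
    "long_documents":      "claude-sonnet-4-5",
    "fast_classification": "claude-haiku",
    "massive_context":     "gemini-1.5-pro",
    "on_premise":          "llama-3-70b",
}

def pick_model(task_type: str) -> str:
    # Fall back to the cheap general-purpose tier for unknown task types.
    return MODEL_FOR_TASK.get(task_type, "gpt-3.5-turbo")
```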
Prompt Caching: 50-90% Cost Reduction
The Problem: You’re sending the same 50K-token knowledge base with EVERY request.

Best practices:
- Cache static content (knowledge bases, system prompts)
- Don’t cache user input (changes every request)
- Structure prompts with cacheable parts first
- Monitor cache hit rates
- Adjust query patterns to maximize cache hits
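With Anthropic’s API, for example, the cacheable prefix is marked explicitly with `cache_control`. A minimal sketch, where the model ID and file path are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

KNOWLEDGE_BASE = open("kb.txt").read()  # the large static document

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            # Static content first, marked cacheable. Requests that reuse
            # this exact prefix hit the cache instead of re-processing it.
            {
                "type": "text",
                "text": "Answer questions using the knowledge base below.\n\n"
                        + KNOWLEDGE_BASE,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # User input goes after the cached prefix: it changes every request.
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```

Note that cached prefixes must meet a minimum token length and expire after a few minutes of inactivity, so steady traffic is what makes the savings materialize.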
Model Cascading: Using Cheap Models First
The Strategy:
- Try the cheap/fast model first
- If the response looks uncertain, escalate to the expensive/smart model
- Can reduce cost while maintaining quality when confidence gating is reliable

Cascading works well when you have:
- High-volume, similar tasks
- Clear confidence signals (some models provide log probabilities)
- Cost pressure alongside quality requirements

Avoid cascading when you have:
- Low latency requirements (cascading adds delay)
- Tasks where confidence is hard to measure
- Low volume (not worth the complexity)
| Factor | Single model | Cascading |
|---|---|---|
| Accuracy | Stable, predictable | Depends on routing quality |
| Latency | Lower (one call) | Higher (fallback adds calls) |
| Cost | Higher per call | Lower on average when the cheap model handles most requests |
| Complexity | Lower | Higher (routing, monitoring) |
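A minimal sketch of confidence-gated cascading, using mean token probability from the OpenAI SDK’s `logprobs` as the gating signal. The model IDs and the 0.90 floor are illustrative and should be tuned on a labeled sample of real traffic:

```python
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CHEAP_MODEL = "gpt-3.5-turbo"  # placeholder model IDs; substitute your tiers
SMART_MODEL = "gpt-4-turbo"
CONFIDENCE_FLOOR = 0.90        # tune on a labeled sample of your traffic

def _ask(model: str, prompt: str, want_logprobs: bool = False):
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        logprobs=want_logprobs,
        max_tokens=5,
    )

def classify(text: str) -> str:
    """Try the cheap model first; escalate when token confidence is low."""
    prompt = f"Label the sentiment as positive, negative, or neutral: {text}"
    cheap = _ask(CHEAP_MODEL, prompt, want_logprobs=True).choices[0]
    tokens = cheap.logprobs.content if cheap.logprobs else []
    # Mean token probability is a crude but cheap confidence signal.
    confidence = (
        sum(math.exp(t.logprob) for t in tokens) / len(tokens) if tokens else 0.0
    )
    if confidence >= CONFIDENCE_FLOOR:
        return cheap.message.content
    # Cheap model was unsure: pay for the smart model on this request only.
    return _ask(SMART_MODEL, prompt).choices[0].message.content
```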
TOON for Token-Efficient Context
Why: For uniform arrays of objects with primitive fields, TOON reduces token usage (often 30-60% vs JSON) and is easy for LLMs to parse.

When TOON excels:
- Uniform tabular arrays (same keys, primitive values)
- Large lists where repeated JSON keys dominate cost

When to avoid TOON:
- Mixed/nested structures, varying field sets, or complex types
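The savings come from declaring keys once in a header and emitting CSV-like rows. A hand-rolled sketch of the tabular form (not the official TOON encoder; it assumes uniform keys, primitive values, and no quoting edge cases):

```python
import json

def to_toon(name: str, rows: list[dict]) -> str:
    """Emit TOON-style tabular output: one header, then one row per object."""
    fields = list(rows[0])  # assumes every row shares the same keys
    header = f"{name}[{len(rows)}]{{{','.join(fields)}}}:"
    lines = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *lines])

users = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
]
print(json.dumps(users))        # repeats every key on every row
print(to_toon("users", users))  # declares keys once in the header:
# users[2]{id,name,role}:
#   1,Alice,admin
#   2,Bob,user
```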