Choosing the right model and optimizing costs can reduce your bill by 10x without sacrificing quality. This page covers model cascading, prompt caching, and the TOON format.Documentation Index
Fetch the complete documentation index at: https://aitutorial.dev/llms.txt
Use this file to discover all available pages before exploring further.
Model Selection Decision Framework
The Model Landscape (January 2025):| Model | Context | Cost (input/output per 1M tokens) | Best For |
|---|---|---|---|
| GPT-4 Turbo | 128K | $10 / $30 | Complex reasoning, structured output |
| GPT-3.5 Turbo | 16K | $0.50 / $1.50 | Simple tasks, high volume |
| Claude Sonnet 4.5 | 200K | $3 / $15 | Long documents, nuanced analysis |
| Claude Haiku | 200K | $0.25 / $1.25 | Fast classification, simple extraction |
| Gemini Pro 1.5 | 2M | $1.25 / $5 | Massive context, multimodal |
| Llama 3 70B | Varies | Self-hosted | On-premise requirements |
Prompt Caching: 50-90% Cost Reduction
The Problem: You’re sending the same 50K token knowledge base with EVERY request.- Cache static content (knowledge bases, system prompts)
- Don’t cache user input (changes every request)
- Structure prompts with cacheable parts first
- Monitor cache hit rates
- Adjust query patterns to maximize cache hits
Model Cascading: Using Cheap Models First
The Strategy:- Try cheap/fast model first
- If uncertain, escalate to expensive/smart model
- Can reduce cost while maintaining quality when confidence gating is reliable
- High-volume, similar tasks
- Clear confidence signals (some models provide log probabilities)
- Cost pressure but quality requirements
- Low latency requirements (cascading adds delay)
- Tasks where confidence is hard to measure
- Low volume (not worth complexity)
| Factor | Single model | Cascading |
|---|---|---|
| Accuracy | Stable, predictable | Depends on routing quality |
| Latency | Lower (one call) | Higher (fallback adds calls) |
| Cost | Higher per call | Lower on average if many low-cost wins |
| Complexity | Lower | Higher (routing, monitoring) |
TOON for Token-Efficient Context
Why: For uniform arrays of objects with primitive fields, TOON reduces token usage (often 30-60% vs JSON) and is easy for LLMs to parse. When TOON excels:- Uniform tabular arrays (same keys, primitive values).
- Large lists where repeated JSON keys dominate cost.
- Mixed/nested structures, varying field sets, or complex types.