Chain-of-Thought (CoT): Making Reasoning Visible
Chain-of-Thought (CoT) is less common with reasoning models, since they already perform an explicit reasoning step. With SLMs and other non-reasoning models, however, CoT can still make a meaningful difference.That said, it’s still valuable to learn CoT techniques—they help you understand how these models think and how to effectively influence their behavior.
- Use CoT for complex reasoning; avoid for deterministic extraction/classification at temperature=0.
- Consider privacy/compliance: avoid logging sensitive intermediate reasoning.
- Cost/latency rise with longer outputs—use selectively.
- Often improves performance on reasoning tasks (magnitude varies by task/model)
- Creates “intermediate tokens” that guide the model
- Makes errors debuggable
- Code generation: 35% fewer bugs with CoT
- Math problems: 50-70% accuracy improvement
- Medical diagnosis: More reliable clinical reasoning
Self-Consistency: Voting for Reliability
The Problem: One response might be wrong due to non-determinism, ambiguous tasks, and/or valid solution paths. The Solution: Generate multiple responses and vote. When to Use:- High-stakes decisions (medical, financial, legal)
- Complex reasoning where errors are costly
- Classification tasks where confidence matters
- 5x Agent tasks = 5x cost
- Use only when accuracy justifies expense
- CoT often improves performance on reasoning benchmarks; magnitude varies by task/model (see Wei et al., 2022)
- Combining CoT + Self-Consistency can yield additional gains; magnitude varies by task/model (see Wang et al., 2022)
- Always validate on your evaluation set; do not assume universal gains
Extended Thinking: Anthropic’s Secret Weapon
Claude-Specific Feature: Claude can expose its “thinking” before answering using special tags.Prompt
- Debugging: See where reasoning went wrong
- Quality: Forces model to think before answering
- Transparency: Clients can audit AI decisions
Prompt Chaining: Breaking Complex Tasks
Single Prompt Limitations:- Context window fills up
- Errors compound
- Hard to debug
- Expensive to retry
- Each step is simple → fewer errors
- Failed steps can retry independently
- Cheaper: Only call expensive steps when needed
- Easier to evaluate and improve
- More latency (sequential calls)
- More complex code
- Multiple LLM calls (but often cheaper overall)