The Memory Problem in Production Agents
Consider this real-world scenario from a customer support agent. Agent memory management is an advanced topic that requires careful customization; we cover the foundational concepts here. The area is advancing quickly, with several SDKs and even some model providers offering some form of built-in memory. To go deeper, see the O’Reilly report “Managing Memory for AI Agents,” available in the Assets folder.
Memory Architecture: Two-Tier System
Modern production agents use a two-tier memory system that mirrors how databases handle different data lifecycles:

| Memory Type | Duration | Purpose | Storage | Search |
|---|---|---|---|---|
| Working Memory | Single session | Active conversation state | Redis key-value | Simple lookup |
| Long-Term Memory | Cross-session | Persistent knowledge | Redis + vector index | Semantic search |
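The two-tier split can be sketched in a few lines. This is a minimal illustration, not the Redis Agent Memory Server API: plain Python dicts stand in for Redis, and a naive word-overlap match stands in for a real vector index.

```python
# Minimal sketch of the two-tier layout. In production, `working` would be
# Redis key-value storage and `long_term` a Redis vector index.
class AgentMemory:
    def __init__(self):
        self.working = {}    # session_id -> message list (working memory)
        self.long_term = []  # persistent, searchable records (long-term memory)

    def append_message(self, session_id, role, content):
        # Working memory: simple lookup keyed by session.
        self.working.setdefault(session_id, []).append(
            {"role": role, "content": content}
        )

    def store_fact(self, text):
        # Long-term memory: survives the session.
        self.long_term.append(text)

    def search_facts(self, query):
        # Stand-in for semantic search: match on shared words.
        words = set(query.lower().split())
        return [t for t in self.long_term if words & set(t.lower().split())]

mem = AgentMemory()
mem.append_message("s1", "user", "Switch to dark mode please")
mem.store_fact("User prefers dark mode interfaces")
print(mem.search_facts("dark mode"))
```

Note how the query hits the stored fact even though the session that produced it is irrelevant to the lookup; that cross-session reach is the whole point of the second tier.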
The examples are based on the Redis Agent Memory Server, which we use to cover the foundational concepts. Other popular tools include:
- Mem0
- Zep
- File-based
- VectorDBs
- OpenAI Agent SDK sessions
- LangChain/LangSmith short-term and long-term memory
- Claude Memory tool
Working Memory: Session-Scoped State
What it is: Durable storage for a specific conversation session—the “scratch pad” where agents track current conversation context.

What belongs here:
- Conversation messages - The actual user/assistant dialogue
- Session-specific data - Temporary context that doesn’t need to persist
- Tool results cache - Results from API calls to avoid redundant requests
Why it matters:
- Avoids redundant tool calls within a session (3x cost reduction in our example)
- Maintains conversation coherence across turns
- Automatically manages the conversation window (truncating when needed)
- Durable by default (persists across server restarts)
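A tool-results cache is the easiest of these wins to show in code. The sketch below uses an in-process dict with expiry timestamps as a stand-in for Redis `SETEX`; `call_api` is a hypothetical expensive tool call.

```python
import time

# Session-scoped tool-result cache: identical requests within the TTL are
# served from the cache instead of re-calling the tool.
class ToolCache:
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (expires_at, value)

    def get_or_call(self, key, fn):
        now = time.monotonic()
        hit = self.entries.get(key)
        if hit and hit[0] > now:
            return hit[1]                      # cache hit: no redundant call
        value = fn()                           # cache miss: call the tool
        self.entries[key] = (now + self.ttl, value)
        return value

calls = []
def call_api():
    calls.append(1)                            # count real tool invocations
    return {"status": "shipped"}

cache = ToolCache()
cache.get_or_call("order:42", call_api)
cache.get_or_call("order:42", call_api)        # second lookup is a cache hit
print(len(calls))  # 1
```

With Redis, the same pattern falls out of `SET key value EX ttl` plus a `GET` before each tool call; the TTL also gives you the session-scoped lifecycle for free.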
Long-Term Memory: Cross-Session Knowledge
What it is: Persistent, vector-indexed storage for knowledge that should be retained and searchable across all interactions—the agent’s “knowledge base.”

What belongs here:
- User preferences - “User prefers dark mode interfaces”
- Important facts - “Customer subscription expires 2024-06-15”
- Historical context - “User working on Python ML project”
Long-term memory distinguishes three memory types:

| Type | Purpose | Example |
|---|---|---|
| Semantic | Facts, preferences, general knowledge | “User prefers metric units” |
| Episodic | Events with temporal context | “User visited Paris in March 2024” |
| Message | Conversation records (auto-generated) | Individual chat messages |
Key capabilities:
- Semantic search - Find relevant memories even without exact keyword matches
- Automatic deduplication - Hash-based and semantic similarity detection
- Rich metadata - Topics, entities, timestamps for precise filtering
- Cross-session persistence - Survives server restarts and session expiration
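Deduplication is worth a closer look, since it combines two checks: an exact hash match and a similarity match. The sketch below uses Jaccard word overlap as a stand-in for embedding similarity; a real system would compare vectors from an embedding model.

```python
import hashlib

def jaccard(a, b):
    # Word-overlap similarity: stand-in for cosine similarity of embeddings.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

class LongTermStore:
    def __init__(self, sim_threshold=0.8):
        self.hashes = set()
        self.memories = []
        self.sim_threshold = sim_threshold

    def add(self, text):
        h = hashlib.sha256(text.encode()).hexdigest()
        if h in self.hashes:
            return False                       # exact duplicate (hash-based)
        if any(jaccard(text, m) >= self.sim_threshold for m in self.memories):
            return False                       # near-duplicate (semantic)
        self.hashes.add(h)
        self.memories.append(text)
        return True

store = LongTermStore()
print(store.add("User prefers metric units"))      # True: new memory
print(store.add("User prefers metric units"))      # False: hash duplicate
print(store.add("User prefers the metric units"))  # False: near-duplicate
```

Without the second check, every paraphrase of the same preference would pile up in the store and dilute search results.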
The Three Integration Patterns
Production systems choose from three patterns for integrating memory with LLMs:

Pattern 1: LLM-Driven Memory (Tool-Based)
When to use: Conversational agents where the LLM should decide what to remember.

How it works: Give the LLM tool access to memory operations.

Disadvantages: Token overhead, inconsistent behavior, higher costs
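In this pattern, the application exposes memory operations as tool definitions and runs a dispatcher when the model calls one. The sketch below uses OpenAI-style function-calling schemas; the tool names and the dispatcher are illustrative, not a fixed API.

```python
# Pattern 1 sketch: memory operations exposed to the LLM as tools.
memory_tools = [
    {
        "type": "function",
        "function": {
            "name": "store_memory",
            "description": "Persist an important fact about the user.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_memory",
            "description": "Search long-term memory for relevant facts.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

FACTS = []

def dispatch(name, args):
    # Runs when the model emits a tool call; the model chose to remember.
    if name == "store_memory":
        FACTS.append(args["text"])
        return "stored"
    if name == "search_memory":
        return [f for f in FACTS if args["query"].lower() in f.lower()]
    raise ValueError(f"unknown tool: {name}")

dispatch("store_memory", {"text": "User prefers dark mode"})
print(dispatch("search_memory", {"query": "dark"}))
```

The token overhead mentioned above comes from shipping these schemas with every request; the inconsistency comes from the model deciding on its own when (or whether) to call them.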
Pattern 2: Code-Driven Memory (Programmatic)
When to use: Applications requiring predictable memory behavior and explicit control.

How it works: Your code decides when to store and retrieve memories.

Disadvantages: More coding required, less natural, maintenance overhead
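Here the read and write steps are ordinary application code wrapped around the model call. The sketch below is illustrative: `fake_llm` stands in for a real model call, and the one-line write rule stands in for whatever policy your application enforces.

```python
# Pattern 2 sketch: the application, not the model, decides when memory
# is read and written.
class FactStore:
    def __init__(self):
        self.facts = []

    def search(self, query):
        words = set(query.lower().split())
        return [f for f in self.facts if words & set(f.lower().split())]

def fake_llm(prompt):
    # Stand-in for a real model call.
    return "(reply grounded in: " + prompt.splitlines()[0] + ")"

def handle_turn(user_msg, store):
    relevant = store.search(user_msg)        # explicit retrieval, every turn
    prompt = "Known facts: " + "; ".join(relevant) + "\nUser: " + user_msg
    reply = fake_llm(prompt)
    if "prefer" in user_msg.lower():         # explicit, auditable write rule
        store.facts.append(user_msg)
    return reply

store = FactStore()
handle_turn("I prefer concise answers", store)
print(store.facts)
```

Because both steps are plain code, behavior is fully predictable and testable; the price is that you must anticipate and maintain every rule yourself.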
Pattern 3: Background Extraction (Automatic)
When to use: Systems that should learn automatically from conversations.

How it works: Store conversations in working memory; the system extracts important information in the background.

Disadvantages: Less control, delayed availability, potential noise
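The shape of this pattern is a queue between the conversation path and a background worker. In the sketch below, a toy keyword rule stands in for the LLM-based extraction pass a real system would run.

```python
import queue
import threading

# Pattern 3 sketch: turns are enqueued immediately; a background worker
# extracts durable facts later, off the request path.
turns = queue.Queue()
long_term = []

def extractor():
    while True:
        turn = turns.get()
        if turn is None:            # sentinel: shut the worker down
            break
        if "prefer" in turn.lower():  # stand-in for LLM-based extraction
            long_term.append(turn)
        turns.task_done()

worker = threading.Thread(target=extractor)
worker.start()
turns.put("I prefer dark mode")
turns.put("What's the weather?")
turns.put(None)
worker.join()
print(long_term)
```

The queue explains two of the listed disadvantages directly: extracted facts are not searchable until the worker gets to them (delayed availability), and whatever the extractor keeps is whatever you get (less control, potential noise).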
Memory Extraction Strategies
The system offers four extraction strategies for background processing:

| Strategy | Purpose | Best For |
|---|---|---|
| Discrete (default) | Extract individual facts | General-purpose agents |
| Summary | Create conversation summaries | Meeting notes, long conversations |
| Preferences | Focus on user preferences | Personalization systems |
| Custom | Domain-specific extraction | Technical, legal, medical domains |
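In practice, choosing a strategy amounts to choosing the instruction the background extractor runs with. The sketch below is an assumption about how such a selector might look; the prompt texts are illustrative, not the server's own.

```python
# Hypothetical strategy selector for the background extraction pass.
EXTRACTION_PROMPTS = {
    "discrete": "Extract individual facts from the conversation.",
    "summary": "Summarize the conversation in a few sentences.",
    "preferences": "List the user's stated preferences.",
    "custom": None,  # caller must supply a domain-specific prompt
}

def extraction_prompt(strategy="discrete", custom_prompt=None):
    if strategy not in EXTRACTION_PROMPTS:
        raise ValueError(f"unknown strategy: {strategy}")
    if strategy == "custom":
        if not custom_prompt:
            raise ValueError("custom strategy requires a prompt")
        return custom_prompt
    return EXTRACTION_PROMPTS[strategy]

print(extraction_prompt())               # discrete is the default
print(extraction_prompt("preferences"))
```

A legal or medical deployment would use `"custom"` with a prompt that names the entities and obligations its domain cares about.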
Combining Patterns: The Hybrid Approach
Most production systems use multiple patterns together:

- Start with Code-Driven for predictable results
- Add Background Extraction for continuous learning
- Consider LLM Tools when conversational control is important
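Combining the first two steps is mostly a matter of composition: code-driven retrieval runs on every turn, and each turn is also queued for background extraction. The sketch below is illustrative; all names are invented for this example.

```python
# Hybrid sketch: Pattern 2 (explicit read) on the request path,
# Pattern 3 (background learning) off it.
class HybridAgent:
    def __init__(self):
        self.facts = []             # long-term memory stand-in
        self.extraction_queue = []  # turns awaiting background processing

    def search(self, query):
        words = set(query.lower().split())
        return [f for f in self.facts if words & set(f.lower().split())]

    def turn(self, user_msg):
        context = self.search(user_msg)         # code-driven retrieval
        self.extraction_queue.append(user_msg)  # deferred extraction
        # A real agent would now call the model with `context` injected.
        return context

agent = HybridAgent()
agent.facts.append("User prefers metric units")
print(agent.turn("Convert this to metric"))
print(agent.extraction_queue)
```

Adding LLM tools (the third step) would mean also passing tool schemas into the model call inside `turn`, giving the model an escape hatch for memories your code-driven rules did not anticipate.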
Quick Check
- Memory Scoping: A user asks “What did we discuss yesterday about the project?” Which memory system do you query and why?
- Pattern Selection: You’re building a customer support bot that needs to remember user preferences and avoid redundant API calls within a conversation. Which integration pattern(s) should you use?
Key Takeaways
- Two-tier architecture - Working memory for sessions, long-term memory for knowledge
- Choose integration patterns based on control needs (LLM-driven vs. code-driven vs. background)
- Memory strategies matter - Discrete, summary, preferences, or custom extraction
- Production systems typically use hybrid approaches combining multiple patterns
- Semantic search enables intelligent retrieval beyond keyword matching