
The Memory Problem in Production Agents

Consider this real-world scenario from a customer support agent:
User: "I ordered a laptop last week"
Agent: [searches orders] "Found order #12345 for MacBook Pro"

User: "When will it arrive?"
Agent: [searches orders AGAIN] "Order #12345 ships tomorrow"

User: "Can I change the address?"
Agent: [searches orders AGAIN] "For order #12345, yes I can help"
Problem: The agent makes three redundant tool calls, wasting tokens, time, and money—costing roughly 3x what it should.
Solution: Production memory architecture with working memory for session state and long-term memory for persistent knowledge.
Agent memory management is an advanced topic that requires careful customization; we cover the foundational concepts here. This area is advancing quickly, with several SDKs and even some model providers offering some form of memory. To go deeper into the topic, see the O’Reilly report “Managing Memory for AI Agents”, available in the Assets folder.

Memory Architecture: Two-Tier System

Modern production agents use a two-tier memory system that mirrors how databases handle different data lifecycles:
| Memory Type | Duration | Purpose | Storage | Search |
| --- | --- | --- | --- | --- |
| Working Memory | Single session | Active conversation state | Redis key-value | Simple lookup |
| Long-Term Memory | Cross-session | Persistent knowledge | Redis + vector index | Semantic search |
Think of working memory as your L1 cache (fast, temporary) and long-term memory as your database (persistent, searchable).
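The cache analogy can be sketched as a two-tier read path. The following is a toy illustration in plain Python, with in-memory dicts standing in for Redis and a substring match standing in for semantic search; all names are illustrative, not part of any library:

```python
class TwoTierMemory:
    """Toy two-tier store: a per-session dict (working memory) backed by
    a persistent per-user store (long-term memory)."""

    def __init__(self):
        self.working = {}    # session_id -> session state (fast, temporary)
        self.long_term = {}  # user_id -> list of persistent facts (searchable)

    def get_session(self, session_id: str) -> dict:
        # L1-cache-style lookup: create the session slot on first access
        return self.working.setdefault(session_id, {})

    def remember(self, user_id: str, fact: str) -> None:
        # Long-term writes survive the session
        self.long_term.setdefault(user_id, []).append(fact)

    def search(self, user_id: str, query: str) -> list[str]:
        # Stand-in for semantic search: naive substring match
        return [f for f in self.long_term.get(user_id, [])
                if query.lower() in f.lower()]


mem = TwoTierMemory()
mem.get_session("s1")["last_order"] = "#12345"
mem.remember("alice", "User prefers dark mode interfaces")
```

In production both tiers live in Redis, so working memory also survives process restarts; the tiers differ in scope and indexing, not just speed.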
Examples are based on the Redis Agent Memory Server, so we can cover the foundational concepts concretely. Other popular tools exist as well.

Working Memory: Session-Scoped State

What it is: Durable storage for a specific conversation session—the “scratch pad” where agents track current conversation context.
What belongs here:
  • Conversation messages - The actual user/assistant dialogue
  • Session-specific data - Temporary context that doesn’t need to persist
  • Tool results cache - Results from API calls to avoid redundant requests
Implementation Pattern:
from agent_memory_client import MemoryAPIClient
from agent_memory_client.models import WorkingMemory, MemoryMessage

class SessionAgent:
    def __init__(self, memory_url: str = "http://localhost:8000"):
        self.memory = MemoryAPIClient(base_url=memory_url)
        self.tool_cache = {}
    
    async def process_turn(
        self, 
        user_message: str, 
        session_id: str,
        user_id: str
    ) -> str:
        # 1. Get or create working memory session
        created, working_memory = await self.memory.get_or_create_working_memory(
            session_id=session_id,
            model_name="claude-sonnet-4-20250514",
            user_id=user_id
        )
        
        # 2. Check tool cache before making redundant calls
        cache_key = f"order_lookup:{user_id}"
        if cache_key not in self.tool_cache:
            # First time - call tool
            order_data = await self.get_order_data(user_id)
            self.tool_cache[cache_key] = order_data
        else:
            # Subsequent calls - use cached result
            order_data = self.tool_cache[cache_key]
        
        # 3. Get enriched context (includes past messages + relevant long-term memories)
        context = await self.memory.memory_prompt(
            query=user_message,
            session_id=session_id,
            long_term_search={
                "text": user_message,
                "filters": {"user_id": {"eq": user_id}},
                "limit": 5
            }
        )
        
        # 4. Generate response with full context
        response = await self.generate_response(
            messages=context.messages,
            user_message=user_message,
            order_data=order_data
        )
        
        # 5. Store the conversation turn
        await self.memory.set_working_memory(
            session_id=session_id,
            working_memory=WorkingMemory(
                session_id=session_id,
                messages=[
                    MemoryMessage(role="user", content=user_message),
                    MemoryMessage(role="assistant", content=response)
                ],
                user_id=user_id
            )
        )
        
        return response
Benefits:
  • Avoids redundant tool calls within a session (3x cost reduction in our example)
  • Maintains conversation coherence across turns
  • Automatically manages conversation window (truncates when needed)
  • Durable by default (persists across server restarts)
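The conversation-window management in the third bullet can be approximated with a simple token-budget truncation. This is a sketch only: the real server applies model-aware limits, and `rough_tokens` below is a crude stand-in for an actual tokenizer:

```python
def rough_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token
    return max(1, len(text) // 4)

def truncate_window(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose combined size fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):      # walk from newest to oldest
        cost = rough_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))         # restore chronological order
```

Truncating from the oldest end preserves the most recent turns, which usually carry the context the next response depends on.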
TTL Management: Working memory is durable by default, but you can set expiration for temporary sessions:
# Temporary session (expires after 1 hour)
working_memory = WorkingMemory(
    session_id="temp_chat_123",
    messages=[...],
    ttl_seconds=3600
)

# Durable session (default - no expiration)
working_memory = WorkingMemory(
    session_id="customer_support_456",
    messages=[...]
    # No TTL - persists until explicitly deleted
)

Long-Term Memory: Cross-Session Knowledge

What it is: Persistent, vector-indexed storage for knowledge that should be retained and searchable across all interactions—the agent’s “knowledge base.”
What belongs here:
  • User preferences - “User prefers dark mode interfaces”
  • Important facts - “Customer subscription expires 2024-06-15”
  • Historical context - “User working on Python ML project”
Memory Types: Long-term memory supports three distinct types:
| Type | Purpose | Example |
| --- | --- | --- |
| Semantic | Facts, preferences, general knowledge | “User prefers metric units” |
| Episodic | Events with temporal context | “User visited Paris in March 2024” |
| Message | Conversation records (auto-generated) | Individual chat messages |
Production Pattern:
from agent_memory_client.models import MemoryRecord
from datetime import datetime

class PersistentKnowledgeAgent:
    def __init__(self, memory_url: str):
        self.memory = MemoryAPIClient(base_url=memory_url)
    
    async def store_user_preferences(self, user_id: str, preferences: dict):
        """Store lasting user preferences"""
        memories = [
            MemoryRecord(
                text=f"User prefers {value} for {key}",
                memory_type="semantic",
                topics=["preferences", key],
                entities=[value],
                user_id=user_id,
                namespace="preferences"
            )
            for key, value in preferences.items()
        ]
        await self.memory.create_long_term_memories(memories)
    
    async def remember_event(self, user_id: str, event_text: str, event_date: datetime):
        """Store important events with temporal context"""
        memory = MemoryRecord(
            text=event_text,
            memory_type="episodic",
            event_date=event_date,
            topics=["events", "timeline"],
            user_id=user_id
        )
        await self.memory.create_long_term_memories([memory])
    
    async def get_relevant_context(self, user_id: str, query: str) -> list[str]:
        """Semantic search for relevant memories"""
        results = await self.memory.search_long_term_memory(
            text=query,
            filters={
                "user_id": {"eq": user_id},
                "memory_type": {"any": ["semantic", "episodic"]}
            },
            limit=5
        )
        return [memory.text for memory in results.memories]
Key Features:
  • Semantic search - Find relevant memories even without exact keyword matches
  • Automatic deduplication - Hash-based and semantic similarity detection
  • Rich metadata - Topics, entities, timestamps for precise filtering
  • Cross-session persistence - Survives server restarts and session expiration
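The hash-based half of the deduplication feature can be illustrated with a normalized-text hash. This is a sketch of the idea only; the real server additionally checks semantic similarity with embeddings, which this toy version does not attempt:

```python
import hashlib

def memory_hash(text: str) -> str:
    # Normalize whitespace and case so trivial variants collide
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def dedupe(memories: list[str]) -> list[str]:
    """Drop exact duplicates (after normalization), keeping first occurrence."""
    seen, unique = set(), []
    for text in memories:
        h = memory_hash(text)
        if h not in seen:
            seen.add(h)
            unique.append(text)
    return unique
```

Hashing catches verbatim repeats cheaply; the semantic pass is what catches “User likes dark themes” versus “User prefers dark mode”.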

The Three Integration Patterns

Production systems choose from three patterns for integrating memory with LLMs:

Pattern 1: LLM-Driven Memory (Tool-Based)

When to use: Conversational agents where the LLM should decide what to remember.
How it works: Give the LLM tool access to memory operations.
import json

# Get memory tools for the LLM
memory_tools = MemoryAPIClient.get_all_memory_tool_schemas()

# LLM can now call tools like:
# - search_long_term_memory(query="user food preferences")
# - create_long_term_memories(memories=[{text: "...", ...}])
# - add_memory_to_working_memory(session_id="...", memory={...})

response = await openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You have memory tools. Use them to remember important info."},
        {"role": "user", "content": "I'm Alice and I love Italian food"}
    ],
    tools=memory_tools
)

# Handle tool calls
if response.choices[0].message.tool_calls:
    for tool_call in response.choices[0].message.tool_calls:
        result = await memory_client.resolve_function_call(
            function_name=tool_call.function.name,
            args=json.loads(tool_call.function.arguments),
            session_id="chat_alice",
            user_id="alice"
        )
Advantages: Natural conversation flow, user control, contextual decisions
Disadvantages: Token overhead, inconsistent behavior, higher costs
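The dispatch step in this pattern can be illustrated locally with a hand-rolled registry standing in for `resolve_function_call`. Everything here is a toy: the in-memory store, the substring "search", and the registry are illustrative, not the library's API:

```python
import json

# Local stand-in for one of the memory tools the LLM can call
def search_long_term_memory(query: str, limit: int = 5) -> list[str]:
    store = ["User prefers Italian food", "User is named Alice"]
    words = query.lower().split()
    return [m for m in store if any(w in m.lower() for w in words)][:limit]

TOOL_REGISTRY = {
    "search_long_term_memory": search_long_term_memory,
}

def dispatch_tool_call(function_name: str, arguments_json: str):
    """Parse the LLM's JSON arguments and route to the matching handler."""
    if function_name not in TOOL_REGISTRY:
        raise ValueError(f"Unknown tool: {function_name}")
    args = json.loads(arguments_json)
    return TOOL_REGISTRY[function_name](**args)
```

In a full loop, each result would be appended to the message list as a `tool` role message and the model called again, so it can fold the retrieved memories into its reply.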

Pattern 2: Code-Driven Memory (Programmatic)

When to use: Applications requiring predictable memory behavior and explicit control.
How it works: Your code decides when to store/retrieve memories.
class CodeDrivenAgent:
    async def get_contextual_response(
        self,
        user_message: str,
        user_id: str,
        session_id: str
    ) -> str:
        # 1. Get working memory session
        created, working_memory = await self.memory.get_or_create_working_memory(session_id)
        
        # 2. Search for relevant context
        context = await self.memory.memory_prompt(
            query=user_message,
            session_id=session_id,
            long_term_search={
                "text": user_message,
                "filters": {"user_id": {"eq": user_id}},
                "limit": 5,
                "recency_boost": True
            }
        )
        
        # 3. Generate response with enriched context
        response = await self.openai_client.chat.completions.create(
            model="gpt-4o",
            messages=context.messages  # Pre-loaded with relevant memories
        )
        
        # 4. Store interaction if it contains preferences
        if "prefer" in user_message.lower() or "like" in user_message.lower():
            await self.memory.create_long_term_memories([
                MemoryRecord(
                    text=f"User expressed: {user_message}",
                    memory_type="semantic",
                    topics=["preferences"],
                    user_id=user_id
                )
            ])
        
        return response.choices[0].message.content
Advantages: Predictable, efficient, reliable, optimizable
Disadvantages: More coding required, less natural, maintenance overhead

Pattern 3: Background Extraction (Automatic)

When to use: Systems that should learn automatically from conversations.
How it works: Store conversations in working memory; the system extracts important information in the background.
from agent_memory_client.models import MemoryStrategyConfig

async def store_with_auto_extraction(
    session_id: str,
    user_message: str,
    assistant_message: str,
    user_id: str
):
    """Store conversation - system automatically extracts memories"""
    
    # Store in working memory with extraction strategy
    working_memory = WorkingMemory(
        session_id=session_id,
        messages=[
            MemoryMessage(role="user", content=user_message),
            MemoryMessage(role="assistant", content=assistant_message)
        ],
        user_id=user_id,
        long_term_memory_strategy=MemoryStrategyConfig(
            strategy="discrete",  # Extract individual facts
            config={}
        )
    )
    
    await client.set_working_memory(session_id, working_memory)
    
    # System automatically:
    # 1. Analyzes conversation for important information
    # 2. Extracts structured memories (preferences, facts, events)
    # 3. Applies contextual grounding (resolves pronouns)
    # 4. Stores in long-term memory
    # 5. Deduplicates similar memories
Advantages: Zero overhead, continuous learning, contextual grounding, scales naturally
Disadvantages: Less control, delayed availability, potential noise
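The contextual grounding step (point 3 in the comments above) resolves references like “it” to concrete entities before storage, so an extracted memory makes sense outside the conversation. A deliberately naive toy, assuming a last-mentioned-entity heuristic (the real system uses an LLM for this):

```python
import re

def ground_pronouns(sentence: str, prior_entities: list[str]) -> str:
    """Replace a bare 'it' with the most recently mentioned entity.
    Illustration only: real grounding must handle case, gender, and ambiguity."""
    if not prior_entities:
        return sentence
    latest = prior_entities[-1]
    return re.sub(r"\bit\b", latest, sentence)
```

Without grounding, a stored memory like “User wants to return it” is useless when retrieved in a later session; after grounding it reads “User wants to return MacBook Pro”.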

Memory Extraction Strategies

The system offers four extraction strategies for background processing:
| Strategy | Purpose | Best For |
| --- | --- | --- |
| Discrete (default) | Extract individual facts | General-purpose agents |
| Summary | Create conversation summaries | Meeting notes, long conversations |
| Preferences | Focus on user preferences | Personalization systems |
| Custom | Domain-specific extraction | Technical, legal, medical domains |
Example: Custom Strategy for Technical Agent
tech_strategy = MemoryStrategyConfig(
    strategy="custom",
    config={
        "custom_prompt": """
        Extract technical decisions from: {message}
        Focus on:
        - Technology choices made
        - Architecture decisions
        - Implementation details
        Return JSON with memories array containing type, text, topics, entities.
        """
    }
)

working_memory = WorkingMemory(
    session_id="tech_consult",
    messages=[
        MemoryMessage(
            role="user", 
            content="Let's use PostgreSQL for the database and Redis for caching"
        )
    ],
    long_term_memory_strategy=tech_strategy
)

Combining Patterns: The Hybrid Approach

Most production systems use multiple patterns together:
class ProductionAgent:
    """Combines code-driven retrieval with background extraction"""
    
    async def chat(self, user_message: str, user_id: str, session_id: str) -> str:
        # 1. Code-driven: Get relevant context
        context = await self.memory.memory_prompt(
            query=user_message,
            session_id=session_id,
            long_term_search={
                "text": user_message,
                "filters": {"user_id": {"eq": user_id}},
                "limit": 5
            }
        )
        
        # 2. Generate response with context
        response = await self.generate_response(context.messages, user_message)
        
        # 3. Background: Store for automatic extraction
        await self.memory.set_working_memory(
            session_id,
            WorkingMemory(
                session_id=session_id,
                messages=[
                    MemoryMessage(role="user", content=user_message),
                    MemoryMessage(role="assistant", content=response)
                ],
                user_id=user_id,
                long_term_memory_strategy=MemoryStrategyConfig(
                    strategy="discrete",
                    config={}
                )
            )
        )
        
        return response
Pattern selection guide:
  • Start with Code-Driven for predictable results
  • Add Background Extraction for continuous learning
  • Consider LLM Tools when conversational control is important

Quick Check

  1. Memory Scoping: A user asks “What did we discuss yesterday about the project?” Which memory system do you query and why?
  2. Pattern Selection: You’re building a customer support bot that needs to remember user preferences and avoid redundant API calls within a conversation. Which integration pattern(s) should you use?

Key Takeaways

  1. Two-tier architecture - Working memory for sessions, long-term memory for knowledge
  2. Choose integration patterns based on control needs (LLM-driven vs. code-driven vs. background)
  3. Memory strategies matter - Discrete, summary, preferences, or custom extraction
  4. Production systems typically use hybrid approaches combining multiple patterns
  5. Semantic search enables intelligent retrieval beyond keyword matching