
The Memory Problem in Production Agents

Consider this real-world scenario from a customer support agent:
User: "I ordered a laptop last week"
Agent: [searches orders] "Found order #12345 for MacBook Pro"

User: "When will it arrive?"
Agent: [searches orders AGAIN] "Order #12345 ships tomorrow"

User: "Can I change the address?"
Agent: [searches orders AGAIN] "For order #12345, yes I can help"
Problem: The agent makes three redundant tool calls, wasting tokens, time, and money—costing roughly 3x what it should.
Solution: Production memory architecture with working memory for session state and long-term memory for persistent knowledge.
Agent memory management is an advanced topic that requires careful customization; we cover the foundational concepts here. This area is advancing quickly, with several SDKs and even some model providers offering some form of memory. To go deeper into the topic, see the O’Reilly report “Managing Memory for AI Agents”, available in the Assets folder.

Memory Architecture: Two-Tier System

Modern production agents use a two-tier memory system that mirrors how databases handle different data lifecycles:
| Memory Type | Duration | Purpose | Storage | Search |
| --- | --- | --- | --- | --- |
| Working Memory | Single session | Active conversation state | Redis key-value | Simple lookup |
| Long-Term Memory | Cross-session | Persistent knowledge | Redis + vector index | Semantic search |
Think of working memory as your L1 cache (fast, temporary) and long-term memory as your database (persistent, searchable).
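The cache analogy can be sketched as a two-tier read path. The following is a toy illustration in plain Python, with in-memory dicts standing in for Redis and a substring match standing in for semantic search; all names are illustrative, not part of any library:

```python
class TwoTierMemory:
    """Toy two-tier store: a per-session dict (working memory) backed by
    a persistent per-user store (long-term memory)."""

    def __init__(self):
        self.working = {}    # session_id -> session state (fast, temporary)
        self.long_term = {}  # user_id -> list of persistent facts (searchable)

    def get_session(self, session_id: str) -> dict:
        # L1-cache-style lookup: create the session slot on first access
        return self.working.setdefault(session_id, {})

    def remember(self, user_id: str, fact: str) -> None:
        # Long-term writes survive the session
        self.long_term.setdefault(user_id, []).append(fact)

    def search(self, user_id: str, query: str) -> list[str]:
        # Stand-in for semantic search: naive substring match
        return [f for f in self.long_term.get(user_id, [])
                if query.lower() in f.lower()]


mem = TwoTierMemory()
mem.get_session("s1")["last_order"] = "#12345"
mem.remember("alice", "User prefers dark mode interfaces")
```

In production both tiers live in Redis, so working memory also survives process restarts; the tiers differ in scope and indexing, not just speed.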
Examples are based on the Redis Agent Memory Server, so we can cover the foundational concepts concretely. Other popular tools exist as well.

Working Memory: Session-Scoped State

What it is: Durable storage for a specific conversation session—the “scratch pad” where agents track current conversation context.
What belongs here:
  • Conversation messages - The actual user/assistant dialogue
  • Session-specific data - Temporary context that doesn’t need to persist
  • Tool results cache - Results from API calls to avoid redundant requests
Implementation Pattern:
from agent_memory_client import MemoryAPIClient
from agent_memory_client.models import WorkingMemory, MemoryMessage

class SessionAgent:
    def __init__(self, memory_url: str = "http://localhost:8000"):
        self.memory = MemoryAPIClient(base_url=memory_url)
        self.tool_cache = {}
    
    async def process_turn(
        self, 
        user_message: str, 
        session_id: str,
        user_id: str
    ) -> str:
        # 1. Get or create working memory session
        created, working_memory = await self.memory.get_or_create_working_memory(
            session_id=session_id,
            model_name="claude-sonnet-4-20250514",
            user_id=user_id
        )
        
        # 2. Check tool cache before making redundant calls
        cache_key = f"order_lookup:{user_id}"
        if cache_key not in self.tool_cache:
            # First time - call tool
            order_data = await self.get_order_data(user_id)
            self.tool_cache[cache_key] = order_data
        else:
            # Subsequent calls - use cached result
            order_data = self.tool_cache[cache_key]
        
        # 3. Get enriched context (includes past messages + relevant long-term memories)
        context = await self.memory.memory_prompt(
            query=user_message,
            session_id=session_id,
            long_term_search={
                "text": user_message,
                "filters": {"user_id": {"eq": user_id}},
                "limit": 5
            }
        )
        
        # 4. Generate response with full context
        response = await self.generate_response(
            messages=context.messages,
            user_message=user_message,
            order_data=order_data
        )
        
        # 5. Store the conversation turn
        await self.memory.set_working_memory(
            session_id=session_id,
            working_memory=WorkingMemory(
                session_id=session_id,
                messages=[
                    MemoryMessage(role="user", content=user_message),
                    MemoryMessage(role="assistant", content=response)
                ],
                user_id=user_id
            )
        )
        
        return response
Benefits:
  • Avoids redundant tool calls within a session (3x cost reduction in our example)
  • Maintains conversation coherence across turns
  • Automatically manages conversation window (truncates when needed)
  • Durable by default (persists across server restarts)
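The conversation-window management in the third bullet can be approximated with a simple token-budget truncation. This is a sketch only: the real server applies model-aware limits, and `rough_tokens` below is a crude stand-in for an actual tokenizer:

```python
def rough_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token
    return max(1, len(text) // 4)

def truncate_window(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose combined size fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):      # walk from newest to oldest
        cost = rough_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))         # restore chronological order
```

Truncating from the oldest end preserves the most recent turns, which usually carry the context the next response depends on.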
TTL Management: Working memory is durable by default, but you can set expiration for temporary sessions:
# Temporary session (expires after 1 hour)
working_memory = WorkingMemory(
    session_id="temp_chat_123",
    messages=[...],
    ttl_seconds=3600
)

# Durable session (default - no expiration)
working_memory = WorkingMemory(
    session_id="customer_support_456",
    messages=[...]
    # No TTL - persists until explicitly deleted
)

Long-Term Memory: Cross-Session Knowledge

What it is: Persistent, vector-indexed storage for knowledge that should be retained and searchable across all interactions—the agent’s “knowledge base.”
What belongs here:
  • User preferences - “User prefers dark mode interfaces”
  • Important facts - “Customer subscription expires 2024-06-15”
  • Historical context - “User working on Python ML project”
Memory Types: Long-term memory supports three distinct types:
| Type | Purpose | Example |
| --- | --- | --- |
| Semantic | Facts, preferences, general knowledge | “User prefers metric units” |
| Episodic | Events with temporal context | “User visited Paris in March 2024” |
| Message | Conversation records (auto-generated) | Individual chat messages |
Production Pattern:
from agent_memory_client.models import MemoryRecord
from datetime import datetime

class PersistentKnowledgeAgent:
    def __init__(self, memory_url: str):
        self.memory = MemoryAPIClient(base_url=memory_url)
    
    async def store_user_preferences(self, user_id: str, preferences: dict):
        """Store lasting user preferences"""
        memories = [
            MemoryRecord(
                text=f"User prefers {value} for {key}",
                memory_type="semantic",
                topics=["preferences", key],
                entities=[value],
                user_id=user_id,
                namespace="preferences"
            )
            for key, value in preferences.items()
        ]
        await self.memory.create_long_term_memories(memories)
    
    async def remember_event(self, user_id: str, event_text: str, event_date: datetime):
        """Store important events with temporal context"""
        memory = MemoryRecord(
            text=event_text,
            memory_type="episodic",
            event_date=event_date,
            topics=["events", "timeline"],
            user_id=user_id
        )
        await self.memory.create_long_term_memories([memory])
    
    async def get_relevant_context(self, user_id: str, query: str) -> list[str]:
        """Semantic search for relevant memories"""
        results = await self.memory.search_long_term_memory(
            text=query,
            filters={
                "user_id": {"eq": user_id},
                "memory_type": {"any": ["semantic", "episodic"]}
            },
            limit=5
        )
        return [memory.text for memory in results.memories]
Key Features:
  • Semantic search - Find relevant memories even without exact keyword matches
  • Automatic deduplication - Hash-based and semantic similarity detection
  • Rich metadata - Topics, entities, timestamps for precise filtering
  • Cross-session persistence - Survives server restarts and session expiration
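The hash-based half of the deduplication feature can be illustrated with a normalized-text hash. This is a sketch of the idea only; the real server additionally checks semantic similarity with embeddings, which this toy version does not attempt:

```python
import hashlib

def memory_hash(text: str) -> str:
    # Normalize whitespace and case so trivial variants collide
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def dedupe(memories: list[str]) -> list[str]:
    """Drop exact duplicates (after normalization), keeping first occurrence."""
    seen, unique = set(), []
    for text in memories:
        h = memory_hash(text)
        if h not in seen:
            seen.add(h)
            unique.append(text)
    return unique
```

Hashing catches verbatim repeats cheaply; the semantic pass is what catches “User likes dark themes” versus “User prefers dark mode”.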

The Three Integration Patterns

Production systems choose from three patterns for integrating memory with LLMs:

Pattern 1: LLM-Driven Memory (Tool-Based)

When to use: Conversational agents where the LLM should decide what to remember.
How it works: Give the LLM tool access to memory operations.
import json

# Get memory tools for the LLM
memory_tools = MemoryAPIClient.get_all_memory_tool_schemas()

# LLM can now call tools like:
# - search_long_term_memory(query="user food preferences")
# - create_long_term_memories(memories=[{text: "...", ...}])
# - add_memory_to_working_memory(session_id="...", memory={...})

response = await openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You have memory tools. Use them to remember important info."},
        {"role": "user", "content": "I'm Alice and I love Italian food"}
    ],
    tools=memory_tools
)

# Handle tool calls
if response.choices[0].message.tool_calls:
    for tool_call in response.choices[0].message.tool_calls:
        result = await memory_client.resolve_function_call(
            function_name=tool_call.function.name,
            args=json.loads(tool_call.function.arguments),
            session_id="chat_alice",
            user_id="alice"
        )
Advantages: Natural conversation flow, user control, contextual decisions
Disadvantages: Token overhead, inconsistent behavior, higher costs
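The dispatch step in this pattern can be illustrated locally with a hand-rolled registry standing in for `resolve_function_call`. Everything here is a toy: the in-memory store, the substring "search", and the registry are illustrative, not the library's API:

```python
import json

# Local stand-in for one of the memory tools the LLM can call
def search_long_term_memory(query: str, limit: int = 5) -> list[str]:
    store = ["User prefers Italian food", "User is named Alice"]
    words = query.lower().split()
    return [m for m in store if any(w in m.lower() for w in words)][:limit]

TOOL_REGISTRY = {
    "search_long_term_memory": search_long_term_memory,
}

def dispatch_tool_call(function_name: str, arguments_json: str):
    """Parse the LLM's JSON arguments and route to the matching handler."""
    if function_name not in TOOL_REGISTRY:
        raise ValueError(f"Unknown tool: {function_name}")
    args = json.loads(arguments_json)
    return TOOL_REGISTRY[function_name](**args)
```

In a full loop, each result would be appended to the message list as a `tool` role message and the model called again, so it can fold the retrieved memories into its reply.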

Pattern 2: Code-Driven Memory (Programmatic)

When to use: Applications requiring predictable memory behavior and explicit control.
How it works: Your code decides when to store/retrieve memories.
class CodeDrivenAgent:
    async def get_contextual_response(
        self,
        user_message: str,
        user_id: str,
        session_id: str
    ) -> str:
        # 1. Get working memory session
        created, working_memory = await self.memory.get_or_create_working_memory(session_id)
        
        # 2. Search for relevant context
        context = await self.memory.memory_prompt(
            query=user_message,
            session_id=session_id,
            long_term_search={
                "text": user_message,
                "filters": {"user_id": {"eq": user_id}},
                "limit": 5,
                "recency_boost": True
            }
        )
        
        # 3. Generate response with enriched context
        response = await self.openai_client.chat.completions.create(
            model="gpt-4o",
            messages=context.messages  # Pre-loaded with relevant memories
        )
        
        # 4. Store interaction if it contains preferences
        if "prefer" in user_message.lower() or "like" in user_message.lower():
            await self.memory.create_long_term_memories([
                MemoryRecord(
                    text=f"User expressed: {user_message}",
                    memory_type="semantic",
                    topics=["preferences"],
                    user_id=user_id
                )
            ])
        
        return response.choices[0].message.content
Advantages: Predictable, efficient, reliable, optimizable
Disadvantages: More coding required, less natural, maintenance overhead

Pattern 3: Background Extraction (Automatic)

When to use: Systems that should learn automatically from conversations.
How it works: Store conversations in working memory; the system extracts important information in the background.
from agent_memory_client.models import MemoryStrategyConfig

async def store_with_auto_extraction(
    session_id: str,
    user_message: str,
    assistant_message: str,
    user_id: str
):
    """Store conversation - system automatically extracts memories"""
    
    # Store in working memory with extraction strategy
    working_memory = WorkingMemory(
        session_id=session_id,
        messages=[
            MemoryMessage(role="user", content=user_message),
            MemoryMessage(role="assistant", content=assistant_message)
        ],
        user_id=user_id,
        long_term_memory_strategy=MemoryStrategyConfig(
            strategy="discrete",  # Extract individual facts
            config={}
        )
    )
    
    await client.set_working_memory(session_id, working_memory)
    
    # System automatically:
    # 1. Analyzes conversation for important information
    # 2. Extracts structured memories (preferences, facts, events)
    # 3. Applies contextual grounding (resolves pronouns)
    # 4. Stores in long-term memory
    # 5. Deduplicates similar memories
Advantages: Zero overhead, continuous learning, contextual grounding, scales naturally
Disadvantages: Less control, delayed availability, potential noise
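The contextual grounding step (point 3 in the comments above) resolves references like “it” to concrete entities before storage, so an extracted memory makes sense outside the conversation. A deliberately naive toy, assuming a last-mentioned-entity heuristic (the real system uses an LLM for this):

```python
import re

def ground_pronouns(sentence: str, prior_entities: list[str]) -> str:
    """Replace a bare 'it' with the most recently mentioned entity.
    Illustration only: real grounding must handle case, gender, and ambiguity."""
    if not prior_entities:
        return sentence
    latest = prior_entities[-1]
    return re.sub(r"\bit\b", latest, sentence)
```

Without grounding, a stored memory like “User wants to return it” is useless when retrieved in a later session; after grounding it reads “User wants to return MacBook Pro”.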

Memory Extraction Strategies

The system offers four extraction strategies for background processing:
| Strategy | Purpose | Best For |
| --- | --- | --- |
| Discrete (default) | Extract individual facts | General-purpose agents |
| Summary | Create conversation summaries | Meeting notes, long conversations |
| Preferences | Focus on user preferences | Personalization systems |
| Custom | Domain-specific extraction | Technical, legal, medical domains |
Example: Custom Strategy for Technical Agent
tech_strategy = MemoryStrategyConfig(
    strategy="custom",
    config={
        "custom_prompt": """
        Extract technical decisions from: {message}
        Focus on:
        - Technology choices made
        - Architecture decisions
        - Implementation details
        Return JSON with memories array containing type, text, topics, entities.
        """
    }
)

working_memory = WorkingMemory(
    session_id="tech_consult",
    messages=[
        MemoryMessage(
            role="user", 
            content="Let's use PostgreSQL for the database and Redis for caching"
        )
    ],
    long_term_memory_strategy=tech_strategy
)

Combining Patterns: The Hybrid Approach

Most production systems use multiple patterns together:
class ProductionAgent:
    """Combines code-driven retrieval with background extraction"""
    
    async def chat(self, user_message: str, user_id: str, session_id: str) -> str:
        # 1. Code-driven: Get relevant context
        context = await self.memory.memory_prompt(
            query=user_message,
            session_id=session_id,
            long_term_search={
                "text": user_message,
                "filters": {"user_id": {"eq": user_id}},
                "limit": 5
            }
        )
        
        # 2. Generate response with context
        response = await self.generate_response(context.messages, user_message)
        
        # 3. Background: Store for automatic extraction
        await self.memory.set_working_memory(
            session_id,
            WorkingMemory(
                session_id=session_id,
                messages=[
                    MemoryMessage(role="user", content=user_message),
                    MemoryMessage(role="assistant", content=response)
                ],
                user_id=user_id,
                long_term_memory_strategy=MemoryStrategyConfig(
                    strategy="discrete",
                    config={}
                )
            )
        )
        
        return response
Pattern selection guide:
  • Start with Code-Driven for predictable results
  • Add Background Extraction for continuous learning
  • Consider LLM Tools when conversational control is important

Quick Check

  1. Memory Scoping: A user asks “What did we discuss yesterday about the project?” Which memory system do you query and why?
  2. Pattern Selection: You’re building a customer support bot that needs to remember user preferences and avoid redundant API calls within a conversation. Which integration pattern(s) should you use?

Key Takeaways

  1. Two-tier architecture - Working memory for sessions, long-term memory for knowledge
  2. Choose integration patterns based on control needs (LLM-driven vs. code-driven vs. background)
  3. Memory strategies matter - Discrete, summary, preferences, or custom extraction
  4. Production systems typically use hybrid approaches combining multiple patterns
  5. Semantic search enables intelligent retrieval beyond keyword matching