Sunday, April 5

By the end of this tutorial, you’ll have a working Python implementation of a persistent memory system for Claude agents that maintains user context and conversation history across completely separate API calls — no database, no Redis, no vector store required. Just structured prompt engineering and a dead-simple session file.

The core insight most tutorials miss: Claude doesn’t need to “remember” anything. You need to reconstruct context efficiently at the start of each conversation. Once you internalize that, the architecture becomes obvious.

  1. Set up the project — Install the Anthropic SDK and scaffold the session manager
  2. Build the memory schema — Design a JSON structure that captures what actually matters
  3. Implement session persistence — Save and load session state between runs
  4. Inject memory into the system prompt — Reconstruct context without bloating token count
  5. Extract and update memory after each turn — Have Claude summarize what’s worth keeping
  6. Wire it into a stateful conversation loop — Test the full end-to-end flow

Why “no database” actually works at scale

The reflexive answer to persistent memory is “spin up Postgres” or “add a vector store.” That’s the right call for multi-user production systems. But for many real-world agent use cases — personal assistants, internal tools, single-tenant automations — it’s massive overkill that adds ops burden and latency.

Claude’s context window (200K tokens on Sonnet 4.5) means you can fit a surprising amount of structured history directly into the prompt. The trick is compression: instead of replaying raw conversation history, you maintain a distilled memory object that Claude itself helps update. This keeps injection costs predictable and avoids the prompt-bloat death spiral.

If you’re already thinking about the fuller architecture, we’ve covered the production-grade version with session sharding and database backends separately. This tutorial is deliberately scope-limited to the file-based approach so you can ship something in an hour.

Step 1: Set up the project

pip install anthropic python-dotenv

Create a .env file with your key:

ANTHROPIC_API_KEY=sk-ant-...

Scaffold your project:

mkdir persistent-agent && cd persistent-agent
touch agent.py memory.py session_manager.py

Step 2: Build the memory schema

Don’t try to remember everything. A memory object that captures the right things is far more effective than one that tries to log every turn.

# memory.py
import json
from dataclasses import dataclass, field, asdict
from typing import Optional
from datetime import datetime, timezone

@dataclass
class UserMemory:
    user_id: str
    name: Optional[str] = None
    preferences: dict = field(default_factory=dict)
    # Running summary of what's been discussed
    conversation_summary: str = ""
    # Key facts Claude has extracted — cap this at ~20 items
    key_facts: list = field(default_factory=list)
    # Last N raw messages for immediate context continuity
    recent_messages: list = field(default_factory=list)
    session_count: int = 0
    last_seen: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, data: str) -> "UserMemory":
        return cls(**json.loads(data))

    def add_message(self, role: str, content: str, max_recent: int = 6):
        """Keep only the last N messages to bound token cost."""
        self.recent_messages.append({"role": role, "content": content})
        if len(self.recent_messages) > max_recent:
            self.recent_messages = self.recent_messages[-max_recent:]

The recent_messages cap is critical. Six messages (3 turns) keeps you under ~1,500 tokens for most conversations while preserving enough immediate context for coherent replies. The long-term continuity comes from conversation_summary and key_facts, not from replaying full history.
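To see the rolling window in isolation, here’s a standalone sketch of the same slicing logic add_message uses (the helper name trim is just for this demo):

```python
# Standalone sketch of the add_message trimming logic: append, then keep
# only the last max_recent entries so token cost stays bounded
def trim(messages: list, max_recent: int = 6) -> list:
    if len(messages) > max_recent:
        return messages[-max_recent:]
    return messages

history = []
for i in range(10):
    history.append({"role": "user", "content": f"msg {i}"})
    history = trim(history)

print(len(history))           # 6 — only the most recent window survives
print(history[0]["content"])  # msg 4
```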

Step 3: Implement session persistence

# session_manager.py
from datetime import datetime, timezone
from pathlib import Path
from memory import UserMemory

SESSIONS_DIR = Path("./sessions")
SESSIONS_DIR.mkdir(exist_ok=True)

def load_session(user_id: str) -> UserMemory:
    """Load existing session or create a new one."""
    session_file = SESSIONS_DIR / f"{user_id}.json"
    if session_file.exists():
        return UserMemory.from_json(session_file.read_text())
    return UserMemory(user_id=user_id)

def save_session(memory: UserMemory) -> None:
    """Persist the updated memory to disk."""
    memory.last_seen = datetime.now(timezone.utc).isoformat()
    session_file = SESSIONS_DIR / f"{memory.user_id}.json"
    session_file.write_text(memory.to_json())

In production you’d swap SESSIONS_DIR for S3, GCS, or a KV store — the interface stays identical. File I/O here costs microseconds and is fine for single-server deployments or serverless functions where you pass the session blob via event payload.
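Here’s the load-then-save round trip demonstrated end to end with a minimal stand-in dataclass (Memory here is a hypothetical two-field version of the full UserMemory, just to keep the demo self-contained):

```python
# Self-contained round-trip sketch of the load/save pattern. `Memory` is a
# minimal stand-in for the tutorial's full UserMemory dataclass.
import json
import tempfile
from dataclasses import dataclass, field, asdict
from pathlib import Path

@dataclass
class Memory:
    user_id: str
    key_facts: list = field(default_factory=list)

def load(sessions_dir: Path, user_id: str) -> Memory:
    f = sessions_dir / f"{user_id}.json"
    if f.exists():
        return Memory(**json.loads(f.read_text()))
    return Memory(user_id=user_id)

def save(sessions_dir: Path, m: Memory) -> None:
    (sessions_dir / f"{m.user_id}.json").write_text(json.dumps(asdict(m)))

with tempfile.TemporaryDirectory() as d:
    sessions = Path(d)
    m = load(sessions, "demo")            # no file yet: fresh Memory
    m.key_facts.append("prefers Python")
    save(sessions, m)
    m2 = load(sessions, "demo")           # fresh load, same state
    print(m2.key_facts)                   # ['prefers Python']
```

The same two functions work unchanged whether the Path points at a local directory or a mounted volume — that’s the interface stability the paragraph above is describing.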

Step 4: Inject memory into the system prompt

This is where most implementations go wrong. Don’t just dump the JSON blob into the system prompt and hope for the best. Structure it so Claude can parse intent quickly — formatted context performs meaningfully better than raw JSON in my testing.

# agent.py (partial)
def build_system_prompt(memory: UserMemory) -> str:
    facts_block = "\n".join(f"- {f}" for f in memory.key_facts) or "None yet."

    return f"""You are a helpful assistant with memory of past interactions.

## What you know about this user
- Name: {memory.name or 'Unknown'}
- Session count: {memory.session_count}
- Last seen: {memory.last_seen}
- Preferences: {json.dumps(memory.preferences) if memory.preferences else 'None recorded'}

## Conversation history summary
{memory.conversation_summary or 'This is the first session.'}

## Key facts remembered
{facts_block}

## Behaviour instructions
1. Use the context above to personalise responses without explicitly narrating it back.
2. If the user tells you something new and important, note it mentally — you'll be asked to extract it later.
3. Do not repeat the user's history back to them unprompted. Just use it.
"""

Notice the explicit instruction not to narrate the memory back. Without that, agents develop an annoying habit of opening every response with “As you mentioned in our previous session…” — users hate it. For more on crafting system prompts that produce consistent agent behaviour, see this breakdown of high-performing Claude agent instructions.

Step 5: Extract and update memory after each turn

After each user message, run a lightweight extraction call to update the memory object. This is the key to keeping the memory useful rather than just large.

# agent.py (partial)
import anthropic
import json

client = anthropic.Anthropic()

def extract_memory_updates(memory: UserMemory, new_user_message: str, assistant_reply: str) -> UserMemory:
    """
    Ask a cheap, fast model to extract memory-worthy updates.
    Uses Claude Haiku — roughly $0.00025 per extraction at current pricing.
    """
    extraction_prompt = f"""Given this exchange, extract memory updates in JSON.

User said: {new_user_message}
Assistant replied: {assistant_reply}

Current summary: {memory.conversation_summary or 'None'}
Current key facts: {json.dumps(memory.key_facts)}

Return ONLY valid JSON with these fields:
{{
  "updated_summary": "...",   // concise running summary, max 3 sentences
  "new_facts": [],            // NEW facts only, strings, max 3 per turn
  "name": null,               // user's name if mentioned, else null
  "preferences": {{}}         // any new preference key-value pairs
}}"""

    response = client.messages.create(
        model="claude-haiku-4-5",  # cheap — don't burn Sonnet tokens on extraction
        max_tokens=400,
        messages=[{"role": "user", "content": extraction_prompt}]
    )

    raw = response.content[0].text.strip()
    # Strip markdown code fences if present
    if raw.startswith("```"):
        raw = raw.split("```")[1]
        if raw.startswith("json"):
            raw = raw[4:]

    updates = json.loads(raw)

    if updates.get("updated_summary"):
        memory.conversation_summary = updates["updated_summary"]
    if updates.get("new_facts"):
        memory.key_facts.extend(updates["new_facts"])
        memory.key_facts = memory.key_facts[-20:]  # cap at 20 facts
    if updates.get("name"):
        memory.name = updates["name"]
    if updates.get("preferences"):
        memory.preferences.update(updates["preferences"])

    return memory

Running extraction on Haiku keeps costs negligible — roughly $0.00025 per turn versus $0.003+ on Sonnet. Over 10,000 turns that’s the difference between $2.50 and $30 just for memory updates. The quality is fine for structured JSON extraction; you don’t need Sonnet for this task. This mirrors the kind of model-tier reasoning covered in the Haiku vs mini cost comparison.
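The cost comparison is simple enough to sanity-check in a couple of lines (the per-turn figures are the article’s estimates, not official pricing):

```python
# Back-of-envelope check of the extraction-cost numbers above
haiku_per_turn = 0.00025   # approx. $ per extraction on Haiku (article's estimate)
sonnet_per_turn = 0.003    # approx. $ per extraction on Sonnet (article's estimate)
turns = 10_000

haiku_total = round(haiku_per_turn * turns, 2)
sonnet_total = round(sonnet_per_turn * turns, 2)
print(haiku_total, sonnet_total)  # 2.5 30.0
```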

Step 6: Wire it into a stateful conversation loop

# agent.py (complete main loop)
import anthropic
import json
from session_manager import load_session, save_session
from memory import UserMemory

client = anthropic.Anthropic()

# Count a session once per process run, not once per message —
# otherwise session_count is really a turn counter
_counted_sessions: set = set()

def chat(user_id: str, user_message: str) -> str:
    memory = load_session(user_id)
    if user_id not in _counted_sessions:
        memory.session_count += 1
        _counted_sessions.add(user_id)

    system_prompt = build_system_prompt(memory)

    # Build message list: recent history + current message
    messages = list(memory.recent_messages)  # copy
    messages.append({"role": "user", "content": user_message})

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=system_prompt,
        messages=messages
    )

    assistant_reply = response.content[0].text

    # Update rolling message window
    memory.add_message("user", user_message)
    memory.add_message("assistant", assistant_reply)

    # Extract and persist memory updates
    memory = extract_memory_updates(memory, user_message, assistant_reply)
    save_session(memory)

    return assistant_reply


if __name__ == "__main__":
    user_id = "user_demo_001"
    print("Chat with persistent memory. Type 'quit' to exit.\n")
    while True:
        msg = input("You: ").strip()
        if msg.lower() == "quit":
            break
        reply = chat(user_id, msg)
        print(f"Agent: {reply}\n")

Run it twice with the same user_id — kill the process between runs — and the agent will remember what was discussed. The session file in ./sessions/user_demo_001.json is the entire state store.

Common errors

1. JSON extraction fails with a parse error

Haiku occasionally wraps the JSON in markdown fences or adds a preamble sentence. The ``` strip above handles the common case, but you’ll hit edge cases. Add a fallback:

import re

def safe_parse_json(raw: str) -> dict:
    # Try to find a JSON object even if surrounded by text
    match = re.search(r'\{.*\}', raw, re.DOTALL)
    if match:
        return json.loads(match.group())
    raise ValueError(f"No JSON found in: {raw[:200]}")
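A quick check that the fallback recovers JSON from the messiest common case — a preamble sentence plus markdown fences around the object:

```python
# Demo of the regex fallback above against a fenced-and-prefixed response
import json
import re

def safe_parse_json(raw: str) -> dict:
    # Greedy DOTALL match: first "{" through last "}" in the text
    match = re.search(r'\{.*\}', raw, re.DOTALL)
    if match:
        return json.loads(match.group())
    raise ValueError(f"No JSON found in: {raw[:200]}")

fenced = 'Sure! Here is the JSON:\n```json\n{"name": "Ada", "new_facts": []}\n```'
result = safe_parse_json(fenced)
print(result)  # {'name': 'Ada', 'new_facts': []}
```

Note the greedy match assumes one JSON object per response — which holds for this extraction prompt, since it asks for exactly one.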

For production-grade error handling patterns, see this guide on graceful degradation for Claude agents — the retry and fallback patterns there apply directly.

2. Memory grows stale or contradictory

If a user corrects themselves (“actually I prefer Python, not Ruby”), the old fact stays in key_facts. Quick fix: add a step to the extraction prompt asking Claude to flag and remove outdated facts. Or periodically run a full memory consolidation pass that rewrites key_facts from the full summary. Either works.

3. Token budget exceeded on long-running sessions

After ~50+ sessions, even a trimmed memory object can push your system prompt to 2K+ tokens. Set a hard cap: if len(system_prompt) > 3000 characters, truncate key_facts to the 10 most recent and rebuild. You can also run periodic summarization — collapse older summaries into a single paragraph — which is effectively RAG without the vector store.
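The hard-cap guard can be sketched in a few lines (enforce_prompt_budget and the rebuild callable are illustrative names, not part of the tutorial’s code):

```python
# Sketch of the hard-cap guard: if the rendered prompt is too long,
# rebuild it from only the 10 most recent facts
def enforce_prompt_budget(prompt: str, key_facts: list, rebuild, max_chars: int = 3000) -> str:
    if len(prompt) > max_chars:
        return rebuild(key_facts[-10:])  # keep the 10 most recent facts
    return prompt

facts = [f"fact {i}" for i in range(20)]
long_prompt = "x" * 5000  # simulate a bloated system prompt
trimmed = enforce_prompt_budget(long_prompt, facts, lambda fs: "\n".join(fs))
print(trimmed.splitlines()[0])  # fact 10
```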

What to build next

Add a user preference learning loop. After every 5 turns, run a second extraction pass that asks Claude to infer implicit preferences from the conversation pattern — preferred response length, technical depth, topic interests. Store these in memory.preferences and inject them as explicit instructions in the system prompt (“User prefers concise answers under 150 words”). This turns a stateful agent into one that genuinely adapts its style over time, not just its knowledge. Pair it with role prompting techniques and you have a surprisingly personalized assistant with zero UI changes.

Bottom line: when to use this approach

Use the file-based pattern if: you’re building a single-tenant tool, personal assistant, or internal automation where you control the deployment environment. It’s production-ready for low-to-medium concurrency (under ~100 concurrent users) and requires zero infrastructure beyond a shared filesystem or object storage.

Upgrade to a proper database when: you hit concurrency issues (file writes collide), need cross-device sync, or need querying capabilities (e.g., “find all users who mentioned X”). The memory schema here is trivially portable to Postgres, DynamoDB, or Firestore — the interface doesn’t change, just the persistence layer.

For solo founders and small teams: ship the file version first. You’ll learn what memory fields actually matter before over-engineering. Most agents in production that I’ve seen need far less memory than you’d expect — a 3-sentence summary and 10 facts gets you 80% of the personalization value.

The core principle behind solid persistent memory for Claude agents isn’t storage technology — it’s disciplined compression. Keep only what changes behaviour, inject it cleanly, and let Claude do the rest.

Frequently Asked Questions

How much does running persistent memory extraction add to my API costs?

Using Claude Haiku for the extraction step costs roughly $0.00025 per turn at current pricing (input + output tokens for the extraction prompt). On Sonnet that jumps to ~$0.003 per turn. For 1,000 daily active users with 10 turns each, Haiku extraction adds about $2.50/day — essentially negligible. Always use a smaller model for structured extraction tasks like this.

Can I use this pattern with streaming responses?

Yes, but you need to buffer the full assistant response before running memory extraction, since you need the complete text. Stream the response to the user normally, accumulate it client-side or server-side, then fire the extraction call asynchronously after delivery. This adds ~200-400ms latency to the memory update but doesn’t affect the user-facing response time.
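The buffer-then-extract pattern looks like this, with a stand-in generator in place of the SDK’s streaming iterator:

```python
# Buffer-then-extract pattern for streaming. fake_stream stands in for the
# text chunks you'd get from the Anthropic SDK's streaming interface.
def fake_stream():
    yield from ["Hel", "lo, ", "world"]

chunks = []
for text in fake_stream():
    chunks.append(text)        # forward each chunk to the user immediately
full_reply = "".join(chunks)   # complete text, now ready for memory extraction
print(full_reply)              # Hello, world
```

In the real version you’d fire extract_memory_updates(memory, user_message, full_reply) after the loop, ideally in a background task so the user never waits on it.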

What happens if two concurrent requests try to update the same session file?

You’ll get a race condition — one write will overwrite the other. For the file-based approach, use file locking (fcntl.flock on Linux/Mac) or serialise writes through a queue. For anything beyond 10-20 concurrent users on the same session, this is the signal to move to a proper key-value store with atomic updates.
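A minimal flock-based write guard, assuming a POSIX filesystem (save_locked is an illustrative name; on Windows you’d need msvcrt or a third-party lock library instead):

```python
# Exclusive file lock so concurrent writers serialise instead of clobbering
# each other. Linux/macOS only (fcntl is POSIX).
import fcntl
import json
import tempfile
from pathlib import Path

def save_locked(path: Path, data: dict) -> None:
    with open(path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until we hold the lock
        f.write(json.dumps(data))
        fcntl.flock(f, fcntl.LOCK_UN)

tmp = Path(tempfile.mkdtemp()) / "user_demo.json"
save_locked(tmp, {"user_id": "demo", "session_count": 2})
print(json.loads(tmp.read_text())["session_count"])  # 2
```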

How do I handle users who don’t want their data stored?

Pass a persist=False flag to your chat() function that skips the save_session() call and uses an empty UserMemory object. The conversation is still stateful within a single session — you just don’t write to disk. This is also the right pattern for GDPR deletion requests: delete the session file, and all memory is gone.

How is this different from just passing the full conversation history to Claude?

Passing full history works for short sessions but breaks down fast: a 50-turn conversation can easily hit 20K+ tokens, costing $0.06+ per request on Sonnet and adding latency. The compression approach here caps injected context at roughly 500-1,000 tokens regardless of session length, making cost and latency predictable at scale while preserving the facts that actually matter.

Put this into practice

Try the Database Admin agent — ready to use, no setup required.

Browse Agents →

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

