Every Claude agent you build is amnesiac by default. Each API call starts with a blank slate — no memory of the user who’s been talking to it for three weeks, no recall of preferences set two sessions ago, no continuity between runs. For simple one-shot tasks that’s fine. For anything resembling a real assistant, it’s a serious problem. Agent memory management is the difference between a toy demo and a tool people actually use.
The good news: you don’t need a vector database, a managed memory service, or a Redis cluster to fix this. For a surprising range of production workloads, flat files plus a few context-window tricks get you 80% of the way there with almost zero infrastructure. Here’s how to implement it properly.
## Why Stateless Agents Break Real Workflows
Claude’s context window is enormous — up to 200k tokens for Claude 3.x models. It’s tempting to just stuff everything in there and call it memory. That approach fails in three predictable ways:
- Cost: Sending 50k tokens of history on every call at Sonnet pricing (~$3/MTok input) is about $0.15 per call. A busy agent making 100 calls/day spends $15/day on history alone.
- Latency: Larger prompts mean slower responses; time-to-first-token grows noticeably at high token counts.
- Drift: Claude’s attention isn’t uniformly distributed across 200k tokens. Information buried in the middle of a massive context gets attended to less reliably than information near the start or end.
The goal of agent memory management isn’t to send everything — it’s to send the right things. That requires a storage layer, even a simple one.
## The File-Based Memory Pattern
For agents running on a single machine (a local automation, a small n8n instance, a personal assistant), the filesystem is a perfectly valid memory store. The pattern is straightforward: read relevant memory at the start of each invocation, update it at the end.
### Basic Implementation
```python
import json
import os
import re
from datetime import datetime, timezone

from anthropic import Anthropic

client = Anthropic()
MEMORY_DIR = "./agent_memory"

def load_user_memory(user_id: str) -> dict:
    """Load persistent memory for a specific user."""
    path = f"{MEMORY_DIR}/{user_id}.json"
    if not os.path.exists(path):
        return {
            "user_id": user_id,
            "preferences": {},
            "facts": [],
            "conversation_summary": "",
            "last_seen": None,
            "session_count": 0,
        }
    with open(path, "r") as f:
        return json.load(f)

def save_user_memory(user_id: str, memory: dict) -> None:
    """Persist updated memory to disk."""
    os.makedirs(MEMORY_DIR, exist_ok=True)
    memory["last_seen"] = datetime.now(timezone.utc).isoformat()
    memory["session_count"] = memory.get("session_count", 0) + 1
    with open(f"{MEMORY_DIR}/{user_id}.json", "w") as f:
        json.dump(memory, f, indent=2)

def build_system_prompt(memory: dict) -> str:
    """Inject memory into the system prompt."""
    memory_block = ""
    if memory["conversation_summary"]:
        memory_block += f"\n\nPrevious context:\n{memory['conversation_summary']}"
    if memory["facts"]:
        facts_str = "\n".join(f"- {f}" for f in memory["facts"][-10:])  # last 10 facts
        memory_block += f"\n\nKnown facts about this user:\n{facts_str}"
    if memory["preferences"]:
        prefs_str = json.dumps(memory["preferences"], indent=2)
        memory_block += f"\n\nUser preferences:\n{prefs_str}"
    return f"""You are a helpful assistant with memory of past interactions.{memory_block}

At the end of each response, output a JSON block wrapped in <memory_update> tags.
This block should contain:
- "new_facts": list of important facts to remember (names, goals, constraints)
- "preferences": dict of any preferences mentioned
- "summary_update": one sentence summary of what was discussed this session"""

def extract_memory_update(response_text: str) -> dict | None:
    """Parse the structured memory update from Claude's response."""
    match = re.search(r"<memory_update>(.*?)</memory_update>", response_text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return None

def clean_response(response_text: str) -> str:
    """Strip the memory block before showing the response to the user."""
    return re.sub(r"<memory_update>.*?</memory_update>", "", response_text, flags=re.DOTALL).strip()
```
The key design decision here: Claude writes its own memory updates. You’re not trying to parse the conversation externally — you’re asking Claude to identify what’s worth remembering as part of generating its response. This is more reliable than regex-scraping conversations, and it costs almost nothing extra in tokens.
### Running a Memory-Aware Session
```python
def run_agent_session(user_id: str, user_message: str) -> str:
    """Single invocation of the memory-aware agent."""
    memory = load_user_memory(user_id)
    system_prompt = build_system_prompt(memory)

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}],
    )
    full_response = response.content[0].text

    update = extract_memory_update(full_response)
    if update:
        # Merge new facts (deduplicate naively)
        existing_facts = set(memory["facts"])
        for fact in update.get("new_facts", []):
            existing_facts.add(fact)
        memory["facts"] = list(existing_facts)
        # Merge preferences
        memory["preferences"].update(update.get("preferences", {}))
        # Update rolling summary
        if update.get("summary_update"):
            memory["conversation_summary"] = update["summary_update"]

    save_user_memory(user_id, memory)
    return clean_response(full_response)

# Usage
reply = run_agent_session("user_abc123", "I prefer responses in bullet points, by the way.")
print(reply)

# Later invocation — agent remembers the preference
reply2 = run_agent_session("user_abc123", "What are the main features of async Python?")
print(reply2)  # Will format as bullets without being asked again
```
At Haiku pricing (~$0.80/MTok input, $4/MTok output at the time of writing), a memory-aware session with a 500-token system prompt, a 200-token user message, and a few hundred output tokens costs roughly $0.002 per call. Even at 1,000 calls/day that’s about $2. This is not a cost problem.
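To make that arithmetic explicit, here’s a tiny helper (hypothetical, with rates hardcoded from the figures above — verify current pricing before relying on it):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  in_rate: float = 0.80, out_rate: float = 4.00) -> float:
    """Per-call cost in dollars. Rates are $ per million tokens
    (Haiku list prices at time of writing -- check the vendor's pricing page)."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# ~700 input tokens (system prompt + message) and ~300 output tokens:
print(f"${estimate_cost(700, 300):.4f}")  # → $0.0018
```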
## Handling Long Conversation History
The single-summary approach works until users have long, complex histories. When a user has 50 sessions with your agent, a one-sentence summary loses too much. You need a rolling compression strategy.
### Tiered Memory Architecture
Borrow from how humans actually remember things: recent events in detail, older events compressed into narratives, core facts stored permanently.
```python
def build_tiered_memory(memory: dict, recent_messages: list) -> str:
    """
    Three tiers:
    - Core facts: always included (names, goals, hard constraints)
    - Rolling summary: compressed narrative of past sessions
    - Recent messages: last N turns verbatim
    """
    sections = []

    # Tier 1: Core facts (always include, these are cheap)
    if memory.get("facts"):
        core = "\n".join(f"- {f}" for f in memory["facts"][:15])
        sections.append(f"Core facts about this user:\n{core}")

    # Tier 2: Compressed history (keep this short)
    if memory.get("conversation_summary"):
        sections.append(f"History summary:\n{memory['conversation_summary']}")

    # Tier 3: Recent turns (last 3 exchanges verbatim)
    if recent_messages:
        recent_str = "\n".join(
            f"{m['role'].capitalize()}: {m['content'][:200]}..."  # truncate long turns
            if len(m["content"]) > 200
            else f"{m['role'].capitalize()}: {m['content']}"
            for m in recent_messages[-6:]  # 3 user + 3 assistant
        )
        sections.append(f"Recent conversation:\n{recent_str}")

    return "\n\n".join(sections)
```
Store recent messages in a separate list in the JSON file. After every 5 sessions, run a compression pass that asks Claude to summarize the recent messages into the rolling summary, then clear the recent messages list. This keeps your injected context under ~800 tokens for almost any use case.
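That compression pass might look like the following sketch. `compress_history`, the 5-session cadence, and the `recent_messages` key are illustrative choices, not part of any SDK; `client` is the `Anthropic` client instance:

```python
def compress_history(memory: dict, client, model: str = "claude-3-5-haiku-20241022") -> dict:
    """Every 5th session, fold recent verbatim messages into the rolling summary."""
    recent = memory.get("recent_messages", [])
    if not recent or memory.get("session_count", 0) % 5 != 0:
        return memory  # nothing to compress yet

    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in recent)
    response = client.messages.create(
        model=model,
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                "Merge the existing summary and the recent transcript into one "
                "paragraph under 100 words. Keep names, goals, and constraints.\n\n"
                f"Existing summary:\n{memory.get('conversation_summary', '')}\n\n"
                f"Recent transcript:\n{transcript}"
            ),
        }],
    )
    memory["conversation_summary"] = response.content[0].text.strip()
    memory["recent_messages"] = []  # clear the verbatim tier after compression
    return memory
```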
## When Files Aren’t Enough: Hybrid Patterns
File-based storage hits limits when you have multiple concurrent users, need to query across users, or run on serverless infrastructure where the filesystem isn’t persistent. You have a few good options before reaching for a full vector database.
### SQLite as the Next Step Up
SQLite runs in-process, requires zero infrastructure, persists across restarts, and handles concurrent reads fine (writes serialize automatically). For workloads under ~10k users it typically beats hosted solutions on latency because there’s no network round-trip.
```python
import json
import sqlite3

def init_db(db_path: str = "agent_memory.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS user_memory (
            user_id TEXT PRIMARY KEY,
            facts TEXT DEFAULT '[]',        -- JSON array
            preferences TEXT DEFAULT '{}',  -- JSON object
            summary TEXT DEFAULT '',
            session_count INTEGER DEFAULT 0,
            last_seen TEXT
        )
    """)
    conn.commit()
    return conn

def load_memory_sql(conn, user_id: str) -> dict:
    row = conn.execute(
        "SELECT facts, preferences, summary, session_count FROM user_memory WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    if not row:
        return {"facts": [], "preferences": {}, "conversation_summary": "", "session_count": 0}
    return {
        "facts": json.loads(row[0]),
        "preferences": json.loads(row[1]),
        "conversation_summary": row[2],
        "session_count": row[3],
    }
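The matching save isn’t shown above; a sketch using SQLite’s UPSERT syntax (`ON CONFLICT`, available since SQLite 3.24) could look like this — `save_memory_sql` is an assumed name following the same conventions:

```python
import json
import sqlite3
from datetime import datetime, timezone

def save_memory_sql(conn, user_id: str, memory: dict) -> None:
    """Upsert one user's memory row; JSON fields are serialized to TEXT columns."""
    conn.execute(
        """
        INSERT INTO user_memory (user_id, facts, preferences, summary, session_count, last_seen)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT(user_id) DO UPDATE SET
            facts = excluded.facts,
            preferences = excluded.preferences,
            summary = excluded.summary,
            session_count = excluded.session_count,
            last_seen = excluded.last_seen
        """,
        (
            user_id,
            json.dumps(memory["facts"]),
            json.dumps(memory["preferences"]),
            memory["conversation_summary"],
            memory.get("session_count", 0) + 1,  # increment on every save
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()
```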
For n8n and Make workflows, you can write to a SQLite file if you’re running self-hosted. On cloud-based automation platforms, use their built-in data stores — n8n’s workflow static data or Make’s Data Stores. They’re essentially managed key-value stores that serve the same purpose.
### When to Actually Use a Vector Database
Use a vector database when you need semantic search across memory — when “find everything this user said about their budget” matters more than “show me the last 10 facts.” For most assistants and automation agents, you don’t need this. The tiered file approach with good tagging handles the common cases. Add vector search when you have 1000+ facts per user and need fuzzy retrieval, or when you’re building something explicitly designed around episodic memory recall.
## Practical Failure Modes to Know About
This architecture has real failure modes worth understanding before you ship:
- Claude sometimes skips the memory update block, especially on short or ambiguous turns. Treat the memory update as optional and never assume it’s present. Add a fallback that logs when no update is found.
- Facts accumulate garbage. If a user corrects themselves (“actually I’m in London, not Paris”), you’ll have both facts. Periodically run a deduplication/reconciliation pass using Claude itself: “Here are the facts we have. Remove duplicates and contradictions.”
- The JSON in memory update blocks gets malformed. Claude occasionally produces slightly invalid JSON (trailing commas, unescaped quotes). Use a lenient parser or wrap extraction in try/except with a fallback to store the raw text for manual review.
- File writes aren’t atomic. If your agent crashes mid-write, you corrupt the memory file. Fix this with atomic writes: write to a temp file, then `os.replace()` it over the target. `os.replace()` is atomic on Linux/macOS.
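A minimal atomic-write helper, assuming the same JSON-per-user layout as `save_user_memory` above (`atomic_save` is an illustrative name):

```python
import json
import os
import tempfile

def atomic_save(path: str, memory: dict) -> None:
    """Write to a temp file in the target directory, then atomically swap it in."""
    dir_name = os.path.dirname(path) or "."
    os.makedirs(dir_name, exist_ok=True)
    # Temp file must live on the same filesystem as the target for the rename to be atomic
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(memory, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)  # don't leave orphaned temp files behind
        raise
```

A crash before `os.replace()` leaves the old file untouched; a crash after it leaves the new file complete. There is no window where the target is half-written.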
## Integrating With n8n and Make
If you’re building agent memory management into no-code/low-code workflows, the same principles apply with different primitives. In n8n, use the Read/Write Files from Disk node for flat-file storage, or a Code node with the SQLite npm package for database storage. Store the memory JSON in workflow static data or pass it between nodes as a JSON object.
In Make, Data Stores handle key-value persistence natively. Use the user ID as the record key, store the memory object as the value. Read it at the start of your scenario, pass it into your Claude HTTP request as part of the system prompt, parse the memory update out of the response, and write it back. The whole pattern maps cleanly to Make’s module chain.
## Choosing the Right Approach for Your Situation
Solo founder or small project with under 500 users: Start with the flat file pattern. It runs locally, costs nothing, and you can migrate to SQLite in an afternoon when you need to. Don’t over-engineer early.
Production SaaS with a real user base: Use SQLite if you’re on a single server, Postgres if you’re already running it. The memory JSON schema above maps directly to a JSONB column in Postgres. You get querying, backups, and concurrent access for free.
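For illustration, a hypothetical Postgres equivalent of the SQLite table (schema and the `format` preference key are illustrative, not from a real deployment):

```sql
-- The whole memory object lives in one JSONB column
CREATE TABLE IF NOT EXISTS user_memory (
    user_id   TEXT PRIMARY KEY,
    memory    JSONB NOT NULL DEFAULT '{}',
    last_seen TIMESTAMPTZ
);

-- JSONB gives you querying across users, e.g.:
-- SELECT user_id FROM user_memory WHERE memory->'preferences'->>'format' = 'bullets';
```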
Serverless or edge functions: You need external storage from day one. Upstash Redis ($0 free tier, ~$0.2/100k commands) is the lowest-friction option. Use it as a key-value store with the same JSON schema.
n8n/Make automation users: Use native data stores for simplicity. If you’re on n8n self-hosted and want something more robust, the SQLite approach with a Code node is worth the small setup cost.
The underlying principle of agent memory management doesn’t change across any of these: serialize context to a persistent store, inject only what’s relevant, let Claude maintain its own memory updates. The storage backend is just a deployment detail. Get the pattern right first, then pick the infrastructure that fits where you’re running.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

