Most Claude agent tutorials stop at the API call. You send a message, you get a response, the conversation ends. Run it again tomorrow and your agent has no idea who you are or what you discussed. For anything beyond a demo, that’s a non-starter. Claude agent memory implementation is one of those problems that sounds simple until you’re actually building it — and then you realise there are five different ways to do it and four of them are overkill for what you need.
This article shows you how to build agents that remember context across sessions using nothing but local files and a bit of state management discipline. No Redis, no vector database, no hosted memory service. Just a pattern that works in production and can be extended when you actually need more infrastructure.
Why Most Agents Are Stateless (And Why That’s a Problem)
Claude’s API is stateless by design. Every call to /v1/messages is independent — there’s no session object sitting on Anthropic’s servers keeping track of what your user said last Tuesday. That’s actually good for reliability and scaling, but it means persistence is entirely your responsibility.
The naive fix is to dump the entire conversation history into every API call. That works for short sessions. It fails when you hit the context window limit, when your token costs balloon, or when your agent needs to remember something from six weeks ago that isn’t in the last 20 messages. You need a smarter approach.
The file-based approach I’m showing here solves for a specific tier of use case: agents running on a single machine or small server, where you control the filesystem, and where simplicity of deployment matters more than horizontal scaling. Solo founders, small teams, internal tools — this is the sweet spot.
The Architecture: Two Files, One Pattern
The core idea is to split memory into two layers:
- Short-term memory: The last N messages in the current session, stored as a rolling conversation buffer
- Long-term memory: A structured summary of key facts, preferences, and outcomes from past sessions, stored separately
Both are JSON files. The agent reads them at startup, uses them to construct its system prompt and conversation context, and writes updates back at the end of each session (or at defined checkpoints for long-running agents).
Here’s the file structure:
```
agent_state/
├── conversation_buffer.json   # Last N messages (rolling window)
├── long_term_memory.json      # Extracted facts and user preferences
└── session_log.jsonl          # Append-only record of all sessions
```
The session_log.jsonl is append-only and you’ll mostly ignore it during operation — it’s your audit trail if something goes wrong or if you later want to train on real usage data.
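For illustration, one line of session_log.jsonl might look like the following (the fields beyond `timestamp` are hypothetical — they're whatever your code passes to `log_session()`):

```json
{"timestamp": "2026-02-03T18:04:11+00:00", "turns": 6, "memory_updates": 1}
```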
Building the Memory Manager
Let’s start with the class that handles reading and writing state:
```python
import json
from datetime import datetime, timezone
from pathlib import Path


class AgentMemory:
    def __init__(self, state_dir: str = "agent_state", buffer_size: int = 20):
        self.state_dir = Path(state_dir)
        self.state_dir.mkdir(exist_ok=True)
        self.buffer_size = buffer_size

        self.buffer_path = self.state_dir / "conversation_buffer.json"
        self.ltm_path = self.state_dir / "long_term_memory.json"
        self.log_path = self.state_dir / "session_log.jsonl"

        self.conversation_buffer = self._load_json(self.buffer_path, default=[])
        self.long_term_memory = self._load_json(self.ltm_path, default={
            "user_preferences": [],
            "key_facts": [],
            "past_outcomes": [],
            "last_updated": None
        })

    def _load_json(self, path: Path, default):
        if path.exists():
            with open(path, "r") as f:
                return json.load(f)
        return default

    def add_message(self, role: str, content: str):
        """Add a message to the rolling buffer."""
        self.conversation_buffer.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now(timezone.utc).isoformat()
        })
        # Keep only the last N messages to control token usage
        if len(self.conversation_buffer) > self.buffer_size:
            self.conversation_buffer = self.conversation_buffer[-self.buffer_size:]

    def get_messages_for_api(self) -> list:
        """Return messages in the format Claude's API expects (no timestamps)."""
        return [
            {"role": m["role"], "content": m["content"]}
            for m in self.conversation_buffer
        ]

    def update_long_term_memory(self, facts: list = None,
                                preferences: list = None,
                                outcomes: list = None):
        """Call this after sessions to store important extracted information."""
        if facts:
            self.long_term_memory["key_facts"].extend(facts)
            # Deduplicate roughly (order-preserving) — good enough for most use cases
            self.long_term_memory["key_facts"] = list(
                dict.fromkeys(self.long_term_memory["key_facts"])
            )[-50:]  # Cap at 50 facts
        if preferences:
            self.long_term_memory["user_preferences"].extend(preferences)
            self.long_term_memory["user_preferences"] = list(
                dict.fromkeys(self.long_term_memory["user_preferences"])
            )[-30:]
        if outcomes:
            self.long_term_memory["past_outcomes"].extend(outcomes)
            self.long_term_memory["past_outcomes"] = \
                self.long_term_memory["past_outcomes"][-20:]  # Most recent 20
        self.long_term_memory["last_updated"] = datetime.now(timezone.utc).isoformat()

    def save(self):
        """Persist all state to disk."""
        with open(self.buffer_path, "w") as f:
            json.dump(self.conversation_buffer, f, indent=2)
        with open(self.ltm_path, "w") as f:
            json.dump(self.long_term_memory, f, indent=2)

    def log_session(self, session_data: dict):
        """Append a session record to the audit log."""
        with open(self.log_path, "a") as f:
            f.write(json.dumps({
                "timestamp": datetime.now(timezone.utc).isoformat(),
                **session_data
            }) + "\n")

    def build_system_prompt_context(self) -> str:
        """Generate the memory section for the system prompt."""
        ltm = self.long_term_memory
        sections = []
        if ltm["key_facts"]:
            facts_str = "\n".join(f"- {f}" for f in ltm["key_facts"])
            sections.append(f"Known facts about the user:\n{facts_str}")
        if ltm["user_preferences"]:
            prefs_str = "\n".join(f"- {p}" for p in ltm["user_preferences"])
            sections.append(f"User preferences:\n{prefs_str}")
        if ltm["past_outcomes"]:
            outcomes_str = "\n".join(f"- {o}" for o in ltm["past_outcomes"][-5:])
            sections.append(f"Recent relevant outcomes:\n{outcomes_str}")
        return "\n\n".join(sections) if sections else "No long-term memory yet."
```
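The two bookkeeping idioms in this class (the rolling window in `add_message` and the order-preserving dedup-and-cap in `update_long_term_memory`) are worth sanity-checking in isolation. A standalone sketch of the same slicing logic, with no class or disk involved:

```python
# 1. Rolling buffer: keep only the last N entries, same as add_message().
buffer_size = 3
buffer = []
for i in range(5):
    buffer.append(f"msg {i}")
    if len(buffer) > buffer_size:
        buffer = buffer[-buffer_size:]
print(buffer)  # → ['msg 2', 'msg 3', 'msg 4']

# 2. Order-preserving dedup + cap, same as update_long_term_memory().
#    dict.fromkeys keeps the FIRST occurrence of each item, in order.
facts = ["likes Go", "works remotely", "likes Go", "has two kids"]
print(list(dict.fromkeys(facts))[-50:])  # → ['likes Go', 'works remotely', 'has two kids']
```

`dict.fromkeys` is the standard trick for deduplicating while preserving insertion order (sets don't guarantee order), which matters here because the `[-50:]` cap should drop the oldest facts, not arbitrary ones.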
Wiring Memory Into Your Claude Agent
Now here’s how you use this in an actual agent loop. The critical part is injecting long-term memory into the system prompt, not into the message history — that keeps it always-present and doesn’t waste context window space on old conversation turns:
```python
import json
import re

import anthropic


def run_agent_session(user_input: str, memory: AgentMemory) -> str:
    client = anthropic.Anthropic()  # Uses ANTHROPIC_API_KEY env var

    # Build a system prompt that includes long-term memory context
    system_prompt = f"""You are a helpful assistant with persistent memory.

## What you remember about this user:
{memory.build_system_prompt_context()}

When you learn something important about the user — a preference, a key fact,
or the outcome of an action — make note of it. At the end of your response,
if you've learned something worth remembering, add a JSON block like:

<memory_update>
{{"facts": ["user works at a fintech startup"],
  "preferences": ["prefers concise responses"],
  "outcomes": []}}
</memory_update>

Only include this block when there's genuinely something new to remember.
"""

    # Add the new user message to the buffer
    memory.add_message("user", user_input)

    response = client.messages.create(
        model="claude-haiku-4-5",  # Cheap enough for frequent stateful calls
        max_tokens=1024,
        system=system_prompt,
        messages=memory.get_messages_for_api()
    )
    assistant_message = response.content[0].text

    # Parse out any memory updates Claude flagged
    if "<memory_update>" in assistant_message:
        match = re.search(r'<memory_update>(.*?)</memory_update>',
                          assistant_message, re.DOTALL)
        if match:
            try:
                update_data = json.loads(match.group(1).strip())
                memory.update_long_term_memory(
                    facts=update_data.get("facts", []),
                    preferences=update_data.get("preferences", []),
                    outcomes=update_data.get("outcomes", [])
                )
                # Strip the memory block from the visible response
                assistant_message = re.sub(
                    r'\s*<memory_update>.*?</memory_update>', '',
                    assistant_message, flags=re.DOTALL
                ).strip()
            except json.JSONDecodeError:
                pass  # Malformed block — skip it, don't crash

    memory.add_message("assistant", assistant_message)
    memory.save()  # Persist after every turn for safety
    return assistant_message
```
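The tag-parsing step is the part that's easiest to get subtly wrong, so here it is isolated as a standalone function you can unit-test without an API call (same regexes as in `run_agent_session`; the function name is mine, not part of the pattern above):

```python
import json
import re


def parse_memory_update(text: str):
    """Split an assistant reply into (visible_text, update_dict_or_None)."""
    match = re.search(r'<memory_update>(.*?)</memory_update>', text, re.DOTALL)
    if not match:
        return text, None
    try:
        update = json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        update = None  # Malformed block: drop it rather than crash
    # Strip the block (and surrounding whitespace) from the visible response
    visible = re.sub(r'\s*<memory_update>.*?</memory_update>', '',
                     text, flags=re.DOTALL).strip()
    return visible, update


reply = ('Noted!\n<memory_update>\n'
         '{"facts": ["uses Vim"], "preferences": [], "outcomes": []}\n'
         '</memory_update>')
visible, update = parse_memory_update(reply)
print(visible)          # → Noted!
print(update["facts"])  # → ['uses Vim']
```

Keeping this logic in its own function also makes it trivial to feed it a corpus of real assistant replies later and measure how often Claude emits malformed blocks.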
Why I Use Claude Haiku Here
For the stateful loop itself — where you’re making frequent API calls and most turns don’t require deep reasoning — Haiku is the right call. At $1 per million input tokens and $5 per million output tokens (Haiku 4.5 pricing at the time of writing), a typical exchange of a few hundred tokens each way costs a fraction of a cent, and a thousand turns runs you a few dollars. Use Sonnet or Opus for the memory extraction pass if you want higher accuracy on what’s worth remembering.
The Memory Extraction Problem
Asking Claude to self-report memory updates inline works surprisingly well, but it has two failure modes you should know about:
Over-reporting: Claude sometimes tags things as memory-worthy when they’re trivial. The user mentioned they like coffee once — does that go in long-term memory? Set your system prompt expectations clearly (“only store facts that would meaningfully change how you respond to this user in future sessions”).
Under-reporting: Claude misses something important because the conversation didn’t signal it explicitly. For high-stakes agents, run a separate extraction pass at session end using a targeted prompt:
```python
def extract_session_memories(conversation: list, client) -> dict:
    """Run a dedicated extraction pass at end of session."""
    conversation_text = "\n".join(
        f"{m['role'].upper()}: {m['content']}"
        for m in conversation[-10:]  # Last 10 messages
    )
    extraction_prompt = f"""Review this conversation and extract anything worth
remembering for future sessions. Be selective — only things that would genuinely
change how an assistant should respond to this person.

Conversation:
{conversation_text}

Return JSON only:
{{"facts": [], "preferences": [], "outcomes": []}}"""

    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": extraction_prompt}]
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {"facts": [], "preferences": [], "outcomes": []}
```
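One practical wrinkle: models sometimes wrap the JSON in a markdown fence even when told to return JSON only, which makes the bare `json.loads` above fail. An optional, more tolerant fallback parser (my own hardening suggestion, not part of the pattern above) might look like:

```python
import json
import re


def parse_json_lenient(text: str) -> dict:
    """Try plain JSON first, then fall back to the first {...} span in the text."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Grab everything from the first '{' to the last '}' (survives ``` fences)
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # Give up gracefully with an empty update
    return {"facts": [], "preferences": [], "outcomes": []}


fenced = '```json\n{"facts": ["x"], "preferences": [], "outcomes": []}\n```'
print(parse_json_lenient(fenced)["facts"])  # → ['x']
```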
What Breaks in Production
I want to be direct about the failure modes before you build on this:
- Concurrent access: If two agent processes write to the same state files simultaneously, you’ll corrupt your JSON. Use file locking (`fcntl` on Unix, or the `filelock` library cross-platform) if you have any concurrent execution.
- Memory drift: Over time, the long-term memory fills with contradictory or stale facts. Add a periodic cleanup pass that asks Claude to consolidate and deduplicate the memory store.
- Context injection cost: If your long-term memory grows to 200 facts, you’re adding ~1,000 tokens to every API call. Cap your lists aggressively and summarise rather than append.
- No encryption: These files are plaintext. If you’re storing anything sensitive, encrypt at rest before shipping this pattern anywhere user data lives.
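For the concurrent-access point specifically, here is a minimal stdlib-only sketch using `fcntl` (Unix only; the `filelock` package is the cross-platform equivalent). The lock is taken on a separate sidecar file so the data file is never truncated before the lock is held:

```python
import fcntl
import json
import os
import tempfile


def save_json_locked(path: str, data) -> None:
    """Write JSON under an exclusive advisory lock so concurrent writers serialize."""
    with open(path + ".lock", "w") as lock_f:
        fcntl.flock(lock_f.fileno(), fcntl.LOCK_EX)  # Blocks until other writers release
        with open(path, "w") as f:
            json.dump(data, f)
        # Lock is released automatically when lock_f closes


path = os.path.join(tempfile.mkdtemp(), "state.json")
save_json_locked(path, {"key_facts": ["example"]})
with open(path) as f:
    print(json.load(f)["key_facts"])  # → ['example']
```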
When to Graduate to a Real Database
File-based state is genuinely sufficient until it isn’t. Here’s my honest take on when to switch:
- Multiple users: As soon as you have more than one user, separate their state directories. Once you have hundreds of users, SQLite (still no server required) beats per-user JSON files.
- Semantic search over memories: When “find relevant memories” means more than a string match, you want a vector store. Chroma or LanceDB are lightweight enough to self-host.
- Multi-machine deployment: The moment your agent runs on more than one server, you need a shared store. Redis is the obvious call here — low latency, simple data structures, easy to set up.
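To make the SQLite option concrete, here is one hypothetical minimal schema (my sketch, not a prescribed migration): one row per memory item, with a uniqueness constraint doing the dedup that the JSON version does by hand.

```python
import sqlite3

# In-memory DB for demonstration; use a file path in practice.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE memories (
        user_id    TEXT NOT NULL,
        kind       TEXT NOT NULL,   -- 'fact' | 'preference' | 'outcome'
        content    TEXT NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        UNIQUE (user_id, kind, content)  -- Dedup at the storage layer
    )
""")

insert = "INSERT OR IGNORE INTO memories (user_id, kind, content) VALUES (?, ?, ?)"
conn.execute(insert, ("u1", "fact", "works at a fintech startup"))
conn.execute(insert, ("u1", "fact", "works at a fintech startup"))  # Duplicate: ignored

rows = conn.execute(
    "SELECT content FROM memories WHERE user_id = ? AND kind = ?", ("u1", "fact")
).fetchall()
print([r[0] for r in rows])  # → ['works at a fintech startup']
```

The per-user `user_id` column is the whole upgrade: one database file replaces one directory per user, and capping or pruning becomes a `DELETE` with an `ORDER BY created_at`.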
For a single-user personal assistant, an internal ops tool, or a prototype you’re validating — the file-based pattern above will carry you further than you think.
Bottom Line: Who Should Use This Pattern
Solo founders and indie developers building personal AI tools or single-user agents: use this pattern exactly as shown. It’s zero infrastructure, zero cost, and ships in an afternoon.
Small teams building internal tools: use this with the SQLite upgrade and file locking. Still no server required, handles dozens of users, and you can migrate to Postgres later without changing your application logic.
Anyone building a customer-facing product: use this to prototype and validate that persistent memory actually improves your product’s outcomes. Then migrate to a proper store once you know the pattern works for your use case.
The key insight behind this Claude agent memory implementation is that most agents don’t need sophisticated memory infrastructure — they need a clear separation between short-term conversation context and long-term extracted knowledge, and a disciplined write pattern that keeps both stores from growing unbounded. Get that right and you can build agents that feel genuinely continuous without touching a database.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

