Saturday, March 21

Most Claude agent tutorials show you how to build something that works exactly once. You send a message, get a response, ship it — and then the user comes back tomorrow and the agent has the memory of a goldfish. If you’ve tried to build anything beyond a one-shot chatbot, you’ve already hit the wall: Claude has no memory between conversations unless you build it. The good news: this is an architecture problem, not a model limitation, and it’s very solvable with the right stack.

This guide walks through a production-ready approach to persistent agent memory using a combination of vector storage (Pinecone or pgvector), structured session state, and a retrieval layer that feeds relevant context back into Claude without blowing your token budget. By the end, you’ll have a working pattern you can adapt for customer-facing apps, internal tools, or multi-session automation workflows.

Why Stateless Agents Break in Production

Claude’s context window is generous — 200K tokens on current Sonnet models — but dumping entire conversation histories into every request is wasteful and fragile. At roughly $3 per million input tokens for Sonnet, a session history that grows to 50K tokens costs $0.15 per request just in context. Multiply that by hundreds of daily users and you’re burning money on tokens that mostly don’t matter.

The real problem is worse than cost. Long contexts degrade response quality. Claude starts to “lose” early conversation details in very long prompts — not because of a bug, but because attention mechanisms work that way. Stuffing everything in and hoping for the best is not a memory strategy.

What you actually need is selective recall: store everything, retrieve what’s relevant, inject only what matters for the current turn. That’s the architecture we’re building.

The Three-Layer Memory Architecture

Think of persistent agent memory as three distinct stores with different retrieval patterns:

  • Episodic memory: Raw conversation turns — what the user said, what the agent replied, timestamps. This is your source of truth.
  • Semantic memory: Extracted facts and preferences — “user prefers Python over JavaScript”, “company budget is $50K”. Structured, queryable.
  • Working memory: The slice of context you inject into the current Claude request. Assembled from the other two layers.

Most tutorials only implement episodic memory (conversation history), which is why they don’t scale. The semantic layer is what makes your agent feel like it actually knows the user.

Storage Stack That Works in Production

For the episodic store, a simple Postgres table is fine. For semantic memory and fuzzy retrieval, you need vector search. I’d use pgvector if you’re already on Postgres (lower operational overhead), or Pinecone if you need managed scaling and don’t want to tune indexes. Pinecone’s serverless tier costs roughly $0.096 per million reads — negligible for most applications.

Building the Memory Store

Start with the database schema. You need two tables: one for raw turns, one for extracted memory fragments.

-- Enable pgvector (once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- Raw conversation turns (episodic memory)
CREATE TABLE conversation_turns (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  session_id UUID NOT NULL,
  user_id UUID NOT NULL,
  role TEXT NOT NULL CHECK (role IN ('user', 'assistant')),
  content TEXT NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Supports the "N most recent turns for this user" query
CREATE INDEX ON conversation_turns (user_id, created_at DESC);

-- Extracted memory fragments (semantic memory)
CREATE TABLE memory_fragments (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID NOT NULL,
  content TEXT NOT NULL,          -- human-readable fact
  embedding VECTOR(1536),         -- pgvector column
  source_turn_id UUID REFERENCES conversation_turns(id),
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON memory_fragments USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

The ivfflat index is important — without it, similarity search does a full table scan, which is fine for development but kills you at scale. Set lists to roughly sqrt(row_count) as a starting point.
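That sizing rule is simple enough to encode, which keeps index rebuilds honest as the table grows. The helper and its clamping bounds below are my own choices, not pgvector requirements:

```python
import math

def ivfflat_lists(row_count: int, floor: int = 10, ceil: int = 1000) -> int:
    """Starting-point heuristic for the IVFFlat 'lists' parameter: sqrt(row_count),
    clamped so tiny tables don't get degenerate indexes and huge ones don't
    over-partition. Tune from here based on recall measurements."""
    return max(floor, min(ceil, round(math.sqrt(max(row_count, 1)))))
```

Run `SELECT COUNT(*) FROM memory_fragments` before a rebuild, feed the count through this, and recreate the index with the result.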

Extracting Semantic Memory After Each Turn

After every assistant response, run a lightweight extraction pass with Claude Haiku (roughly $1 per million input tokens at the time of writing, about $0.002 per average extraction call) to pull out facts worth remembering:

import anthropic
import json

client = anthropic.Anthropic()

def extract_memory_fragments(user_message: str, assistant_response: str) -> list[str]:
    """
    Use Claude Haiku to extract durable facts from a conversation turn.
    Returns a list of plain-text memory fragments to store.
    """
    extraction_prompt = f"""Analyze this conversation exchange and extract any facts about the user
that would be useful to remember in future conversations. Focus on:
- Preferences and working style
- Technical context (languages, tools, stack)
- Business context (role, company, goals)
- Explicit requests or constraints they've mentioned

Return ONLY a JSON array of strings. If nothing notable is worth storing, return [].

User: {user_message}
Assistant: {assistant_response}"""

    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": extraction_prompt}]
    )

    try:
        fragments = json.loads(response.content[0].text)
        return fragments if isinstance(fragments, list) else []
    except (json.JSONDecodeError, IndexError):
        return []  # fail silently — extraction is best-effort

This runs asynchronously after each turn so it doesn’t add latency to the user-facing response. The cost is trivial; the payoff is an agent that learns without manual curation.
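One way to keep extraction off the request path is a small bounded thread pool. This is a sketch: `schedule_extraction` and its callback parameters are illustrative; pass in `extract_memory_fragments` and whatever storage function you use.

```python
from concurrent.futures import ThreadPoolExecutor

# Bounded pool so an extraction backlog can't grow unchecked under load
_extraction_pool = ThreadPoolExecutor(max_workers=4)

def schedule_extraction(user_message, assistant_response, extract_fn, store_fn):
    """Fire-and-forget: run extraction and storage after the user-facing
    response has already been sent. Returns the Future so callers can
    observe failures in tests or logs."""
    def _run():
        try:
            for fragment in extract_fn(user_message, assistant_response):
                store_fn(fragment)
        except Exception:
            pass  # best-effort, mirroring the fail-silent extraction policy
    return _extraction_pool.submit(_run)
```

In a web framework you would call this right after writing the response; for higher volumes, a real task queue (Celery, RQ, or a Postgres-backed queue) is the sturdier version of the same idea.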

Retrieving Relevant Context at Request Time

When a new message comes in, you need to assemble the working memory before calling Claude Sonnet. The retrieval step has two parts: recent turns (recency bias) plus semantically relevant fragments (relevance bias).

from openai import OpenAI  # using OpenAI embeddings; swap for your provider
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

embed_client = OpenAI()

def get_embedding(text: str) -> np.ndarray:
    response = embed_client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    # Return a numpy array so pgvector's registered adapter can serialize it
    return np.array(response.data[0].embedding)

def retrieve_working_memory(
    user_id: str,
    current_message: str,
    db_conn,
    recent_turns: int = 6,
    semantic_fragments: int = 5
) -> dict:
    """
    Assembles working memory for injection into the Claude system prompt.
    Returns dict with 'recent_history' and 'relevant_facts'.
    """
    # Teach psycopg2 to adapt numpy arrays to the vector type; idempotent,
    # and normally done once right after opening the connection
    register_vector(db_conn)
    cur = db_conn.cursor()

    # 1. Grab the N most recent turns across all sessions for this user
    cur.execute("""
        SELECT role, content FROM conversation_turns
        WHERE user_id = %s
        ORDER BY created_at DESC
        LIMIT %s
    """, (user_id, recent_turns))
    recent = cur.fetchall()[::-1]  # reverse to chronological order

    # 2. Vector search for semantically relevant memory fragments
    query_embedding = get_embedding(current_message)
    cur.execute("""
        SELECT content, 1 - (embedding <=> %s) AS similarity
        FROM memory_fragments
        WHERE user_id = %s
        ORDER BY embedding <=> %s
        LIMIT %s
    """, (query_embedding, user_id, query_embedding, semantic_fragments))
    fragments = cur.fetchall()

    return {
        "recent_history": [{"role": r, "content": c} for r, c in recent],
        "relevant_facts": [f[0] for f in fragments if f[1] > 0.75]  # similarity threshold
    }

The similarity threshold of 0.75 is a starting point — tune it based on how noisy your fragments are. Below 0.7, you get garbage recall. Above 0.85, you miss relevant context. Log the scores during development and adjust.
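A thin filtering wrapper (a hypothetical helper, not part of the code above) makes that logging a one-liner you can keep around permanently:

```python
import logging

logger = logging.getLogger("memory.retrieval")

def filter_fragments(scored: list[tuple[str, float]], threshold: float = 0.75) -> list[str]:
    """Keep fragments above the similarity threshold, logging every score
    so the cutoff can be tuned from real traffic instead of guesswork."""
    for content, score in scored:
        logger.debug("similarity=%.3f kept=%s fragment=%r",
                     score, score > threshold, content[:60])
    return [content for content, score in scored if score > threshold]
```

Leave the log level at DEBUG in production; flip it on when retrieval starts feeling off and you need the score distribution.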

Injecting Memory Into the Claude Request

def build_claude_request(
    user_message: str,
    working_memory: dict,
    base_system_prompt: str
) -> dict:
    """
    Constructs the full messages payload with injected memory context.
    """
    memory_block = ""

    if working_memory["relevant_facts"]:
        facts = "\n".join(f"- {f}" for f in working_memory["relevant_facts"])
        memory_block += f"\n\n<memory>\nWhat you know about this user:\n{facts}\n</memory>"

    system_prompt = base_system_prompt + memory_block

    # Build messages: inject recent history, then current message.
    # The Messages API requires the first message to have the 'user' role,
    # so drop a leading assistant turn if the history window starts on one.
    messages = working_memory["recent_history"] + [
        {"role": "user", "content": user_message}
    ]
    if messages[0]["role"] == "assistant":
        messages = messages[1:]

    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "system": system_prompt,
        "messages": messages
    }

Using XML tags like <memory> in the system prompt is intentional. Claude is specifically trained to respect structured sections in prompts, and named blocks reduce hallucination around injected facts. Don’t just dump text in — structure it.
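One more guardrail worth adding before injection: relevant_facts grows with the user’s history, so cap it. A minimal sketch using the rough 4-characters-per-token approximation (the helper name and budget are illustrative; facts are assumed to arrive already sorted by relevance):

```python
def trim_to_budget(facts: list[str], max_tokens: int = 500) -> list[str]:
    """Greedy trim: keep facts in relevance order until a rough token budget
    is hit. Uses the common ~4 characters per token approximation, which is
    crude but good enough for a guardrail."""
    kept, used = [], 0
    for fact in facts:
        cost = max(1, len(fact) // 4)
        if used + cost > max_tokens:
            break
        kept.append(fact)
        used += cost
    return kept
```

Call it on `working_memory["relevant_facts"]` before building the memory block; 500 tokens of facts is plenty for most assistants.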

Handling Memory Decay and Conflicts

Left unchecked, your memory store accumulates garbage. Users change preferences, outdated facts contradict new ones, and retrieval gets noisy. You need a basic conflict-resolution and decay strategy.

The simplest approach: when inserting a new fragment, check cosine similarity against existing fragments for the same user. If a new fragment is >0.9 similar to an existing one, replace it rather than append. This prevents duplicate/contradictory facts from co-existing.

def upsert_memory_fragment(user_id: str, content: str, source_turn_id: str, db_conn):
    """
    Inserts a memory fragment, replacing near-duplicate entries to avoid conflicts.
    """
    cur = db_conn.cursor()
    register_vector(db_conn)  # same vector adapter setup as in the retrieval snippet
    new_embedding = get_embedding(content)

    # Check for existing near-duplicate
    cur.execute("""
        SELECT id FROM memory_fragments
        WHERE user_id = %s
          AND 1 - (embedding <=> %s) > 0.9
        LIMIT 1
    """, (user_id, new_embedding))

    existing = cur.fetchone()

    if existing:
        # Update in place — new information supersedes old
        cur.execute("""
            UPDATE memory_fragments
            SET content = %s, embedding = %s, source_turn_id = %s, created_at = NOW()
            WHERE id = %s
        """, (content, new_embedding, source_turn_id, existing[0]))
    else:
        cur.execute("""
            INSERT INTO memory_fragments (user_id, content, embedding, source_turn_id)
            VALUES (%s, %s, %s, %s)
        """, (user_id, content, new_embedding, source_turn_id))

    db_conn.commit()

For long-lived agents (months of user data), also add a soft-deletion pass: any fragment not retrieved in 90 days is probably stale. Run this as a weekly cron, not inline.

What Actually Breaks in Production

Here’s what the documentation won’t tell you:

  • Embedding model drift: If you switch embedding models (e.g., from text-embedding-ada-002 to text-embedding-3-small), your existing vectors are no longer comparable. You must re-embed everything. Pin your embedding model version like a dependency.
  • Haiku extraction hallucinations: About 2–5% of extractions will contain fabricated facts, especially from ambiguous messages. Add a validation step or manually audit the first few hundred fragments when you launch.
  • Session boundary edge cases: Users who switch devices mid-session, or who have multiple concurrent sessions, will cause duplicate turn inserts. Deduplicate on (user_id, content_hash, created_at_minute).
  • pgvector index staleness: IVFFlat cluster assignments are fixed when the index is built, so recall degrades as new vectors shift the data distribution. Rebuild the index periodically with REINDEX, or use HNSW indexes (pgvector 0.5+), which handle ongoing inserts better at the cost of more memory.
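The dedup key from the session-boundary bullet can be computed up front and enforced with a unique index or an upsert. A sketch — the function name and the 16-hex-character hash truncation are my choices:

```python
import hashlib
from datetime import datetime

def turn_dedup_key(user_id: str, content: str, created_at: datetime) -> str:
    """Key for a unique index / upsert guard: same user, same content,
    same wall-clock minute collapses to one row."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    minute = created_at.strftime("%Y-%m-%dT%H:%M")
    return f"{user_id}:{content_hash}:{minute}"
```

Store the key in its own column with a UNIQUE constraint and use `INSERT ... ON CONFLICT DO NOTHING` so duplicate inserts from concurrent sessions fail quietly.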

Total Cost Estimate for a Real Workload

For a production app with 500 daily active users, averaging 10 turns per session:

  • Extraction calls (Haiku): 5,000 calls/day × ~$0.002 = ~$10/day
  • Embedding calls (text-embedding-3-small): 5,000/day × ~$0.00002 = ~$0.10/day
  • Retrieval calls (Claude Sonnet, ~8K tokens per request with memory): 5,000/day × ~$0.024 = ~$120/day
  • Pinecone serverless: ~$5–15/month depending on vector count

The Sonnet inference cost dominates, as it always will. Memory retrieval adds maybe 5–10% overhead on top of what you’d spend anyway — it’s not the line item to optimize.

When to Use This Pattern

Use this architecture if: you’re building a product where users return across multiple sessions and the agent’s value compounds with usage — coaching tools, AI assistants, customer success agents, internal knowledge tools.

Skip it if: you’re building one-shot query tools, document summarizers, or anything where each session is genuinely independent. You’re adding complexity and cost you don’t need.

Solo founders: Start with pgvector on your existing Postgres instance. Don’t spin up Pinecone until you have >100K vectors and query latency becomes an issue. The pgvector setup I’ve shown above handles millions of vectors fine on a $50/month RDS instance.

Teams building for enterprise: Add per-user memory isolation at the row level, an audit log of what was stored and why, and a user-facing “forget me” endpoint. GDPR compliance requires the ability to delete all memory fragments for a given user — design for that from day one, not as an afterthought.
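A minimal version of that “forget me” path is two deletes behind one function (a sketch; the deletion order matters because memory_fragments.source_turn_id references conversation_turns):

```python
# Fragments first: they hold foreign keys into conversation_turns
FORGET_USER_SQL = [
    "DELETE FROM memory_fragments WHERE user_id = %s",
    "DELETE FROM conversation_turns WHERE user_id = %s",
]

def forget_user(user_id: str, db_conn) -> None:
    """Hard-delete every trace of a user. Wire this to an authenticated
    endpoint and log the request itself (not the content) for the audit trail."""
    cur = db_conn.cursor()
    for stmt in FORGET_USER_SQL:
        cur.execute(stmt, (user_id,))
    db_conn.commit()
```

If you adopted soft deletes for decay, note that GDPR erasure still requires a hard delete; the deleted_at flag is not enough here.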

Persistent Claude agent memory between conversations is not magic — it’s retrieval-augmented generation applied to user history. The pattern is mature, the tooling is good, and the failure modes are well-understood once you’ve shipped it. The code above is where I’d start on a new project today.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
