Saturday, March 21

If you’re running an LLM-powered agent in production and haven’t implemented a response caching strategy, you’re almost certainly burning money on identical or near-identical API calls. I’ve seen agents send the same system prompt + query combination dozens of times per hour, paying full price every single time. A well-implemented caching layer routinely cuts that bill by 30–50%, sometimes more — and the implementation is less complex than most people assume.

This guide covers three distinct caching approaches: Anthropic’s native prompt caching (which works differently than most people think), semantic caching for fuzzy query matching, and TTL-based response caching for deterministic outputs. Each has a different cost profile and failure mode. I’ll show you working implementations and tell you exactly when each one breaks.

Why LLM API Costs Compound Faster in Agent Architectures

Single-turn chatbots are forgiving. Agents aren’t. A typical agent loop involves a large system prompt (tool definitions, persona, instructions) sent on every invocation — often 2,000–4,000 tokens before the user’s query even arrives. At Claude 3.5 Sonnet pricing (~$3 per million input tokens), a 3,000-token system prompt sent 10,000 times per day costs roughly $90/day in system prompt tokens alone, before you’ve processed a single user message.

The pattern that kills budgets: high-frequency queries with low variance. A customer support agent where 60% of tickets ask variations of “how do I reset my password” is a perfect caching candidate. Without caching, you pay full inference cost every time. With the right strategy, you pay once and serve from cache for hours.

Strategy 1: Anthropic’s Prompt Caching (The Lowest-Hanging Fruit)

Anthropic’s prompt caching is a server-side feature that caches the KV (key-value) computation for specified prompt prefixes. You’re not caching the response — you’re caching the expensive transformer computation for a stable prefix, so new requests that share that prefix skip recomputing it. The pricing: cache reads cost $0.30 per million tokens (vs $3 for uncached input tokens on Sonnet) — a 90% reduction on those tokens. Cache writes, by contrast, carry a 25% premium over the base input rate, so a cached prefix pays for itself from the second hit onward.

The catch most tutorials skip: the cache has a 5-minute TTL by default (extendable to 1 hour at a higher cache-write price), and the cached prefix must be at least 1,024 tokens on Sonnet-class models. If your system prompt is 800 tokens, you won’t hit the threshold. Also, cache hits aren’t guaranteed — Anthropic doesn’t expose a hit-rate metric directly, but `cache_read_input_tokens` in the response’s usage object tells you how many tokens were served from cache on each call.
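Rough break-even math makes the write/read tradeoff concrete. This sketch assumes Sonnet-class rates at time of writing (base input $3/M, cache write $3.75/M, cache read $0.30/M); verify current pricing before relying on the numbers:

```python
# Back-of-envelope break-even estimate for Anthropic prompt caching.
# Assumed rates (verify against current vendor pricing):
#   base input $3.00/M, cache write $3.75/M (1.25x), cache read $0.30/M (0.1x).

def prompt_cache_cost(prefix_tokens: int, calls: int, hit_rate: float) -> float:
    """Estimated USD cost of the prefix across `calls` requests with caching."""
    write, read = 3.75 / 1e6, 0.30 / 1e6
    hits = calls * hit_rate
    misses = calls - hits  # each miss pays the (more expensive) write price
    return misses * prefix_tokens * write + hits * prefix_tokens * read

def uncached_cost(prefix_tokens: int, calls: int) -> float:
    """Same prefix, same calls, no caching: full base input price every time."""
    return calls * prefix_tokens * (3.00 / 1e6)

# A 3,000-token prefix sent 10,000 times/day at a 95% hit rate costs
# roughly $14/day cached vs $90/day uncached under these assumed rates.
```

Note the 0% hit-rate case: all-miss traffic actually costs *more* than not caching at all, which is why bursty, infrequent workloads can lose money on prompt caching.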

Here’s the correct way to implement it:

import anthropic

client = anthropic.Anthropic()

# Your large, stable system prompt — must be 1024+ tokens to qualify
SYSTEM_PROMPT = """You are a customer support agent for Acme Corp...
[your full system prompt here — needs to be substantial]
"""

def query_with_prompt_cache(user_message: str, conversation_history: list) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # Mark this block for caching
            }
        ],
        messages=conversation_history + [
            {"role": "user", "content": user_message}
        ]
    )
    
    usage = response.usage
    
    return {
        "content": response.content[0].text,
        "cache_read_tokens": usage.cache_read_input_tokens,
        "cache_write_tokens": usage.cache_creation_input_tokens,
        "uncached_input_tokens": usage.input_tokens,
    }

The first call will show `cache_creation_input_tokens` populated — that’s the “write” cost. Subsequent calls within the TTL window will show `cache_read_input_tokens` instead. If you’re not seeing cache reads, your system prompt is under the token threshold or requests are spaced too far apart.

When Prompt Caching Doesn’t Help

If your system prompt is dynamic — injecting user-specific context, timestamps, or frequently changing tool schemas into it — prompt caching breaks down. The cache key is based on the exact byte sequence of the prefix, so any variation means a cache miss. Keep your cacheable prefix static, and append dynamic content after the cache break point.

Strategy 2: Semantic Caching for Near-Duplicate Queries

Prompt caching handles the system prompt. Semantic caching handles the user query side. The idea: embed incoming queries, store past (query, response) pairs with their embeddings, and serve cached responses when a new query is close enough in vector space. “How do I reset my password?” and “I can’t log in, forgot my password” should hit the same cache entry.

I use Redis with the `redis-py` client and OpenAI’s `text-embedding-3-small` for embeddings (cheaper than alternatives at $0.02 per million tokens). The similarity threshold is where you’ll spend most of your tuning time — too low and you serve wrong answers, too high and your hit rate collapses.

import numpy as np
import json
import hashlib
from openai import OpenAI
import redis
from datetime import timedelta

openai_client = OpenAI()
redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)

SIMILARITY_THRESHOLD = 0.92  # Tune this carefully — 0.95 is safer, 0.88 gets more hits but more errors
CACHE_TTL = 3600  # 1 hour in seconds
EMBEDDING_MODEL = "text-embedding-3-small"

def get_embedding(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=text
    )
    return response.data[0].embedding

def cosine_similarity(a: list, b: list) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def get_cached_response(query: str) -> str | None:
    query_embedding = get_embedding(query)
    
    # Scan existing cache keys — use a dedicated Redis index in production
    for key in redis_client.scan_iter("semantic_cache:*"):
        raw = redis_client.get(key)
        if raw is None:
            continue  # key expired between scan and get
        cached = json.loads(raw)
        similarity = cosine_similarity(query_embedding, cached["embedding"])
        
        if similarity >= SIMILARITY_THRESHOLD:
            print(f"Cache hit — similarity: {similarity:.3f}")
            return cached["response"]
    
    return None

def cache_response(query: str, response: str):
    embedding = get_embedding(query)
    cache_key = f"semantic_cache:{hashlib.md5(query.encode()).hexdigest()}"
    
    redis_client.setex(
        cache_key,
        timedelta(seconds=CACHE_TTL),
        json.dumps({"embedding": embedding, "response": response, "query": query})
    )

def query_with_semantic_cache(query: str, llm_fn) -> str:
    cached = get_cached_response(query)
    if cached:
        return cached
    
    # Cache miss — call the LLM
    response = llm_fn(query)
    cache_response(query, response)
    return response

Production warning: The naive `scan_iter` loop above doesn’t scale past a few thousand entries. In production, use a proper vector store — Redis Stack with vector similarity search, Pinecone, or Qdrant. The linear scan above will add 200ms+ latency at 10k+ entries, which defeats the purpose.

What Breaks With Semantic Caching

Semantic caching is genuinely dangerous for high-stakes or time-sensitive queries. “What’s the current status of my order?” should never be served from a semantic cache — the answer changes. Same with anything touching live data. I scope semantic caching strictly to queries where the answer is stable: product FAQs, documentation lookups, policy explanations. For anything stateful, use TTL-based caching or skip caching entirely.
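One way to enforce that scoping is a cheap category gate in front of the semantic lookup. A minimal sketch: the keyword classifier below is a deliberately naive stand-in for whatever intent classifier you already run; the point is the gate, not the classification.

```python
# Gate the semantic cache by query category. Only stable-answer categories
# are allowed through; anything touching live state skips the cache entirely.

CACHEABLE_CATEGORIES = {"faq", "docs", "policy"}

def classify_query(query: str) -> str:
    """Naive keyword classifier; replace with a real intent model in production."""
    q = query.lower()
    if any(w in q for w in ("order", "status", "my account", "balance")):
        return "stateful"  # touches live data, so never serve from semantic cache
    if any(w in q for w in ("policy", "terms", "refund")):
        return "policy"
    return "faq"

def semantic_cache_allowed(query: str) -> bool:
    return classify_query(query) in CACHEABLE_CATEGORIES
```

Call `semantic_cache_allowed(query)` before `get_cached_response(query)`; on a `False`, you skip both the embedding call and the lookup, which also saves the embedding cost on queries that could never be cache hits anyway.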

Strategy 3: Exact-Match TTL Caching for Deterministic Pipelines

The simplest strategy is often the most effective: hash the exact inputs (system prompt + user message + model + temperature), cache the response with an expiry, and serve from cache on exact matches. No embeddings, no similarity math. Costs essentially nothing to implement and has zero false-positive risk.

import hashlib
import json
import redis
from datetime import timedelta

redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)

def make_cache_key(system_prompt: str, user_message: str, model: str, temperature: float) -> str:
    payload = json.dumps({
        "system": system_prompt,
        "user": user_message,
        "model": model,
        "temperature": temperature
    }, sort_keys=True)
    return f"llm_exact:{hashlib.sha256(payload.encode()).hexdigest()}"

def query_with_exact_cache(
    system_prompt: str,
    user_message: str,
    llm_fn,
    model: str = "claude-sonnet-4-5",
    temperature: float = 0.0,
    ttl_seconds: int = 3600,
) -> dict:
    cache_key = make_cache_key(system_prompt, user_message, model, temperature)
    cached = redis_client.get(cache_key)
    
    if cached:
        result = json.loads(cached)
        result["from_cache"] = True
        return result
    
    # Call LLM and cache the result
    result = llm_fn(system_prompt, user_message, model, temperature)
    result["from_cache"] = False
    
    redis_client.setex(cache_key, timedelta(seconds=ttl_seconds), json.dumps(result))
    return result

Set temperature to 0 for anything you want to cache reliably. Non-zero temperature means the same input can produce different valid outputs — caching any one of them is technically lossy. For most agent tasks (classification, extraction, lookup), deterministic outputs are fine and often preferred.

Combining All Three Strategies: A Layered Cache Architecture

In practice, I run all three as a cascade. The lookup order matters:

  1. Exact-match cache check — cheapest to evaluate, zero risk of wrong answers
  2. Semantic cache check — catches near-duplicates, only for stable-answer query categories
  3. LLM call with prompt caching enabled — always cache the system prompt prefix regardless

This layered approach means you’re only paying embedding costs when exact-match misses, and only paying full LLM costs when both caches miss. In a production customer support agent I tuned last quarter, this cascade reduced LLM API calls by 47% over a 30-day window. The exact-match layer handled 31% alone — people really do ask identical questions repeatedly.
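The cascade above can be sketched as a single dispatch function. Everything here is hypothetical wiring: the layer callables stand in for the exact-match, semantic, and LLM implementations from the earlier sections.

```python
from typing import Callable, Optional

def cached_query(
    query: str,
    exact_lookup: Callable[[str], Optional[str]],
    semantic_lookup: Callable[[str], Optional[str]],
    semantic_allowed: Callable[[str], bool],
    llm_call: Callable[[str], str],
    store: Callable[[str, str], None],
) -> tuple[str, str]:
    """Run the three-layer cascade; returns (response, layer_that_answered)."""
    # Layer 1: exact match. Cheapest check, zero false-positive risk.
    if (hit := exact_lookup(query)) is not None:
        return hit, "exact"
    # Layer 2: semantic. Embedding cost is only paid after an exact miss,
    # and only for query categories whose answers are stable.
    if semantic_allowed(query) and (hit := semantic_lookup(query)) is not None:
        return hit, "semantic"
    # Layer 3: real LLM call (with prompt caching enabled inside llm_call).
    response = llm_call(query)
    store(query, response)
    return response, "miss"
```

Returning the layer name alongside the response is what feeds the per-request metrics discussed below: without it you can’t attribute savings to a specific layer.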

TTL Strategy: How Long Should You Cache?

Wrong TTL settings are the most common failure mode I see. People either cache too aggressively (serving stale answers after product updates) or too conservatively (near-zero cache hit rates). Some practical defaults:

  • Product FAQs / policy docs: 24 hours. Update cache proactively when docs change.
  • Classification tasks (sentiment, intent detection): 48–72 hours. The classification logic doesn’t drift.
  • Summarization of static documents: 7 days or until document hash changes.
  • Anything touching external APIs or live data: Don’t cache, or cache for 60–300 seconds max.
  • Prompt caching (Anthropic): Keep requests frequent enough to sustain the 5-minute default TTL. If your agent goes idle, the cache expires.

Build a cache invalidation trigger into your content pipeline. When your knowledge base updates, flush the relevant semantic cache entries. Serving confidently wrong answers from cache is worse than serving no answers at all.
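A low-effort way to get that invalidation is to fold a content-version hash into the exact-match cache key, so a knowledge-base update silently invalidates every dependent entry without an explicit flush pass. A sketch, assuming your knowledge base can be hashed as a list of document strings:

```python
import hashlib

def kb_content_hash(documents: list[str]) -> str:
    """A short, deterministic fingerprint of the current knowledge base."""
    h = hashlib.sha256()
    for doc in documents:
        h.update(doc.encode())
    return h.hexdigest()[:16]

def versioned_cache_key(query: str, kb_version: str) -> str:
    """Fold the KB version into the cache key.

    When docs change, kb_version changes, every old key simply stops
    matching, and stale entries age out via their TTL. No flush required.
    """
    digest = hashlib.sha256(f"{kb_version}:{query}".encode()).hexdigest()
    return f"llm_exact:{digest}"
```

The tradeoff versus explicit flushing: old entries linger in Redis until their TTL expires (wasting some memory), but you can never serve a stale answer after a content update.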

Measuring What You’re Actually Saving

Instrument everything before you optimize. Track per-request: cache layer hit (exact/semantic/miss), input tokens, cached tokens, cost estimate. Without this, you’re guessing at ROI.

import time
import json

def track_cache_metrics(fn):
    """Decorator to log cache performance per request."""
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        duration_ms = (time.time() - start) * 1000
        
        # Emit to your metrics system (Datadog, Prometheus, whatever)
        print(json.dumps({
            "cache_layer": result.get("cache_layer", "miss"),
            "from_cache": result.get("from_cache", False),
            "duration_ms": round(duration_ms, 1),
            "estimated_cost_usd": estimate_cost(result)
        }))
        return result
    return wrapper

def estimate_cost(result: dict) -> float:
    # Claude Sonnet pricing at time of writing — verify current rates
    input_cost = (result.get("input_tokens", 0) / 1_000_000) * 3.00
    cached_cost = (result.get("cache_read_tokens", 0) / 1_000_000) * 0.30
    output_cost = (result.get("output_tokens", 0) / 1_000_000) * 15.00
    return round(input_cost + cached_cost + output_cost, 6)

Who Should Use Which Strategy

Solo founders / early-stage products: Start with exact-match caching only. It’s two hours of implementation, zero ongoing complexity, and you’ll likely hit 20–30% reduction immediately. Add Anthropic’s prompt caching if your system prompt is over 1,024 tokens — it’s just a header change. Skip semantic caching until you have the volume to justify tuning the similarity threshold.

Teams with established agent workflows: Layer in semantic caching once you’ve categorized your query types. Keep it scoped to stable-answer categories. Use a proper vector store from day one — don’t scale Redis scan into production pain.

High-volume enterprise deployments: All three layers, plus proactive cache warming for predictable query patterns, cache invalidation hooks in your content pipeline, and per-tenant cache namespacing if you’re multi-tenant. At scale, even a 5% improvement in cache hit rate is meaningful money.

The bottom line: response caching isn’t premature optimization — it’s basic infrastructure for any agent running at non-trivial volume. Start with the exact-match layer today, enable prompt caching on your system prompt, and measure for two weeks before deciding if semantic caching is worth the added complexity. Most teams get 30–40% cost reduction from just the first two, with no meaningful tradeoffs in output quality.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
