Sunday, April 5

Most teams treating LLM prompt caching cost as an afterthought are leaving 40–70% of their API spend on the table. That’s not a marketing number — it’s what you actually see when you instrument a production agent with a 2,000-token system prompt running 10,000 times a day. The cache hit either happens or it doesn’t, and that binary outcome is the difference between a $180/day bill and a $60/day bill on Claude 3.5 Sonnet.

The problem is that most developers have a fuzzy mental model of how caching works at the API level. They know it exists, they maybe tick a box to enable it, and then they wonder why their cache hit rate is 12% instead of 90%. This article is about closing that gap — how the KV cache actually works, what breaks it, how to structure prompts to maximize hits, and when caching genuinely isn’t worth the effort.

What the KV Cache Actually Is (and Isn’t)

When an LLM processes a prompt, every token attends to every previous token via attention operations. The intermediate results of that computation — specifically the Key and Value matrices for each transformer layer — are called the KV cache. On repeated requests, if the prefix of the prompt is identical, the model can skip recomputing those matrices and jump straight to processing the new tokens.

This is fundamentally different from HTTP caching or memoization. You’re not caching the output — you’re caching the intermediate computation for the input prefix. Which means:

  • The prefix must be byte-for-byte identical (not semantically similar — literally identical)
  • The cache is positional — it only applies to the beginning of the prompt, not arbitrary segments
  • Changing a single character in your system prompt invalidates the entire cache for that position and everything after it
  • Dynamic content (timestamps, user names, session IDs) injected at the start of your prompt destroys the cache hit rate entirely
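To make the byte-for-byte rule concrete, here’s a small self-contained sketch — plain Python, no API calls — that measures how much prefix two prompts actually share, which is effectively the comparison the provider’s cache performs:

```python
# Illustrative sketch: the cache matches on exact prefix bytes, so we can
# approximate "how much of this prompt is reusable" by comparing prefixes.
def shared_prefix_len(a: str, b: str) -> int:
    """Length (in characters) of the byte-for-byte identical prefix."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

SYSTEM = "You are a support agent. Follow the policy below.\n" * 20

# Timestamp at the START: prefixes diverge inside the timestamp -> near-zero reuse
bad_1 = f"[Time: 2025-04-05T10:00:00] {SYSTEM}"
bad_2 = f"[Time: 2025-04-05T10:00:01] {SYSTEM}"

# Timestamp at the END: the entire stable block is shared
good_1 = f"{SYSTEM}[Time: 2025-04-05T10:00:00]"
good_2 = f"{SYSTEM}[Time: 2025-04-05T10:00:01]"

print(shared_prefix_len(bad_1, bad_2))    # diverges almost immediately
print(shared_prefix_len(good_1, good_2))  # the whole system block matches
```

The character count is a stand-in for the provider’s token-level matching, but the moral is identical: one early byte of dynamic content forfeits everything after it.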

Provider Implementations Differ Significantly

Anthropic’s prompt caching on Claude requires explicit cache control markers — you opt in per message block using cache_control: {"type": "ephemeral"}. The cache TTL is 5 minutes by default, refreshed on each hit, with a longer TTL available as an option. The minimum cacheable block is 1,024 tokens for Sonnet-class models and 2,048 for Haiku-class models. Cache writes cost 25% more than standard input tokens; cache reads cost 10% of the standard input price.

OpenAI’s caching on GPT-4o and newer models is automatic — no markers needed. Prompts of 1,024 tokens or more are eligible, and the longest matching prefix is cached silently. You can detect cache hits by checking usage.prompt_tokens_details.cached_tokens in the response. Cache reads are 50% of the standard input price — a less aggressive discount than Anthropic’s.
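A helper like the following turns that field into a hit-rate number. This is a sketch against the Chat Completions usage payload shape at time of writing — verify the field names against current OpenAI documentation before relying on them:

```python
# Sketch: fraction of prompt tokens served from OpenAI's automatic cache.
def openai_cached_fraction(usage: dict) -> float:
    prompt = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

# e.g. usage = response.usage.model_dump() from the OpenAI Python SDK
example = {"prompt_tokens": 2048, "prompt_tokens_details": {"cached_tokens": 1792}}
print(openai_cached_fraction(example))  # 0.875
```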

Google’s Gemini uses “context caching” as an explicit API feature where you upload content and get a cache ID back. Minimum 32,768 tokens. You pay per token per hour of storage (~$1.00/million tokens/hour for Gemini 1.5 Pro). It’s architecturally different — more like a server-side document store than a KV cache — and works best for very large, stable contexts.

Real Cost Numbers: When Caching Actually Pays

Let’s work through a concrete scenario: a customer support agent built on Claude 3.5 Haiku with a 1,500-token system prompt (instructions, tone guidelines, product knowledge) running 50,000 requests per day.

Current Haiku pricing (as of mid-2025): $0.80/million input tokens, $4.00/million output tokens. Cache write: $1.00/million tokens. Cache read: $0.08/million tokens.

Without caching: 50,000 requests × 1,500 system prompt tokens = 75M input tokens/day → $60/day just on the system prompt portion.

With caching (90% hit rate):

  • 5,000 cache writes × 1,500 tokens = 7.5M tokens at $1.00/M = $7.50
  • 45,000 cache reads × 1,500 tokens = 67.5M tokens at $0.08/M = $5.40
  • Total for system prompt: $12.90/day

That’s a 78% reduction on the repeated-prefix portion. At this scale, caching pays for itself in hours. At lower volumes — say, 500 requests/day — you’re saving ~$0.60/day, which is trivial unless you’re on a shoestring budget.

The break-even math is straightforward: caching makes sense when (requests per day × system prompt tokens) is large enough that the savings exceed the operational overhead of managing prompt stability.
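That calculation is easy to parameterize. A minimal sketch, using the Haiku prices from the scenario above (plug in your own volumes and rates):

```python
def daily_prefix_costs(requests_per_day: int, prefix_tokens: int, hit_rate: float,
                       input_per_m: float, write_per_m: float, read_per_m: float):
    """Return (uncached_cost, cached_cost) in USD/day for the repeated prefix only."""
    tokens_m = requests_per_day * prefix_tokens / 1_000_000
    uncached = tokens_m * input_per_m
    # Misses pay the cache-write premium; hits pay the discounted read price
    cached = (tokens_m * (1 - hit_rate) * write_per_m
              + tokens_m * hit_rate * read_per_m)
    return uncached, cached

# The support-agent scenario: 50k requests/day, 1,500-token prefix, 90% hit rate
without, with_cache = daily_prefix_costs(50_000, 1_500, 0.9, 0.80, 1.00, 0.08)
print(f"${without:.2f}/day uncached vs ${with_cache:.2f}/day cached")
# → $60.00/day uncached vs $12.90/day cached
```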

How to Structure Prompts for Maximum Cache Hits

This is where most developers lose their cache hits without realizing it. The rule is simple: stable content first, dynamic content last. In practice, violating this is extremely common.

The Prompt Ordering Pattern

import anthropic

client = anthropic.Anthropic()

# WRONG: dynamic content pollutes the cacheable prefix
bad_messages = [
    {
        "role": "user",
        "content": f"[Session: {session_id}] [Time: {timestamp}] {user_query}"
    }
]

# RIGHT: system prompt is fully stable and cacheable
response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,  # 1500+ tokens of stable instructions
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            # Dynamic content goes here, AFTER the cached prefix
            "content": f"User query: {user_query}\nSession context: {session_context}"
        }
    ]
)

# Check cache performance
usage = response.usage
hit = usage.cache_read_input_tokens > 0
print(f"Cache {'HIT' if hit else 'MISS'}: {usage.cache_read_input_tokens} cached, "
      f"{usage.cache_creation_input_tokens} written, {usage.input_tokens} uncached")

If you’re building multi-turn conversations, the same logic applies to conversation history. Mark the oldest, most stable turns with cache_control, and let the recent dynamic turns fall outside the cache boundary. This pairs well with the system prompts best practices we’ve covered elsewhere — structuring your instructions for consistency isn’t just about agent behavior, it’s directly tied to cache hit rates.

What Breaks Your Cache Without Warning

  • Timestamps in system prompts. “Today is {date}” at the top of your system prompt means zero cache hits. Move date context to the user message.
  • User-specific injection at the prefix. “You are helping {user_name}” as the first line. Move personalization to after the cacheable block.
  • Random few-shot examples. Shuffling example order between requests breaks the cache. Either fix the order or move examples into a dedicated cached block.
  • Whitespace/formatting inconsistency. Extra newlines, trailing spaces — any byte difference invalidates the match. Use a constant, not a string that’s being assembled at runtime.
  • Model version changes. Switching from claude-haiku-4-5 to claude-haiku-4-5-20251001 resets the cache. Pin your model version string as a constant.
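One cheap guardrail against all of these — a sketch, not a provider feature — is to fingerprint the assembled prefix and log it alongside each request, so accidental byte drift shows up in your logs instead of silently tanking the hit rate:

```python
import hashlib

def prefix_fingerprint(prefix: str) -> str:
    """Short stable hash of the exact bytes the cache will try to match."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]

base = "You are a support agent. Follow the policy below."
print(prefix_fingerprint(base) == prefix_fingerprint(base))        # True
# A single trailing space is a different prefix as far as the cache is concerned:
print(prefix_fingerprint(base + " ") == prefix_fingerprint(base))  # False
```

If the fingerprint changes between deploys when you didn’t intend it to, you’ve found your cache-miss culprit before it hits the bill.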

Multi-Turn Agents and RAG: The Harder Caching Problems

For single-turn Q&A with a static system prompt, caching is straightforward. The interesting challenges appear in agents with conversation history and RAG pipelines.

Caching in Conversational Agents

In a multi-turn agent, the conversation history grows with each turn. The naive approach — marking the entire history as cacheable — doesn’t work well because the history changes every turn by definition. The practical pattern is to cache the system prompt (always static) and potentially cache a “frozen” portion of the conversation up to some checkpoint.
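A minimal sketch of that checkpoint idea — this is our own helper, not an SDK feature; it wraps the chosen message so cache_control attaches to the last “frozen” turn:

```python
def mark_cache_checkpoint(messages: list[dict], checkpoint: int) -> list[dict]:
    """Attach cache_control to messages[checkpoint]; later turns stay uncached."""
    out = []
    for i, msg in enumerate(messages):
        msg = dict(msg)  # shallow copy so the caller's history isn't mutated
        if i == checkpoint and isinstance(msg["content"], str):
            msg["content"] = [{
                "type": "text",
                "text": msg["content"],
                "cache_control": {"type": "ephemeral"},
            }]
        out.append(msg)
    return out

history = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {"role": "user", "content": "Follow-up"},
]
# Freeze everything up to and including the assistant's first answer
marked = mark_cache_checkpoint(history, checkpoint=1)
```

Everything up to the marked turn becomes the cacheable prefix; the follow-up stays outside the boundary.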

For agents with persistent memory — which we’ve covered in detail in our persistent memory architecture guide — you can serialize a stable “memory snapshot” into a cacheable block and keep only the current session’s live messages uncached.

Caching in RAG Pipelines

RAG introduces a complication: retrieved documents change per query. If you inject retrieved chunks between the system prompt and the user message, that dynamic content breaks the cache boundary for everything below it. The escape hatch is that Anthropic’s cache_control lets you mark multiple blocks (up to four breakpoints), so the stable corpus and the dynamic retrieval can sit on opposite sides of a boundary. The pattern:

# Multiple cache control blocks for RAG
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # Cache block 1
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": STATIC_DOCUMENT_CORPUS,  # Large stable doc set
                    "cache_control": {"type": "ephemeral"}  # Cache block 2
                },
                {
                    "type": "text",
                    # Dynamic retrieved chunks + user query go here, uncached
                    "text": f"Relevant context:\n{retrieved_chunks}\n\nQuestion: {user_query}"
                }
            ]
        }
    ]
)

This is especially valuable when you have a large static knowledge base (documentation, policy docs) alongside dynamic retrieval. If you’re building RAG pipelines from scratch, designing the document injection strategy with cache boundaries in mind from day one saves significant refactoring later.

Three Misconceptions That Cost Developers Money

Misconception 1: “Cache hits are guaranteed after the first request.” They’re not. Anthropic’s default TTL is 5 minutes. If your request rate is low — say, one request per hour — you’re paying cache write costs on every single request and getting zero reads. Caching only pays off when requests arrive faster than the cache expires. For low-traffic use cases, caching adds cost, not savings.

Misconception 2: “Output caching and prompt caching are the same thing.” They’re completely different. KV cache / prompt caching saves compute on the input side. Semantic caching (storing full responses and retrieving them for similar future queries) caches the output. Some libraries like GPTCache implement output caching at the application layer, which is useful for truly repeated identical queries but useless for novel queries against a shared context. Know which one you’re implementing.
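To make the distinction concrete, here’s a toy exact-match output cache — pure application-layer code, nothing to do with the KV cache. (A semantic cache like GPTCache relaxes the exact-match lookup to embedding similarity; this sketch keeps it literal.)

```python
import hashlib

class ExactMatchResponseCache:
    """Toy output cache: stores full responses keyed by the exact prompt bytes."""
    def __init__(self):
        self._store: dict[str, str] = {}

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        return self._store.get(self._key(prompt))

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = response

cache = ExactMatchResponseCache()
cache.put("What is your refund policy?", "Refunds within 30 days...")
print(cache.get("What is your refund policy?") is not None)  # True — model call skipped
print(cache.get("what is your refund policy?"))  # None — different bytes, no hit
```

Note what it does: it skips the model entirely on a hit. Prompt caching never does that — it always generates a fresh response, just cheaper and faster.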

Misconception 3: “Longer cached prompts always mean more savings.” The discount is proportional to tokens, but the minimum thresholds matter. On Claude, you need 1,024 tokens to trigger caching. Padding a 500-token system prompt to 1,024 just to hit the threshold doesn’t make financial sense — you’re paying 25% extra on all those tokens hoping for reads that may not materialize.

Batch Processing: Caching’s Highest-ROI Use Case

If you’re running batch document processing — classification, extraction, summarization — this is where prompt caching generates the highest ROI with the least architectural complexity. The task instruction is always the same; only the document changes. Large-scale batch processing with the Claude API benefits enormously from a stable cached instruction block, with the document injected as uncached content per request.

At 10,000 document classifications with a 2,000-token instruction prompt on Haiku: without caching, that’s 20M instruction tokens at $0.80/M = $16. With caching, you write once (or a handful of times) and read 9,999+ times at $0.08/M = $1.60 for the instruction portion. That’s a $14.40 saving per 10,000 runs on just the instruction tokens — and instruction prompts for classification tasks are often much longer than 2,000 tokens.

Monitoring Cache Performance in Production

Instrumenting cache performance should be non-negotiable if you’re spending more than $50/month on API calls. The usage object gives you everything you need:

import json
from datetime import datetime, timezone

def log_cache_metrics(response, request_id: str):
    usage = response.usage
    
    total_input = (
        usage.input_tokens + 
        usage.cache_read_input_tokens + 
        usage.cache_creation_input_tokens
    )
    
    cache_hit_rate = (
        usage.cache_read_input_tokens / total_input 
        if total_input > 0 else 0
    )
    
    # Estimated cost (Haiku pricing)
    cost = (
        (usage.input_tokens * 0.80 / 1_000_000) +
        (usage.cache_creation_input_tokens * 1.00 / 1_000_000) +
        (usage.cache_read_input_tokens * 0.08 / 1_000_000) +
        (usage.output_tokens * 4.00 / 1_000_000)
    )
    
    metrics = {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "uncached_tokens": usage.input_tokens,
        "cache_write_tokens": usage.cache_creation_input_tokens,
        "cache_read_tokens": usage.cache_read_input_tokens,
        "cache_hit_rate": round(cache_hit_rate, 3),
        "estimated_cost_usd": round(cost, 6)
    }
    
    # Ship to your observability stack
    print(json.dumps(metrics))
    return metrics

Track cache hit rate as a first-class metric alongside latency and cost. A sudden drop in cache hit rate — from 85% to 20% — almost always means someone changed the system prompt or introduced dynamic content into the cached prefix. This kind of observability pairs well with broader LLM monitoring setups; if you’re evaluating platforms, our comparison of Helicone, LangSmith, and Langfuse covers which tools expose cache metrics most usefully.
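A rolling-window tracker along these lines can drive that alert — a sketch, with arbitrary window and threshold values you’d tune to your traffic:

```python
from collections import deque

class CacheHitMonitor:
    """Rolling cache hit rate over the last `window` requests, with a drop alert."""
    def __init__(self, window: int = 100, threshold: float = 0.5):
        self.rates = deque(maxlen=window)
        self.threshold = threshold

    def record(self, cached_tokens: int, total_input_tokens: int) -> None:
        rate = cached_tokens / total_input_tokens if total_input_tokens else 0.0
        self.rates.append(rate)

    @property
    def hit_rate(self) -> float:
        return sum(self.rates) / len(self.rates) if self.rates else 0.0

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid cold-start noise
        return len(self.rates) == self.rates.maxlen and self.hit_rate < self.threshold

monitor = CacheHitMonitor(window=10, threshold=0.5)
for _ in range(10):
    monitor.record(cached_tokens=200, total_input_tokens=2000)  # sustained 10% hits
print(monitor.should_alert())  # True
```

Feed it the same cache_read_input_tokens and total-input figures the log_cache_metrics function above computes, and wire should_alert() into whatever paging system you already use.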

When Not to Bother With Prompt Caching

Caching isn’t always the right answer. Skip it when:

  • Your system prompt is under 1,024 tokens. You won’t hit the minimum threshold on Claude. Focus on prompt compression instead.
  • Your request rate is very low. Less than ~100 requests/day means cache TTL decay will eat most of your potential reads. Just pay full input price.
  • Every prompt is unique. If you’re doing one-off analysis with different documents and instructions each time, there’s nothing to cache. Consider whether you should be using Gemini’s context caching for large stable documents instead.
  • You’re using a model that doesn’t support it. Not all providers/models implement prompt caching. Check before designing around it.

Frequently Asked Questions

How do I know if my Claude API calls are actually hitting the cache?

Check the usage object in the response: cache_read_input_tokens will be greater than zero on a cache hit, and cache_creation_input_tokens will be greater than zero when a cache entry is being written. If both are zero, caching isn’t triggering — usually because the prompt prefix changed or your block is under the 1,024-token minimum.

Does prompt caching affect the quality or accuracy of responses?

No. The cached KV matrices are mathematically identical to what would have been computed from scratch. The model processes your prompt the same way — it just skips recomputing the attention keys and values for the cached prefix. Output quality is unaffected.

What is the difference between prompt caching and semantic caching?

Prompt caching (KV cache reuse) saves computation on the input side by reusing transformer intermediate states for identical prefixes — it still generates a fresh response. Semantic caching stores complete responses and retrieves them when a future query is semantically similar, bypassing the model entirely. Semantic caching is much more aggressive (and potentially incorrect) but can work well for truly repetitive FAQ-style queries.

Can I cache conversation history in multi-turn agents?

Yes, but strategically. You can mark earlier turns with cache_control to cache the accumulated history up to a stable checkpoint. The practical approach is to cache the system prompt always, cache a “frozen” history snapshot periodically, and leave only the current active exchange uncached. Be aware that every new turn still invalidates the cache for the latest position.

Does OpenAI’s automatic caching work the same way as Anthropic’s explicit cache_control?

Architecturally similar (both reuse KV states for matching prefixes), but the implementation differs. OpenAI caches automatically without markers and gives a 50% discount on cache reads. Anthropic requires explicit opt-in via cache_control blocks, charges 25% extra on cache writes, and gives a 90% discount on cache reads — more aggressive savings but requires deliberate prompt structuring. For high-volume stable prompts, Anthropic’s discount is better; for low-friction integration, OpenAI’s automatic caching requires zero code changes.

How long does a cached prompt stay valid on Claude?

The default TTL for Claude’s ephemeral cache is 5 minutes, which resets on each cache hit. Anthropic has indicated extended TTLs may be available for enterprise tiers. Practically, this means you need sustained request rates — if your traffic drops for more than 5 minutes, you’ll pay a cache write cost on the next request before reads resume.

The Bottom Line: Who Should Prioritize This

If you’re a solo founder or small team running an agent with a fixed system prompt and 1,000+ daily requests: enable caching immediately, move all dynamic content to the end of your prompt, and instrument the cache hit rate. This is the highest-ROI infrastructure change you can make in an afternoon.

If you’re running batch processing pipelines — document classification, extraction, summarization — restructure your prompts so the task instruction is a stable cacheable block and the document is the only variable. The savings scale linearly with volume.

If you’re at enterprise scale with complex multi-turn agents: invest in designing a proper cache boundary strategy. Decide which parts of your context are stable (system prompt, static knowledge, frozen history) vs. dynamic (retrieved chunks, current turn), and place cache_control markers accordingly. The implementation cost is days, not weeks, and the LLM prompt caching cost savings at scale justify it thoroughly.

If your traffic is low and sporadic (under 500 requests/day with irregular timing): don’t bother. The TTL decay will mostly neutralize your savings, and the engineering overhead of maintaining strict prompt prefix stability isn’t worth it at that volume. Focus on model selection and prompt compression instead — switching from Sonnet to Haiku for appropriate tasks will save you more.

Put this into practice

Try the Prompt Engineer agent — ready to use, no setup required.

Browse Agents →

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
