If you’re running LLM workloads at any meaningful scale, prompt caching is probably the fastest cost lever you haven’t pulled yet. Most teams I talk to are still sending the same 2,000-token system prompt on every single API call — and that adds up brutally fast. At Anthropic’s Claude Sonnet pricing, a 2,000-token system prompt costs roughly $0.006 per request in input tokens alone. Run 10,000 requests a day and you’re burning $60/day, $1,800/month, just on tokens you’re sending identically every time.
The good news: both Anthropic and OpenAI now offer native prompt caching that can cut that spend by 50-90% on cached content. This article covers exactly how to implement it, where it breaks, and how to layer in response caching on top for maximum savings.
What Prompt Caching Actually Does (and Doesn’t Do)
Prompt caching works by storing a KV (key-value) cache of the computed attention states for a prefix of your prompt. When you send the same prefix again, the model skips recomputing those tokens and loads the cached state instead. You pay a reduced rate for cache hits and a slightly elevated rate to write to the cache on first use.
This is different from response caching (storing the full output and returning it verbatim). Prompt caching still runs inference — you still get a fresh, variable response. You’re just not re-processing the static parts of your context on every call.
Where it saves you money
- Large system prompts — personas, instructions, brand guidelines, output format specs
- Reference documents — contracts, codebases, knowledge base articles you stuff into context
- Few-shot examples — long example sets that don’t change between requests
- Conversation history — in multi-turn agents where early turns are stable
Where it won’t help
- Short prompts under the ~1,024-token minimum — they won’t be cached at all (details below)
- Highly dynamic prompts where the prefix changes every request
- Output tokens — only input/context tokens are cached
Anthropic Prompt Caching: Implementation and Pricing
Claude’s prompt caching is available on Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku, and Claude 3.5 Haiku. The pricing structure as of this writing:
- Cache write: 1.25x the base input token price
- Cache read (hit): 0.1x the base input token price — that’s a 90% discount
- Cache TTL: 5 minutes (resets on each access)
So on Claude 3.5 Sonnet ($3/MTok input), you pay $3.75/MTok to write the cache and $0.30/MTok on hits. Your break-even is roughly 2 requests within 5 minutes — after that, every cache hit is pure savings.
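That break-even claim is easy to sanity-check with a few lines of arithmetic. Here’s a minimal cost model using the figures above (prices hard-coded as of this writing; verify current rates before relying on them):

```python
# Rough cost model for Anthropic prompt caching, in USD.
# All prices are per million input tokens (MTok), from the figures above.
BASE_INPUT = 3.00                 # Claude 3.5 Sonnet base input price
CACHE_WRITE = BASE_INPUT * 1.25   # first request writes the cache
CACHE_READ = BASE_INPUT * 0.10    # subsequent hits read it

def cost_usd(prompt_tokens: int, requests: int, cached: bool) -> float:
    """Input-token cost for `requests` calls sharing one static prompt."""
    mtok = prompt_tokens / 1_000_000
    if not cached:
        return requests * mtok * BASE_INPUT
    # one cache write, then (requests - 1) cache reads
    return mtok * (CACHE_WRITE + (requests - 1) * CACHE_READ)

# The 2,000-token system prompt at 10,000 requests/day from the intro
print(f"uncached: ${cost_usd(2000, 10_000, False):.2f}/day")  # $60.00/day
print(f"cached:   ${cost_usd(2000, 10_000, True):.2f}/day")   # $6.01/day
```

This assumes every request after the first is a cache hit, which is optimistic; your real savings scale with your hit rate.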
To enable it, you use the cache_control parameter on specific content blocks:
import anthropic

client = anthropic.Anthropic()

# Your expensive, static system prompt
SYSTEM_PROMPT = """You are a senior contract analyst specializing in SaaS agreements...
[2000+ tokens of detailed instructions, examples, and context]
"""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # This tells Claude to cache everything up to this point
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Review this contract clause: [dynamic user input here]"
        }
    ]
)

# Check if the cache was hit
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
The cache_creation_input_tokens and cache_read_input_tokens fields in the response tell you exactly what happened. On the first call you’ll see cache creation tokens; on subsequent calls within 5 minutes, you’ll see cache read tokens and near-zero cache creation tokens.
Caching large documents with Claude
You can cache multiple blocks, which is useful when you’re injecting a reference document alongside your system prompt:
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a contract analyst. Use the reference document below.",
            "cache_control": {"type": "ephemeral"}  # Cache the system instructions
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    # Cache the big reference doc that doesn't change
                    "text": LARGE_REFERENCE_DOCUMENT,
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    "type": "text",
                    # This part changes per request — not cached
                    "text": f"Now analyze this specific clause: {user_clause}"
                }
            ]
        }
    ]
)
Important gotcha: The minimum cacheable block is 1,024 tokens for Sonnet/Opus and 2,048 tokens for Haiku. Anything smaller won’t be cached regardless of the flag. This trips people up when they first implement it and see zero cache reads.
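One way to avoid that trap is to gate the cache_control flag on an estimated block size. The following is a sketch, not official guidance: the 4-characters-per-token estimate is a crude heuristic rather than Anthropic's tokenizer, and the helper names are made up here.

```python
# Size guard for Anthropic's minimum cacheable block. MIN_CACHEABLE
# mirrors the documented thresholds; estimate_tokens uses a rough
# ~4-chars-per-token heuristic, NOT the real tokenizer.
MIN_CACHEABLE = {"sonnet": 1024, "opus": 1024, "haiku": 2048}

def estimate_tokens(text: str) -> int:
    # Crude approximation for English prose; verify against the
    # usage fields the API actually reports
    return len(text) // 4

def system_block(text: str, model_family: str) -> dict:
    """Build a system content block, marking it cacheable only when it
    plausibly clears the minimum token threshold."""
    block = {"type": "text", "text": text}
    if estimate_tokens(text) >= MIN_CACHEABLE[model_family]:
        block["cache_control"] = {"type": "ephemeral"}
    return block
```

If the estimate is borderline, check cache_creation_input_tokens on the first response to confirm the block actually got cached.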
OpenAI Automatic Prompt Caching
OpenAI’s approach is different: caching is automatic on GPT-4o and GPT-4o-mini — you don’t opt in. Any prompt prefix of 1,024+ tokens that you’ve sent recently gets cached automatically, with a 50% discount on cached input tokens (vs Anthropic’s 90%).
The tradeoff: less control, less savings. You can’t explicitly mark what to cache, so if your prompt prefix changes even slightly, you’ll miss the cache. OpenAI’s documentation says caches are typically cleared after 5-10 minutes of inactivity, and always within an hour of last use.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            # Keep this IDENTICAL across requests to hit the cache
            "content": STATIC_SYSTEM_PROMPT
        },
        {
            "role": "user",
            "content": dynamic_user_message
        }
    ]
)

# Check cache usage in response
usage = response.usage
if hasattr(usage, 'prompt_tokens_details'):
    cached = usage.prompt_tokens_details.cached_tokens
    print(f"Cached tokens: {cached}")
    print(f"Cache hit rate: {cached / usage.prompt_tokens * 100:.1f}%")
For OpenAI, the main thing to get right is prefix stability. Put all your static content — system prompt, examples, reference docs — before the dynamic content. Even a single token change in the prefix invalidates the cache for everything after it.
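A cheap way to enforce that stability is to fingerprint the static portion once at startup and verify it hasn't drifted before each call. A minimal sketch — none of these function names come from the OpenAI SDK, and the system prompt here is a placeholder:

```python
import hashlib
import json

# Drift detector: hash the serialized system messages once at startup,
# then confirm the prefix is byte-identical before each request.
def prefix_fingerprint(messages: list[dict]) -> str:
    static = [m for m in messages if m.get("role") == "system"]
    payload = json.dumps(static, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()

# Computed once from the canonical prompt at startup
EXPECTED = prefix_fingerprint([{"role": "system", "content": "You are..."}])

def check_prefix(messages: list[dict]) -> bool:
    """True if the static prefix matches the canonical version."""
    return prefix_fingerprint(messages) == EXPECTED
```

Log or alert on a mismatch rather than silently eating cache misses; a single refactored template can quietly zero out your hit rate.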
Response Caching on Top: The Extra Layer Worth Adding
Prompt caching doesn’t help when users ask identical questions. Response caching does. The idea: store the full LLM response keyed to the input, and return it directly without hitting the API at all.
This is high-impact for customer-facing products where many users ask the same things. A simple Redis-backed implementation:
import hashlib
import json

import anthropic
import redis

redis_client = redis.Redis(host='localhost', port=6379, db=0)
anthropic_client = anthropic.Anthropic()

def get_cache_key(system_prompt: str, user_message: str) -> str:
    """Create a deterministic cache key from the inputs."""
    payload = json.dumps({
        "system": system_prompt,
        "user": user_message
    }, sort_keys=True)
    return f"llm:response:{hashlib.sha256(payload.encode()).hexdigest()}"

def cached_llm_call(
    system_prompt: str,
    user_message: str,
    ttl_seconds: int = 3600,  # 1 hour default
    model: str = "claude-3-5-haiku-20241022"
) -> dict:
    cache_key = get_cache_key(system_prompt, user_message)

    # Try cache first
    cached = redis_client.get(cache_key)
    if cached:
        result = json.loads(cached)
        result["cache_hit"] = True
        return result

    # Cache miss — call the API
    response = anthropic_client.messages.create(
        model=model,
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}  # Still use prompt caching
        }],
        messages=[{"role": "user", "content": user_message}]
    )

    result = {
        "content": response.content[0].text,
        "usage": {
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "cache_creation_tokens": getattr(response.usage, 'cache_creation_input_tokens', 0),
            "cache_read_tokens": getattr(response.usage, 'cache_read_input_tokens', 0),
        },
        "cache_hit": False
    }

    # Store in Redis with TTL
    redis_client.setex(cache_key, ttl_seconds, json.dumps(result))
    return result
You can extend this with semantic caching — using embeddings to find “close enough” previous responses instead of exact matches. Libraries like GPTCache do this out of the box, though I’ve found the exact-match approach is usually enough for structured use cases like classification, extraction, and FAQ answering.
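If you do want the semantic variant without pulling in a library, the core lookup is small. A sketch, assuming you supply a real embedding function (the embed callable below is a placeholder) and tune the similarity threshold to your domain:

```python
import math

# Core semantic-cache lookup: store (embedding, response) pairs and
# return a stored response when a new query's embedding is close enough.
# The 0.95 default threshold is a starting point to tune, not a rule.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed          # callable: str -> list[float]
        self.threshold = threshold  # "close enough" cutoff
        self.entries = []           # (vector, response) pairs

    def get(self, query: str):
        qv = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best is not None and cosine(qv, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))
```

The linear scan is fine for a few thousand entries; beyond that you'd want a vector index, at which point a dedicated library earns its keep.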
Measuring Your Actual Savings
Don’t guess — track your cache hit rate in production. Here’s a minimal logging wrapper that gives you the data you need:
from dataclasses import dataclass

@dataclass
class CacheStats:
    total_requests: int = 0
    response_cache_hits: int = 0
    prompt_cache_hits: int = 0
    total_input_tokens: int = 0
    cached_input_tokens: int = 0
    estimated_savings_usd: float = 0.0

    # Claude 3.5 Sonnet pricing per token
    BASE_INPUT_PRICE = 3.0 / 1_000_000   # $3 per MTok
    CACHE_READ_PRICE = 0.30 / 1_000_000  # $0.30 per MTok

    def log_api_call(self, usage):
        self.total_requests += 1
        self.total_input_tokens += usage.input_tokens
        cache_read = getattr(usage, 'cache_read_input_tokens', 0)
        if cache_read:
            self.prompt_cache_hits += 1
            self.cached_input_tokens += cache_read
            # Savings = what we would have paid minus what we did pay
            saved = cache_read * (self.BASE_INPUT_PRICE - self.CACHE_READ_PRICE)
            self.estimated_savings_usd += saved

    def report(self):
        hit_rate = self.prompt_cache_hits / max(self.total_requests, 1) * 100
        print(f"Prompt cache hit rate: {hit_rate:.1f}%")
        print(f"Estimated savings: ${self.estimated_savings_usd:.4f}")
        print(f"Cached token ratio: {self.cached_input_tokens}/{self.total_input_tokens}")
Common Mistakes That Destroy Your Cache Hit Rate
After running this in production across several projects, here are the failure modes I see most often:
- Timestamps or request IDs in the system prompt. If you inject Current date: {datetime.now()} into your static prefix, you’ll get zero cache hits. Move dynamic context to the user message.
- Inconsistent whitespace or formatting. Caching is byte-exact on the prefix. A trailing space or different newline character breaks it. Lock your templates down.
- Cache TTL mismatches with request patterns. If your requests are spaced more than 5 minutes apart, you’ll never hit the cache. Batch processing during off-hours? You might need response caching instead.
- Caching blocks under the minimum token threshold. Check your block sizes before assuming caching is active.
- Not pinning the cache_control to the right boundary. On Anthropic, the cache covers everything up to and including the marked block. Put the marker at the end of your static content, not the beginning.
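One pattern that rules out the timestamp mistake by construction: freeze the system prompt as a module-level constant (never interpolated) and route anything time- or request-dependent into the user turn. A minimal sketch with illustrative names:

```python
from datetime import datetime, timezone

# The cached prefix is a frozen constant with no interpolation slots.
SYSTEM_PROMPT = "You are a contract analyst. Follow the house style guide."

def build_messages(user_input: str) -> list[dict]:
    now = datetime.now(timezone.utc).isoformat()
    return [{
        # dynamic context lives here, so the cached prefix stays stable
        "role": "user",
        "content": f"Current date: {now}\n\n{user_input}",
    }]
```

With this split, SYSTEM_PROMPT is byte-identical on every request, and the model still sees the current date on every call.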
Which Strategy to Use and When
Here’s my honest recommendation based on use case:
- Solo founder with a Claude-based product: Start with Anthropic’s prompt caching. Implement it in an afternoon, get 50-90% savings on your system prompt tokens within days. Add response caching with Redis once you have real traffic data.
- Team using OpenAI: Automatic caching is already on — audit your prompt structure to make sure your prefix is stable. The 50% discount is meaningful at scale even without explicit control.
- High-volume automation (n8n/Make workflows): Response caching is your biggest win here. Many automation workflows repeat identical inputs. A simple Redis layer in front of your LLM node can eliminate 60%+ of API calls entirely.
- RAG pipelines with large context: Use prompt caching on your static instructions and few-shot examples. The retrieved chunks change per query, but your system config doesn’t — cache what you can.
Stacking all three strategies — prompt caching, response caching, and disciplined prompt structure — is how you realistically hit 40-50% total cost reduction. Each layer is independent; implement them in order of effort, starting with Anthropic prompt caching, since it adds only a few lines of code for potentially massive savings.
The math on prompt caching is too good to ignore. If you’re sending a 3,000-token system prompt a thousand times a day on Claude Sonnet, that’s $9/day in system prompt tokens alone. With caching active and a reasonable hit rate, you’re paying under $1/day for the same workload. That’s before you’ve touched your architecture or optimized a single thing.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.