Monday, April 6

Most teams burning through Claude API budget are making the same three mistakes: running every task through Sonnet when Haiku would do the job, sending the same massive system prompt fresh on every request, and processing documents one at a time when they could batch them overnight. If you want to reduce Claude API costs by 40–60% without touching your output quality, those are exactly the three levers to pull — and this article gives you the exact mechanics and code to do it.

This isn’t a surface-level “use a cheaper model” post. Prompt caching has a non-obvious TTL behavior that will catch you out in production. Batch processing has a 24-hour turnaround you need to architect around. And model selection is only smart if you’ve actually benchmarked accuracy on your specific tasks, not just assumed Haiku can’t keep up. Let’s get into it.

Understanding Where Claude API Costs Actually Come From

Before optimising anything, you need to know what you’re paying for. Claude pricing as of mid-2025 breaks down like this:

  • Claude Haiku 3.5: ~$0.80 / million input tokens, ~$4.00 / million output tokens
  • Claude Sonnet 3.7: ~$3.00 / million input tokens, ~$15.00 / million output tokens
  • Claude Opus 4: ~$15.00 / million input tokens, ~$75.00 / million output tokens

The output-to-input price ratio is the thing most people miss. Output tokens cost 5x more than input tokens at every tier. That means a pipeline generating verbose JSON responses is going to hurt significantly more than one with tight, structured outputs. Constraining output format — not just choosing a cheaper model — is often the fastest cost reduction you can make.

Typical production workloads are heavily input-heavy: a 2,000-token system prompt, 500 tokens of user context, 200 tokens of response. At those ratios, input token optimisation matters as much as model tier.
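The arithmetic is easy to sanity-check. A quick sketch using the prices from the table above (hard-coded here; verify against current rates before relying on them):

```python
# Per-million-token prices (USD) from the table above; check current rates.
PRICES = {
    "haiku-3.5":  {"input": 0.80, "output": 4.00},
    "sonnet-3.7": {"input": 3.00, "output": 15.00},
    "opus-4":     {"input": 15.00, "output": 75.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at standard (non-cached, non-batch) rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# The input-heavy workload described above: 2,000-token system prompt,
# 500 tokens of user context, 200 tokens of response.
cost = request_cost("sonnet-3.7", input_tokens=2_500, output_tokens=200)
print(f"${cost:.5f} per request")  # input tokens account for ~71% of the spend
```

Even with output priced at 5x, the 2,500-token input side dominates here — which is exactly why the caching and input-side techniques below move the needle.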

Prompt Caching: The 90% Cost Reduction Nobody Talks About Correctly

Anthropic’s prompt caching lets you cache a prefix of your input and pay only 10% of the normal input token cost on cache hits. That’s not a typo — cached tokens cost $0.08 / million on Haiku and $0.30 / million on Sonnet instead of the full rate. The write cost (first time you populate the cache) is 25% higher than normal, but it amortises fast.

How Cache Hits Actually Work

The cache is keyed on the exact token sequence up to your cache breakpoint. If you change even one token before the breakpoint, it’s a full cache miss. This catches teams out when they’re doing any of the following:

  • Injecting timestamps or request IDs into the system prompt
  • Personalising the system prompt per-user before a static block
  • Appending dynamic context before a large static document

The fix: put everything static first, everything dynamic last. Your 3,000-token system prompt + reference document goes at the top. The user’s specific question goes at the bottom. Cache breakpoints can be set at multiple positions, but the most common pattern is a single breakpoint after your static context block.

The TTL is 5 minutes of inactivity. After 5 minutes without a hit, the cache is evicted. For async workflows, this matters — if you’re processing a batch every 10 minutes, you’re paying full write cost every batch. The workaround is either to keep a warm-up call running, or to accept the write overhead and focus caching only on truly high-traffic routes.
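That batch-interval penalty is worth quantifying. A back-of-envelope sketch, normalising everything to the base input rate (writes at 1.25x, cache reads at 0.1x, one request per interval assumed):

```python
CACHE_TTL_S = 5 * 60  # cache evicted after 5 minutes of inactivity

def effective_input_cost_multiplier(request_interval_s: float) -> float:
    """Steady-state cost of the cached prefix per request, as a multiple
    of the base input rate. Writes cost 1.25x, cache reads cost 0.1x."""
    if request_interval_s >= CACHE_TTL_S:
        return 1.25  # cache is cold every time: you pay the write surcharge on each request
    return 0.10      # cache stays warm: every request after the first is a hit

print(effective_input_cost_multiplier(10 * 60))  # batch every 10 minutes
print(effective_input_cost_multiplier(60))       # one request per minute
```

The cliff is stark: a request cadence just over the TTL costs 12.5x more per cached-prefix token than one just under it.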

import anthropic

client = anthropic.Anthropic()

# Large static system prompt + reference document
STATIC_SYSTEM_CONTENT = """You are a contract analyst specialising in SaaS agreements.
[... 2,500 tokens of instructions and reference clauses ...]"""

def analyse_contract_clause(user_clause: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": STATIC_SYSTEM_CONTENT,
                "cache_control": {"type": "ephemeral"}  # marks cache breakpoint
            }
        ],
        messages=[
            {
                "role": "user",
                "content": user_clause  # dynamic content goes here — after the cached prefix
            }
        ]
    )
    
    # Check cache stats in response metadata
    usage = response.usage
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
    
    return response.content[0].text

On a real contract analysis pipeline processing 500 clauses/day with a 2,500-token system prompt: without caching, that’s 1.25M input tokens/day at Haiku rates (~$1.00/day). With caching, after the first write, you’re paying $0.08/M for 1.24M cached tokens (~$0.10/day). That’s a 90% reduction on the input side — and input tokens dominate this workload.
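For sporadic traffic, the keep-warm option mentioned earlier can be sketched as a background loop that re-touches the cached prefix before the 5-minute TTL expires. This is a scheduling sketch only — `ping` is a stand-in for a minimal `analyse_contract_clause`-style call with `max_tokens=1`, and it's only worth running if the ping cost is below the cache re-write cost you'd otherwise pay:

```python
import threading

def keep_cache_warm(ping, interval_s: float, stop: threading.Event) -> int:
    """Call `ping` every `interval_s` seconds until `stop` is set.
    Returns the number of pings issued. Keep interval_s comfortably under
    the 5-minute cache TTL (e.g. 240s) so the prefix never goes cold."""
    pings = 0
    while not stop.wait(interval_s):  # wait() returns True once stop is set
        ping()
        pings += 1
    return pings
```

In production you'd start this in a daemon thread at service startup and set the event on shutdown; the same `stop.wait()` call doubles as both the sleep and the exit signal.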

Batch Processing: Trading Latency for 50% Off

The Claude Batch API gives you a flat 50% discount on both input and output tokens. The trade-off: results are returned asynchronously, up to 24 hours later (usually much faster — often 1–4 hours in practice, but don’t count on it for SLAs).

This makes batch processing ideal for:

  • Document processing pipelines (invoices, contracts, reports)
  • Nightly content enrichment or categorisation
  • Bulk evaluation runs
  • Any workflow where the user doesn’t need a real-time response

We’ve written a more detailed breakdown of architecting high-volume pipelines in our guide to batch processing workflows with Claude API, but here’s the core implementation pattern:

import anthropic
import json
import time

client = anthropic.Anthropic()

def submit_batch(documents: list[dict]) -> str:
    """Submit a batch of documents for async processing."""
    requests = [
        {
            "custom_id": f"doc_{i}",  # used to match results back to inputs
            "params": {
                "model": "claude-3-5-haiku-latest",
                "max_tokens": 256,
                "messages": [
                    {
                        "role": "user",
                        "content": f"Classify this support ticket into one of [billing, technical, account, other]. Return JSON only.\n\n{doc['text']}"
                    }
                ]
            }
        }
        for i, doc in enumerate(documents)
    ]
    
    # The Message Batches API has left beta; recent SDKs expose it at client.messages.batches
    batch = client.messages.batches.create(requests=requests)
    return batch.id

def poll_batch(batch_id: str, max_wait_seconds: int = 3600) -> list[dict]:
    """Poll until batch completes, return results."""
    start = time.time()
    while time.time() - start < max_wait_seconds:
        batch = client.messages.batches.retrieve(batch_id)

        if batch.processing_status == "ended":
            results = []
            for result in client.messages.batches.results(batch_id):
                if result.result.type == "succeeded":
                    results.append({
                        "id": result.custom_id,
                        "output": result.result.message.content[0].text
                    })
            return results
        
        time.sleep(60)  # check every minute — don't hammer the API
    
    raise TimeoutError(f"Batch {batch_id} did not complete in time")

For a pipeline processing 10,000 support tickets/day at ~300 tokens input, ~50 tokens output: at standard Haiku pricing that’s about $4.40/day (3M input tokens at $0.80/M plus 0.5M output tokens at $4.00/M). With batch pricing, $2.20/day. Over a year, that’s ~$800 saved on a single pipeline. Stack that across multiple workloads and it compounds quickly.

One production gotcha: the 24-hour window means your error handling needs to account for expired batches. Build a reconciliation step that identifies unprocessed custom_ids and resubmits them. See our patterns for LLM fallback and retry logic for how to structure this robustly.
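The reconciliation step is pure bookkeeping: diff the `custom_id`s that came back succeeded against what was submitted, and resubmit the difference. A minimal sketch (`resubmit_fn` is a stand-in for your `submit_batch` wrapper):

```python
def find_unprocessed(submitted_ids: list[str], results: list[dict]) -> list[str]:
    """Return custom_ids that were submitted but never came back succeeded."""
    completed = {r["id"] for r in results}
    return [cid for cid in submitted_ids if cid not in completed]

def reconcile(submitted_ids, results, documents_by_id, resubmit_fn):
    """Resubmit any documents whose batch requests expired or errored."""
    missing = find_unprocessed(submitted_ids, results)
    if missing:
        resubmit_fn([documents_by_id[cid] for cid in missing])
    return missing

# Example: doc_1 never came back, so it gets flagged for resubmission
missing = find_unprocessed(
    ["doc_0", "doc_1", "doc_2"],
    [{"id": "doc_0", "output": "..."}, {"id": "doc_2", "output": "..."}],
)
print(missing)  # ['doc_1']
```

Run this after every `poll_batch` pass, and cap the number of resubmission rounds so a permanently failing document can't loop forever.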

Model Selection Strategy: Stop Using Sonnet for Everything

The default mistake is treating Claude Sonnet as your go-to for every task because it’s “safe.” Sonnet costs 3.75x more per input token than Haiku. If Haiku achieves acceptable accuracy on your task, you’re paying a 275% premium for no benefit.

What Haiku Actually Handles Well

Haiku 3.5 is genuinely capable for:

  • Classification and routing (sentiment, category, intent)
  • Structured data extraction from well-formatted documents
  • Simple summarisation (short docs, bullet points)
  • Template filling and format transformation
  • First-pass filtering before escalation to a more capable model

Where Haiku struggles: nuanced reasoning over ambiguous inputs, multi-step logical chains, complex code generation, and tasks requiring deep context synthesis across a long document. For those, Sonnet earns its price. If you’re building code generation workflows, the Claude vs GPT-4 code generation benchmarks are worth reviewing before committing to a model tier.

The Two-Stage Routing Pattern

The pattern I’d actually deploy in production: use Haiku for a confidence-scored first pass, escalate to Sonnet only when the Haiku response signals uncertainty.

import anthropic
import json

client = anthropic.Anthropic()

CLASSIFICATION_PROMPT = """Classify this customer message. Return JSON with:
- category: one of [billing, technical, account, general]
- confidence: 0.0-1.0
- escalate: true if complex or high-stakes

Message: {message}"""

def classify_with_routing(message: str) -> dict:
    # First pass: Haiku (~$0.0004 per call at typical lengths)
    haiku_response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=128,
        messages=[{
            "role": "user", 
            "content": CLASSIFICATION_PROMPT.format(message=message)
        }]
    )
    
    result = json.loads(haiku_response.content[0].text)
    
    # Only escalate if Haiku flags low confidence or explicit escalation
    if result["confidence"] < 0.75 or result["escalate"]:
        sonnet_response = client.messages.create(
            model="claude-3-7-sonnet-latest",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": CLASSIFICATION_PROMPT.format(message=message)
            }]
        )
        result = json.loads(sonnet_response.content[0].text)
        result["model_used"] = "sonnet"
    else:
        result["model_used"] = "haiku"
    
    return result

In a real customer support triage workflow, roughly 70–80% of tickets are straightforward enough for Haiku to handle with high confidence. That means you’re paying Sonnet rates only 20–30% of the time, reducing your effective cost per request by 50–60% compared to Sonnet-for-everything.
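The blended cost of that routing layer is easy to model: every request pays the Haiku first pass, and only escalations also pay Sonnet. A sketch (the per-call dollar figures are illustrative placeholders, not measured values):

```python
def blended_cost(haiku_cost: float, sonnet_cost: float, escalation_rate: float) -> float:
    """Expected cost per request for the two-stage router."""
    return haiku_cost + escalation_rate * sonnet_cost

haiku, sonnet = 0.0004, 0.0020  # assumed per-call costs at typical lengths
for p in (0.2, 0.3):
    routed = blended_cost(haiku, sonnet, p)
    print(f"escalate {p:.0%}: ${routed:.4f}/req vs ${sonnet:.4f} Sonnet-only "
          f"({1 - routed / sonnet:.0%} cheaper)")
```

At a 20% escalation rate the router comes out 60% cheaper than Sonnet-for-everything; at 30%, 50% cheaper — which is where the 50–60% figure above comes from.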

Combining All Three: A Real Cost Reduction Stack

Here’s what a fully optimised pipeline looks like with all three techniques applied simultaneously, using a document processing use case as an example:

  • Model tier: Haiku for classification and extraction, Sonnet only for complex reasoning or escalations
  • Prompt caching: System prompt + schema definition cached (saving 90% on those tokens for high-frequency routes)
  • Batch processing: Non-urgent document queue processed overnight (50% discount across the board)

For a pipeline processing 5,000 invoices/day with a 1,500-token system prompt and ~400 token responses:

  • Baseline (Sonnet, no cache, no batch): ~$47/day
  • Haiku only: ~$10/day (78% reduction)
  • Haiku + caching: ~$4/day (60% additional reduction)
  • Haiku + caching + batch: ~$2/day (50% additional reduction)

That’s roughly a 95% total cost reduction versus the naive implementation. In practice, you’ll have some escalations to Sonnet and not every request will be batchable, so real-world savings tend to land at 60–80%. Still significant.

For teams building more complex agent architectures where cost compounds across multiple calls per user session, it’s also worth auditing your output verbosity. Constraining JSON schema, removing unnecessary fields, and adding max_tokens limits appropriate to each task is free money. If you’re seeing hallucinated or overly verbose outputs, that’s often a sign your prompts need tightening — the structured output and verification patterns guide covers this territory well.

What the Documentation Gets Wrong (Misconceptions Worth Addressing)

Misconception 1: “Caching works automatically.” It doesn’t. You have to explicitly mark cache breakpoints using cache_control. Without it, every request is billed at full rate regardless of repeated content.

Misconception 2: “Haiku is for simple tasks only.” With a well-structured system prompt and clear output schema, Haiku handles a surprising breadth of extraction and classification tasks accurately. The failure mode is usually prompt quality, not model capability. Good system prompt design matters enormously — see our system prompts best practices guide for what actually moves the needle.

Misconception 3: “Batch API is only for huge workloads.” Even 100 documents/day benefits from batch pricing if latency isn’t a constraint. The setup overhead is minimal once you have the polling logic in place, and 50% off is 50% off regardless of scale.

Frequently Asked Questions

How long does Claude prompt cache actually last?

The cache TTL is 5 minutes of inactivity. If no requests hit the cached prefix within 5 minutes, the cache is evicted and the next request pays the full write cost again (which is 1.25x the normal input rate). For high-traffic routes this is fine, but for sporadic batch workloads you need to factor in frequent re-writes.

Can I use prompt caching and batch processing together?

Yes, and this is actually the optimal combination for high-volume, latency-tolerant workloads. The 50% batch discount applies to all token types including cached token writes. You still need to structure your prompts correctly for caching (static content first, cache_control breakpoints set), but both discounts stack.

How do I know if Claude Haiku is accurate enough for my use case?

Run an evaluation on 200–500 representative samples from your actual data, comparing Haiku output against Sonnet output (use Sonnet as the reference baseline). Measure accuracy on your specific success criteria, not generic benchmarks. If Haiku accuracy is within 5% of Sonnet for your task, use Haiku. If the gap is larger, check whether better prompting closes it before paying for the upgrade.
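That comparison is only a few lines of code once you have paired outputs. A sketch of the decision rule, assuming you've already collected both models' outputs on the same eval set alongside ground-truth labels:

```python
def accuracy(outputs: list[str], expected: list[str]) -> float:
    """Fraction of outputs that exactly match the expected labels."""
    return sum(o == e for o, e in zip(outputs, expected)) / len(expected)

def pick_model_tier(haiku_outputs, sonnet_outputs, expected, max_gap: float = 0.05) -> str:
    """Choose Haiku if its accuracy is within `max_gap` of Sonnet's."""
    gap = accuracy(sonnet_outputs, expected) - accuracy(haiku_outputs, expected)
    return "haiku" if gap <= max_gap else "sonnet"
```

Exact-match accuracy fits classification and extraction tasks; for free-text outputs you'd swap in a task-appropriate scoring function, but the gap-threshold decision rule stays the same.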

What’s the minimum token count where prompt caching makes sense?

Anthropic requires a minimum cacheable prefix of 1,024 tokens on Sonnet and Opus (2,048 on Haiku) for a cache breakpoint to take effect; below that, the breakpoint is ignored. Above the minimum, caching pays off after a single cache hit: the 1.25x write surcharge costs you 0.25x extra once, while each hit saves 0.9x. For system prompts of 2,000+ tokens with any reasonable request volume, caching almost always pays.

How long does Claude Batch API processing actually take?

Anthropic’s stated processing window is up to 24 hours, but in practice most batches complete in 1–6 hours depending on size and platform load. Don’t build anything that needs a guaranteed turnaround faster than 24 hours on the Batch API — treat it as eventually-consistent infrastructure, not a real-time API with a lag.

Bottom Line: Match Strategy to Your Situation

If you’re a solo founder or small team on a tight budget: start with model routing. Get Haiku working for 70%+ of your traffic first — that alone typically cuts your bill in half. Add caching to your system prompts once you’ve stabilised prompt content. Batch comes last because it requires workflow changes.

If you’re a growth-stage team with predictable document or data processing workloads: batch processing and caching should both be in production. The engineering cost is one-time; the savings compound daily. A two-stage model routing layer on top completes the picture.

If you’re enterprise-scale: all three techniques apply, but spend time on measurement first. Build token usage logging by model tier and task type before optimising — you need to know where spend is concentrated before you know where to focus. Observability tooling like Helicone or LangSmith will pay for itself in the first week.

The practical ceiling for reducing Claude API costs through these techniques is around 70–85% versus a naive baseline, depending on your workload characteristics. That’s not theoretical — teams running high-volume extraction and classification pipelines are achieving this today. The strategies above are production-tested and the code runs.


Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
