If you’re running 10,000 document classifications, summarizations, or extractions through Claude’s standard API, you’re leaving roughly 50% of your budget on the table. Claude batch API processing exists specifically for this use case — high-volume, latency-tolerant workloads where you need throughput over speed. Anthropic processes these requests asynchronously and passes the compute savings directly to you: half the price of synchronous calls, same model quality.
This article covers the Batch API architecture end-to-end — how requests are structured, how results come back, how to handle errors at scale, and what the real cost math looks like across realistic workloads. Everything here is based on actual implementation experience, not the docs.
What the Batch API Actually Does (and Doesn’t Do)
The Batch API is not a queue wrapper around the standard Messages API. It’s a separate endpoint that accepts up to 10,000 requests in a single submission, processes them within a 24-hour window, and returns results as a downloadable JSONL stream. You submit once, poll or wait for a webhook, then download and parse the output.
What you give up: real-time responses. If your workflow needs a reply within seconds — a customer is waiting, a webhook is expecting a response — the Batch API is the wrong tool. Use it for anything where you can afford to wait hours: nightly enrichment pipelines, document preprocessing, classification at scale, bulk content generation.
What you gain: 50% cost reduction on both input and output tokens, no rate-limit pressure during the run (Anthropic handles the scheduling), and a cleaner architecture for async workloads. At Claude 3 Haiku pricing (~$0.00025/1K input tokens synchronous), batch brings that to roughly $0.000125/1K. On a 10,000-document run averaging 500 input tokens each, that’s the difference between ~$1.25 and ~$0.63 just for input. The savings compound fast on output-heavy tasks.
Supported Models
As of mid-2025, the Batch API supports all current Claude models, including Claude 3.5 Haiku, Claude 3.5 Sonnet, and Claude 3 Opus. For most bulk document tasks, Haiku is the practical default — it’s fast enough, accurate enough for classification/extraction, and the cheapest. Save Sonnet for tasks requiring nuanced reasoning or complex structured output. Opus batch processing is available, but the cost-per-token math rarely justifies it unless the task demands it.
Structuring Your Batch Requests
Each item in a batch is a JSON object with two keys: a custom_id (your reference key, unique within the batch) and a params object containing the standard Messages API payload — the same model, max_tokens, and messages fields you’d send to /v1/messages synchronously.
```python
import anthropic
import json

client = anthropic.Anthropic()

def build_batch_requests(documents: list[dict]) -> list[dict]:
    """
    Build batch request objects from a list of documents.
    Each document has 'id' and 'content' keys.
    """
    requests = []
    for doc in documents:
        requests.append({
            "custom_id": f"doc_{doc['id']}",  # must be unique within batch
            "params": {
                "model": "claude-3-5-haiku-latest",
                "max_tokens": 256,
                "messages": [
                    {
                        "role": "user",
                        "content": f"""Classify the following document into exactly one category:
Invoice, Contract, Support Ticket, or Other.
Respond with JSON only: {{"category": "...", "confidence": 0.0-1.0}}

Document:
{doc['content'][:2000]}"""  # truncate to control token cost
                    }
                ]
            }
        })
    return requests

# Load your documents (from DB, S3, whatever)
documents = [{"id": i, "content": f"Document content {i}"} for i in range(1000)]

# Submit the batch
batch = client.beta.messages.batches.create(requests=build_batch_requests(documents))
print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")
```
A few things worth noting here. The custom_id field is your mapping key between input and output — make it meaningful. I use doc_{primary_key} so I can directly join results back to my database without a lookup table. Also: truncate your input content. If you’re classifying documents, you usually don’t need the full text — the first 2000 characters handle 90% of classification cases and cut your token bill significantly.
Polling and Retrieval: Don’t Spin a Loop
The Batch API is async, so you need a polling or notification strategy. Anthropic provides status via the batch object’s processing_status field, which transitions through in_progress → ended.
```python
import time

def wait_for_batch(client: anthropic.Anthropic, batch_id: str, poll_interval: int = 60):
    """
    Poll until batch completes. Returns the final batch object.
    poll_interval in seconds — 60s is reasonable, don't go lower.
    """
    while True:
        batch = client.beta.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            print(f"Batch complete. "
                  f"Succeeded: {batch.request_counts.succeeded}, "
                  f"Errored: {batch.request_counts.errored}, "
                  f"Expired: {batch.request_counts.expired}")
            return batch
        print(f"Still processing... ({batch.request_counts.processing} remaining)")
        time.sleep(poll_interval)

def parse_batch_results(client: anthropic.Anthropic, batch_id: str) -> dict:
    """
    Stream results and build a custom_id → parsed_output mapping.
    """
    results = {}
    for result in client.beta.messages.batches.results(batch_id):
        if result.result.type == "succeeded":
            # Extract text from the response
            content = result.result.message.content[0].text
            try:
                parsed = json.loads(content)
                results[result.custom_id] = {"status": "ok", "data": parsed}
            except json.JSONDecodeError:
                # Model didn't return valid JSON — handle gracefully
                results[result.custom_id] = {"status": "parse_error", "raw": content}
        elif result.result.type == "errored":
            results[result.custom_id] = {
                "status": "error",
                "error": result.result.error.type
            }
    return results
```
For production, don’t run this polling loop in-process. Schedule it as a cron job or use a lightweight queue. Scheduling batch jobs with cron on Linux covers the infrastructure side of this cleanly if you want a repeatable pattern.
Error Recovery at Scale
At 10,000 requests, a 1% error rate means 100 failed items. You need a recovery strategy built in from the start — not bolted on after.
The Batch API returns three result types per item: succeeded, errored, and expired. Errored items typically hit rate limits or had malformed requests. Expired items didn’t get processed within 24 hours (rare but possible if Anthropic is under load). Your retry logic needs to handle both cases differently.
```python
def extract_failures(results: dict) -> tuple[list, list]:
    """
    Separate hard failures (retry-able) from soft failures (needs inspection).
    Returns (retryable_ids, inspect_ids)
    """
    retryable = []
    inspect = []
    for custom_id, result in results.items():
        if result["status"] == "error":
            error_type = result.get("error", "")
            if error_type in ("overloaded_error", "api_error"):
                # These are transient — safe to retry
                retryable.append(custom_id)
            else:
                # invalid_request_error etc. — inspect before retrying
                inspect.append(custom_id)
        elif result["status"] == "parse_error":
            # Model returned non-JSON despite instructions
            inspect.append(custom_id)
    return retryable, inspect

def retry_failed_documents(client, original_documents: list, failed_ids: list):
    """
    Build a new batch with only the failed documents.
    Extract original doc ID from custom_id format "doc_{id}"
    """
    failed_doc_ids = {cid.replace("doc_", "") for cid in failed_ids}
    retry_docs = [d for d in original_documents if str(d['id']) in failed_doc_ids]
    if not retry_docs:
        return None
    print(f"Retrying {len(retry_docs)} documents...")
    return client.beta.messages.batches.create(
        requests=build_batch_requests(retry_docs)
    )
```
For parse errors specifically — where the model returns text instead of JSON — I’d recommend tightening your prompt with explicit JSON instructions and adding a response prefix. This ties into broader structured output patterns for Claude that are worth reading if you’re doing any extraction work at scale.
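One concrete pattern for this, sketched here under the assumption that you keep the classification prompt from earlier: prefill the assistant turn with an opening brace, so the model continues the JSON object instead of adding preamble text. The builder below is a hypothetical helper, not part of the Anthropic SDK.

```python
def build_strict_json_request(doc_id: str, content: str) -> dict:
    """Batch request that prefills the assistant turn to force JSON output."""
    return {
        "custom_id": f"doc_{doc_id}",
        "params": {
            "model": "claude-3-5-haiku-latest",
            "max_tokens": 256,
            "messages": [
                {
                    "role": "user",
                    "content": (
                        "Classify the document into exactly one category: "
                        "Invoice, Contract, Support Ticket, or Other. "
                        'Respond with JSON only: {"category": "...", "confidence": 0.0}\n\n'
                        f"Document:\n{content[:2000]}"
                    ),
                },
                # Prefill: the model's reply continues from this "{"
                {"role": "assistant", "content": "{"},
            ],
        },
    }
```

One gotcha with prefilling: the returned text starts after the brace, so prepend it before parsing, e.g. `json.loads("{" + response_text)`.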
Real Cost Math: When Batch API Pays for Itself
Let’s run actual numbers on a document classification pipeline — something like categorizing 50,000 support tickets per day.
- Model: Claude 3 Haiku
- Average input tokens per request: 400 (system prompt + truncated ticket)
- Average output tokens per request: 50 (JSON with category + confidence)
- Volume: 50,000 requests/day
Synchronous: (50,000 × 400 × $0.00025/1K) + (50,000 × 50 × $0.00125/1K) = $5.00 + $3.13 = $8.13/day
Batch API: same math at 50% discount = $4.07/day
That’s $1,480/year saved on a single pipeline. For larger models or output-heavy tasks, the gap widens. If you’re running invoice extraction with Sonnet at higher token counts, you’re looking at 10x larger absolute savings. Worth pairing this with our guide on managing LLM API costs at scale if you’re running multiple pipelines.
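To make that comparison repeatable across pipelines, the math above can be wrapped in a small helper. This is a sketch: the per-1K rates are the article's illustrative numbers, so verify them against current pricing before relying on the output.

```python
def daily_cost(requests: int, in_tokens: int, out_tokens: int,
               in_rate_per_1k: float, out_rate_per_1k: float,
               batch: bool = False) -> float:
    """Estimated daily spend in dollars; batch applies the 50% discount."""
    cost = (requests * in_tokens * in_rate_per_1k / 1000
            + requests * out_tokens * out_rate_per_1k / 1000)
    return cost * 0.5 if batch else cost

# The 50,000-ticket pipeline from above, at the article's Haiku rates
sync = daily_cost(50_000, 400, 50, 0.00025, 0.00125)                 # ≈ $8.13/day
batched = daily_cost(50_000, 400, 50, 0.00025, 0.00125, batch=True)
annual_savings = (sync - batched) * 365                              # ≈ $1,480/year
```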
Combining Batch API with Prompt Caching
The Batch API supports prompt caching — and stacking both discounts is where the real savings live. If you have a long system prompt (say, a 2,000-token classification rubric), cache it once and you’re paying ~10% of input token cost on that portion for subsequent requests. Combined with the 50% batch discount, your effective cost on the cached prefix drops to around 5% of standard pricing. This deep-dive on LLM prompt caching strategies walks through exactly how to set cache breakpoints in your prompts.
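Here is a minimal sketch of what a cache breakpoint looks like inside a batch request. CLASSIFICATION_RUBRIC is a placeholder for your own long prompt, and `build_cached_request` is an illustrative helper, not an SDK function.

```python
CLASSIFICATION_RUBRIC = "..."  # placeholder: your ~2,000-token rubric

def build_cached_request(doc_id: str, content: str) -> dict:
    """Batch request with a cache breakpoint on the shared system prompt."""
    return {
        "custom_id": f"doc_{doc_id}",
        "params": {
            "model": "claude-3-5-haiku-latest",
            "max_tokens": 256,
            # System prompt as a content block with a cache breakpoint:
            # the rubric is written to cache once, then read at the
            # discounted cache-hit rate on subsequent requests.
            "system": [
                {
                    "type": "text",
                    "text": CLASSIFICATION_RUBRIC,
                    "cache_control": {"type": "ephemeral"},
                }
            ],
            "messages": [
                {"role": "user", "content": content[:2000]}
            ],
        },
    }
```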
Production Architecture: Putting It Together
Here’s the pattern I’d use for a production pipeline processing 10,000+ documents per run:
- Ingestion layer: Documents land in a queue (SQS, Pub/Sub, Redis) as they arrive
- Batch builder: A cron job fires every hour, pulls up to 10,000 pending items, builds the batch request array, and submits to the Batch API
- Status tracker: Batch ID stored in your DB alongside a submitted_at timestamp and status field
- Completion checker: Another cron job polls every 15 minutes for submitted batches older than 30 minutes
- Result writer: On completion, streams results, writes to DB, marks documents as processed, queues any failures for retry
This keeps your main application completely decoupled from the Claude API. No synchronous API calls in the critical path, no rate limit headaches, predictable costs. If you’re deploying this on a serverless platform, check the breakdown of Vercel vs Replicate vs Beam for Claude agent deployments — the batch pattern changes which platform makes sense.
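The batch-builder step can be sketched roughly as follows. Everything here except the batches.create call is an assumption: `db` stands in for your own persistence layer, and `build_requests` is a function like the build_batch_requests shown earlier.

```python
BATCH_LIMIT = 10_000  # current per-batch request cap

def run_batch_builder(client, db, build_requests):
    """
    Hourly cron entry point: pull pending documents, submit at most one
    batch, and record the batch ID for the completion-checker job.

    `db` (with fetch_pending / record_batch / mark_submitted) is a
    placeholder for your persistence layer.
    """
    docs = db.fetch_pending(limit=BATCH_LIMIT)
    if not docs:
        return None  # nothing queued this hour
    batch = client.beta.messages.batches.create(requests=build_requests(docs))
    db.record_batch(batch.id, status="submitted")
    db.mark_submitted([d["id"] for d in docs], batch_id=batch.id)
    return batch.id
```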
What Breaks in Production
A few things the documentation doesn’t warn you about:
- The 24-hour expiry is real. If Anthropic is under heavy load, batches can approach this limit. Don’t assume same-day completion for time-sensitive pipelines.
- Results streaming is not resumable. If your result-parsing process crashes mid-stream, you need to re-stream from the start. Build idempotent result writers that skip already-processed items.
- No partial results during processing. You can’t peek at completed items while the batch is still running — it’s all or nothing.
- custom_id must be unique per batch, but not globally. If you reuse IDs across batches, make sure your result-joining logic includes the batch ID as a composite key.
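The idempotent-writer point deserves a concrete shape. A minimal sketch, with `db` as a hypothetical persistence layer that can report which custom_ids are already written:

```python
def write_results_idempotent(results: dict, db) -> int:
    """
    Write parsed results, skipping custom_ids already marked processed.
    Safe to re-run after a mid-stream crash: re-streaming from the start
    only re-writes items that never landed. `db` is a placeholder layer.
    """
    already_done = db.fetch_processed_ids()  # set of custom_ids
    written = 0
    for custom_id, payload in results.items():
        if custom_id in already_done:
            continue  # completed before the crash; skip
        db.write_result(custom_id, payload)
        written += 1
    return written
```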
When to Use This vs. Standard API
Use the Batch API when: latency doesn’t matter, volume is 100+ requests, you’re doing classification/extraction/summarization, or you’re processing historical data.
Stick with the standard API when: a user is waiting for a response, you’re building interactive agents, you need streaming output, or your workflow requires one request to inform the next (chained reasoning).
For teams running large-scale document workflows — think invoice and receipt processing or similar extraction pipelines — Claude batch API processing should be your default architecture, not an optimization you add later. The cost savings are structural, not marginal, and the async architecture is actually cleaner for these workloads anyway.
Solo founders processing a few hundred documents: the complexity isn’t worth it until you’re over ~500 requests/day. Below that, the standard API with sensible rate limiting is simpler and fast enough. Once you cross that threshold, the batch architecture pays off both in cost and in operational reliability — you’re no longer at the mercy of synchronous timeout handling and retry logic in hot paths.
Frequently Asked Questions
How long does the Claude Batch API take to process requests?
The processing window is 24 hours — anything not completed by then comes back as expired rather than processed late. In practice most batches complete in 1–4 hours depending on size and current load. Plan your pipelines to handle the full 24 hours, and don’t use the Batch API if you need results within minutes.
Can I cancel a batch after submitting it?
Yes — you can cancel an in-progress batch via the API using client.beta.messages.batches.cancel(batch_id). Results for already-processed requests within that batch are still retrievable. You won’t be charged for requests that hadn’t started processing when you cancelled.
What’s the maximum number of requests per batch?
The current limit is 10,000 requests per batch, with a maximum total of 32MB for the request file. If you have more than 10,000 documents, split them into multiple batches and submit concurrently — there’s no limit on how many batches you can have running simultaneously.
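Splitting a larger corpus into batch-sized chunks is a one-liner; a sketch:

```python
def chunk_documents(documents: list, batch_limit: int = 10_000) -> list[list]:
    """Split a document list into batch-sized chunks for concurrent submission."""
    return [documents[i:i + batch_limit]
            for i in range(0, len(documents), batch_limit)]

# 25,000 documents split into chunks of 10,000 / 10,000 / 5,000,
# each chunk then submitted as its own batch
chunks = chunk_documents([{"id": i} for i in range(25_000)])
```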
Does the Batch API support tool use and function calling?
Yes, the Batch API supports the same features as the standard Messages API including tool use, system prompts, and multi-turn messages. However, since the API is async, you can’t handle tool calls mid-batch — tool use only makes sense for single-turn patterns where the model’s tool output is the final deliverable.
Can I use prompt caching with the Batch API?
Yes, and it’s one of the highest-leverage optimizations available. Add "cache_control": {"type": "ephemeral"} to your system prompt or any large repeated context block. The cache persists across requests within and across batches for roughly 5 minutes of inactivity. Combined with the 50% batch discount, cached prefixes can cost as little as 5% of standard pricing.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

