If you’re running extraction pipelines, content classification, or document analysis at scale, you’ve probably already felt the pain: standard API calls get expensive fast, rate limits cause headaches, and managing thousands of concurrent requests turns into its own engineering problem. Claude batch API processing sidesteps most of this by letting you submit large jobs asynchronously and get results back within 24 hours — at exactly 50% of standard API pricing. For workloads that don’t need real-time responses, this is one of the most practical cost optimizations available right now.
This article walks through a complete implementation: structuring your batch jobs, submitting them correctly, polling for results, handling failures, and calculating real costs. The examples use Anthropic’s Message Batches API with claude-3-5-haiku and claude-sonnet-4-5, but the architecture applies across model tiers.
What the Batch API Actually Does (and Doesn’t Do)
The Anthropic Message Batches API accepts up to 10,000 requests in a single batch, processes them asynchronously, and returns results within 24 hours. In practice, most jobs complete in 1–4 hours depending on load. You get the same model quality as synchronous calls — this isn’t a degraded endpoint, just a different delivery mechanism.
What you give up: latency. If your use case requires responses in under a few seconds, the batch API is the wrong tool. But for overnight document processing, weekly classification runs, bulk content generation, or any pipeline where you can afford to wait, it’s a straightforward win.
Current pricing at time of writing
- claude-3-5-haiku: $0.0004 input / $0.002 output per 1K tokens (standard), batch cuts this to $0.0002 / $0.001
- claude-sonnet-4-5: $0.003 input / $0.015 output per 1K tokens (standard), batch cuts this to $0.0015 / $0.0075
- claude-opus-4: $0.015 input / $0.075 output (standard), batch at $0.0075 / $0.0375
Run 10,000 documents through Haiku for extraction — say 500 tokens in, 200 tokens out per doc — and you’re looking at roughly $6 standard vs $3 batch. Swap that for Sonnet and it’s $45 vs $22.50. The savings compound quickly on real workloads.
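That arithmetic is worth scripting before you commit to a run. Here’s a small helper that reproduces the figures above — a sketch with prices passed in rather than hardcoded, since rates change and you should plug in whatever is current:

```python
def estimate_batch_cost(
    n_docs: int,
    in_tokens: int,
    out_tokens: int,
    in_per_1k: float,
    out_per_1k: float,
    batch_discount: float = 0.5,
) -> dict:
    """Estimate standard vs. batch cost for a uniform workload.
    Prices are standard per-1K-token rates; batch applies the discount."""
    standard = n_docs * (in_tokens / 1000 * in_per_1k + out_tokens / 1000 * out_per_1k)
    return {"standard": round(standard, 2), "batch": round(standard * batch_discount, 2)}

# 10K docs through Haiku at the rates listed above
print(estimate_batch_cost(10_000, 500, 200, 0.0004, 0.002))
# → {'standard': 6.0, 'batch': 3.0}
```

Swapping in the Sonnet rates gives the $45 / $22.50 split from above; run it against your own token averages before trusting any back-of-envelope number.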
Structuring Your Batch Job: The Request Format
Each batch is a list of request objects. Every object needs a unique custom_id (your identifier for matching results later), plus a standard messages payload. Here’s the core structure:
import anthropic
import json

client = anthropic.Anthropic(api_key="your-api-key")

def build_batch_requests(documents: list[dict]) -> list[dict]:
    """
    documents: list of {"id": str, "text": str}
    Returns batch-formatted request list
    """
    requests = []
    for doc in documents:
        requests.append({
            "custom_id": doc["id"],  # must be unique within the batch
            "params": {
                "model": "claude-3-5-haiku-latest",
                "max_tokens": 300,
                "messages": [
                    {
                        "role": "user",
                        "content": f"""Extract the following from this document and return as JSON:
- main_topic (string)
- sentiment (positive/negative/neutral)
- key_entities (list of up to 5 named entities)

Document:
{doc["text"]}

Return only valid JSON, no explanation."""
                    }
                ]
            }
        })
    return requests
A few things worth noting: custom_id can be up to 64 characters and may contain only alphanumeric characters, hyphens, and underscores. If you use database IDs, you might need to sanitize them. The params object accepts the same fields as a standard Messages API call — system prompts, temperature, tool use — everything works.
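A sanitizer like the following keeps arbitrary database IDs inside the allowed charset (the function name is mine, not part of the SDK):

```python
import re

def sanitize_custom_id(raw_id: str, max_len: int = 64) -> str:
    """Replace anything outside [A-Za-z0-9_-] with an underscore,
    then truncate to the 64-character limit."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "_", raw_id)
    return cleaned[:max_len]
```

Note that sanitizing can collide two distinct IDs (e.g. `doc.1` and `doc:1` both become `doc_1`); if that’s possible in your data, append a short hash of the raw ID before truncating.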
Submitting and Monitoring Your Batch
Submission is a single API call. The tricky part is what comes after: you need to poll for completion, handle partial failures, and parse the results file correctly.
import time

def submit_batch(requests: list[dict]) -> str:
    """Submit batch and return batch ID"""
    batch = client.messages.batches.create(requests=requests)
    print(f"Batch submitted: {batch.id}")
    print(f"Status: {batch.processing_status}")
    return batch.id

def poll_until_complete(batch_id: str, poll_interval: int = 60):
    """
    Poll batch status until it ends; returns the final batch object.
    poll_interval: seconds between checks (60s is reasonable; don't hammer the API)
    """
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        status = batch.processing_status
        counts = batch.request_counts
        print(
            f"Status: {status} | "
            f"Processing: {counts.processing} | "
            f"Succeeded: {counts.succeeded} | "
            f"Errored: {counts.errored}"
        )
        if status == "ended":
            return batch
        time.sleep(poll_interval)
The processing_status field moves through in_progress → ended. There’s no “completed successfully” vs “completed with errors” distinction at the top level — you get ended either way, and the error detail lives in the individual results. This catches people out the first time.
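You can derive the distinction yourself from request_counts. A sketch, shown on a plain dict of counts for clarity — the SDK returns an object carrying the same fields as attributes:

```python
def summarize_batch_outcome(counts: dict) -> str:
    """Classify a batch from its request counts, e.g.
    {"processing": 0, "succeeded": 9950, "errored": 50, "canceled": 0, "expired": 0}.
    The API reports 'ended' either way; derive your own success signal."""
    if counts.get("processing", 0) > 0:
        return "still_processing"
    if counts.get("errored", 0) or counts.get("canceled", 0) or counts.get("expired", 0):
        return "ended_with_failures"
    return "ended_clean"
```

Anything other than `ended_clean` means you should walk the individual results to find out which custom_ids need a retry.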
Parsing results and handling failures
Results stream back as JSONL. Each line is one result object containing your custom_id and either a successful response or an error. Always process the stream — don’t try to load the entire results payload into memory for large batches.
def process_batch_results(batch_id: str) -> tuple[list, list]:
    """
    Returns (successes, failures)
    successes: list of {"id": str, "result": dict}
    failures: list of {"id": str, "error": str}
    """
    successes = []
    failures = []

    # Stream results — more memory-efficient than loading all at once
    for result in client.messages.batches.results(batch_id):
        custom_id = result.custom_id
        if result.result.type == "succeeded":
            # Extract text content from the response
            content = result.result.message.content[0].text
            try:
                parsed = json.loads(content)
                successes.append({"id": custom_id, "result": parsed})
            except json.JSONDecodeError:
                # Model returned something that isn't valid JSON
                # Log it but don't crash the whole result set
                failures.append({
                    "id": custom_id,
                    "error": f"JSON parse error: {content[:200]}"
                })
        elif result.result.type == "errored":
            error = result.result.error
            failures.append({
                "id": custom_id,
                "error": f"{error.type}: {error.message}"
            })
        else:
            # "canceled" and "expired" results land here
            failures.append({"id": custom_id, "error": result.result.type})

    return successes, failures
The most common error type you’ll see is overloaded_error — the model was busy at processing time. Anthropic’s docs say they retry these internally to some extent, but in practice you’ll still get a small failure rate on large batches. Build retry logic: collect the failed IDs, rebuild those as a new batch, resubmit. For a 10K document run I’ve seen failure rates of 0.1–0.5%, so plan for it but don’t over-engineer it.
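A minimal version of that retry step: filter the original documents down to the failed IDs, then feed them back through the same request builder (the helper name is mine):

```python
def collect_retry_documents(documents: list[dict], failures: list[dict]) -> list[dict]:
    """Return the subset of the original documents whose IDs appear in
    the failure list, ready to rebuild into a fresh batch."""
    failed_ids = {f["id"] for f in failures}
    return [doc for doc in documents if doc["id"] in failed_ids]

# Typical usage, reusing the functions defined earlier:
# retry_docs = collect_retry_documents(documents, failures)
# if retry_docs:
#     submit_batch(build_batch_requests(retry_docs))
```

Cap this at two or three rounds; a document that fails repeatedly usually has a content problem, not a capacity one, and belongs in a dead-letter list instead.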
Putting It All Together: End-to-End Pipeline
Here’s a complete wrapper that handles the full cycle for a document processing job:
def run_batch_pipeline(
    documents: list[dict],
    batch_size: int = 10_000,
    output_path: str = "results.jsonl"
) -> dict:
    """
    Full batch pipeline with automatic chunking for >10K documents.

    documents: list of {"id": str, "text": str}
    batch_size: max requests per batch (API limit is 10,000)
    output_path: where to write results as JSONL

    Returns summary dict with counts and cost estimate
    """
    import time

    all_successes = []
    all_failures = []
    batch_ids = []

    # Chunk documents into batches if over the 10K limit
    chunks = [documents[i:i+batch_size] for i in range(0, len(documents), batch_size)]
    print(f"Processing {len(documents)} documents in {len(chunks)} batch(es)")

    # Submit all batches first — don't wait for each one sequentially
    for chunk in chunks:
        requests = build_batch_requests(chunk)
        batch_id = submit_batch(requests)
        batch_ids.append(batch_id)
        time.sleep(2)  # small gap between submissions to avoid rate limit on batch creation

    # Now poll all batches to completion
    for batch_id in batch_ids:
        print(f"\nWaiting on batch {batch_id}...")
        poll_until_complete(batch_id)
        successes, failures = process_batch_results(batch_id)
        all_successes.extend(successes)
        all_failures.extend(failures)

    # Write results to JSONL
    with open(output_path, "w") as f:
        for item in all_successes:
            f.write(json.dumps(item) + "\n")

    # Rough cost estimate at Haiku batch rates (500 input tokens, 200 output per doc)
    total_docs = len(all_successes)
    est_cost = total_docs * ((500 * 0.0000002) + (200 * 0.000001))

    summary = {
        "total_submitted": len(documents),
        "succeeded": len(all_successes),
        "failed": len(all_failures),
        "failure_rate": f"{len(all_failures)/len(documents)*100:.1f}%",
        "estimated_cost_usd": round(est_cost, 4),
        "failures": all_failures[:10]  # first 10 for inspection
    }
    print(f"\nDone. {summary}")
    return summary
Submit all your batches first, then poll them in parallel. Don’t submit-wait-submit-wait — that’s leaving throughput on the table. Multiple batches process concurrently, and the API documentation has allowed on the order of 100 batches awaiting processing at a time; check the current limits before planning around that number.
What Breaks in Production (Honest Assessment)
JSON output reliability: Even with explicit instructions to return JSON, Haiku will occasionally add preamble like “Here is the extracted JSON:” before the actual object. Add a post-processing step that strips everything before the first { character. For critical pipelines, use claude-sonnet-4-5 instead — it’s significantly more reliable at structured output, and the batch discount still makes it affordable.
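One way to sketch that post-processing step: take the span from the first `{` to the last `}` and parse it, returning None rather than raising on hopeless output (the function name is mine):

```python
import json

def extract_json(text: str):
    """Strip any preamble or trailer around the first JSON object
    in the model's output. Returns a dict, or None if nothing parses."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
```

Route the None cases into your failure list alongside the API errors so they get retried or inspected, not silently dropped.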
Token estimation: You can’t get a cost estimate from the API before submitting. Count tokens yourself before committing to a large run — the Anthropic SDK exposes a counting endpoint (client.messages.count_tokens); note that tiktoken is OpenAI’s tokenizer and only approximates Claude’s counts. Discovering your documents average 3,000 tokens instead of 500 changes your cost by 6x.
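For a quick pre-flight sanity check that needs no API calls at all, a characters-per-token heuristic is enough to catch a 6x surprise — roughly 4 characters per token for English prose. This is an approximation, not the real tokenizer, and both function names here are mine:

```python
def rough_token_estimate(text: str) -> int:
    """Crude estimate: English prose averages ~4 characters per token.
    Use the real tokenizer for billing-grade numbers."""
    return max(1, len(text) // 4)

def flag_oversized(documents: list[dict], expected_tokens: int = 500) -> list[str]:
    """Return IDs of documents whose estimate exceeds 2x the expected size."""
    return [
        doc["id"] for doc in documents
        if rough_token_estimate(doc["text"]) > 2 * expected_tokens
    ]
```

Run this over the corpus first; if more than a handful of IDs come back, recompute your cost estimate before submitting anything.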
Batch expiry: Batch results are available for 29 days after creation, then they’re gone. If you’re building long-running pipelines, make sure you’re pulling results before they disappear. Store the batch ID and submission timestamp persistently — don’t rely on in-memory state.
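An append-only manifest file is the lightest way to persist that state. A sketch, assuming the 29-day result window described above (function names are mine):

```python
import json
import time
from pathlib import Path

def record_batch(manifest_path: str, batch_id: str) -> None:
    """Append batch ID + submission time so results can be pulled
    before expiry even if the submitting process restarts."""
    entry = {"batch_id": batch_id, "submitted_at": time.time()}
    with open(manifest_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def expiring_soon(manifest_path: str, days_left: float = 3.0) -> list[str]:
    """Return batch IDs whose 29-day result window closes within days_left."""
    now = time.time()
    cutoff = (29 - days_left) * 86400
    ids = []
    for line in Path(manifest_path).read_text().splitlines():
        entry = json.loads(line)
        if now - entry["submitted_at"] >= cutoff:
            ids.append(entry["batch_id"])
    return ids
```

A daily cron that downloads results for anything `expiring_soon` returns is all the safety net most pipelines need; you only need a database once several services share the manifest.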
No streaming: Results only become available after the batch fully ends. You can’t peek at partial results mid-run. For 24-hour jobs this is fine; for 4-hour jobs it’s a workflow consideration.
Rate limits on batch creation: You’re still subject to API rate limits on the batch creation calls themselves. Space your batch submissions 2–5 seconds apart if submitting many batches in quick succession.
When to Use Batch Processing vs Standard API
Use the batch API when: you have 500+ documents to process, results can wait hours, you’re running scheduled jobs (nightly classification, weekly report generation), or you’re doing initial data processing on a static corpus.
Stick with standard synchronous API when: you need responses in under 30 seconds, you’re serving user-facing features, you need to chain outputs immediately into subsequent calls, or you’re running fewer than a few hundred requests where the operational overhead isn’t worth it.
The sweet spot for Claude batch API processing is any ETL-style workload: pulling documents from S3 or a database, enriching them with AI-extracted metadata, writing back to a data warehouse. This pattern runs overnight, costs half as much, and requires zero infrastructure beyond a simple polling script.
Choosing the Right Model for Batch Jobs
For extraction and classification: Haiku is your default. It’s fast, cheap, and handles well-structured prompts reliably. At batch pricing, 10K documents with 500-token average input costs about $1–3 depending on output length.
For summarisation or analysis requiring reasoning: Sonnet at batch pricing is genuinely compelling — half of what you’d pay for real-time Sonnet, and the quality gap over Haiku is real for nuanced tasks. I’d use Sonnet for anything that needs multi-step reasoning, code analysis, or long-document summarisation.
For the highest-stakes content where quality is non-negotiable: Opus batch pricing makes large-scale Opus usage feasible for the first time. If you’re processing legal documents, medical records, or financial filings where errors are costly, the 50% batch discount on Opus brings it into budget for many use cases.
Bottom line for different reader types: If you’re a solo founder with a document-heavy product — contract analysis, invoice processing, content moderation — the batch API is the most immediate infrastructure win available. You can cut your AI costs in half overnight with 2–3 hours of implementation work. If you’re on a team running production pipelines, wrap this in a proper job queue (SQS, Redis, whatever you already use), add dead-letter handling for the failed IDs, and you have an enterprise-grade document processing pipeline for almost nothing. The code above is production-ready with minor additions; don’t let anyone sell you a complex architecture when a polling script and a JSONL file will handle 99% of batch workloads.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

