Saturday, March 21

Your Claude agent works perfectly in testing. Then it hits production and you discover that Anthropic’s API has a 529 overload error at 2am, your retry logic hammers the endpoint and burns through rate limits, and the whole workflow silently dies. Claude agent error handling patterns are the difference between a demo that impresses and a system that actually runs. This article covers what I’ve learned shipping agents that handle thousands of calls per day — the specific patterns, the code, and the failure modes nobody documents.

Why Claude Agents Fail Differently Than Regular APIs

LLM API errors have a different character than typical REST API failures. You’re not just dealing with network timeouts and 5xx errors. You’re dealing with rate limits that vary by tier, context windows that overflow mid-conversation, outputs that parse incorrectly because the model decided to be creative, and latency spikes that can stretch a normally 2-second call to 45 seconds under load.

The failure modes I see most often in production Claude deployments:

  • 529 Overloaded — Anthropic’s signal that they’re at capacity. Not a rate limit on your account; it’s infrastructure-level. Backing off and retrying usually works within 30–60 seconds.
  • 429 Rate Limited — You’ve exceeded your token or request quota. The response includes a retry-after header you should actually read.
  • Timeout on long outputs — Claude 3.5 Sonnet generating a 4000-token response can take 30+ seconds. Default HTTP timeouts will kill this.
  • Malformed JSON in structured outputs — Claude mostly behaves, but under specific prompt conditions it’ll wrap JSON in markdown fences, add commentary, or truncate mid-object.
  • Context overflow — You hit 200k tokens and your agent crashes instead of handling the limit gracefully.

Each of these needs a different response. Treating them all as “retry three times then give up” is how you build fragile systems.
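To keep those per-error policies explicit, one option is a small dispatch table. A sketch, where the keys and policy strings are this article's conventions rather than SDK constants:

```python
# Illustrative mapping from failure mode to handling policy. The keys and
# policy descriptions are documentation conventions, not SDK constants.
ERROR_POLICIES = {
    529: "back off with jitter, retry within 30-60s",
    429: "honor the retry-after header before retrying",
    "timeout": "raise the client timeout or switch to streaming",
    "malformed_json": "run a parse/repair step before failing",
    "context_overflow": "truncate or summarize history, then retry",
}

def policy_for(error_key):
    """Looks up the handling policy for a failure mode."""
    return ERROR_POLICIES.get(error_key, "retry with backoff, then fail fast")
```

The sections below implement each of these policies in turn.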

The Retry Layer: Exponential Backoff With Jitter

The standard exponential backoff pattern works, but you need jitter — without it, multiple concurrent agent instances sync up and retry simultaneously, creating a thundering herd that makes the overload worse.

Here’s a production-ready retry wrapper I use with the Anthropic SDK:

import anthropic
import time
import random
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def call_claude_with_retry(
    client: anthropic.Anthropic,
    messages: list,
    model: str = "claude-3-5-sonnet-20241022",
    max_tokens: int = 1024,
    max_retries: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> Optional[anthropic.types.Message]:
    """Calls the Messages API with exponential backoff and jitter.
    Returns the response, or None once all retries are exhausted."""
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model=model,
                max_tokens=max_tokens,
                messages=messages,
                timeout=120.0,  # explicit timeout; the SDK default (10 minutes) rarely matches an SLA
            )
            return response
            
        except anthropic.RateLimitError as e:
            # Read the retry-after header if the response includes one
            retry_after = None
            header = e.response.headers.get("retry-after")
            if header:
                try:
                    retry_after = float(header)
                except ValueError:
                    pass  # retry-after can also be an HTTP date; fall back to backoff
            wait = retry_after if retry_after is not None else (base_delay * (2 ** attempt))
            wait = min(wait, max_delay)
            wait += random.uniform(0, wait * 0.1)  # add 10% jitter
            
            logger.warning(f"Rate limited (attempt {attempt + 1}/{max_retries}). Waiting {wait:.1f}s")
            time.sleep(wait)
            
        except anthropic.APIStatusError as e:
            if e.status_code == 529:  # overloaded
                wait = min(base_delay * (2 ** attempt), max_delay)
                wait += random.uniform(0, wait * 0.2)  # more jitter for overload
                logger.warning(f"API overloaded (attempt {attempt + 1}/{max_retries}). Waiting {wait:.1f}s")
                time.sleep(wait)
            elif e.status_code in (400, 401, 403):
                # Don't retry auth errors or bad requests
                logger.error(f"Non-retryable error {e.status_code}: {e.message}")
                raise
            else:
                wait = min(base_delay * (2 ** attempt), max_delay)
                time.sleep(wait)
                
        except anthropic.APITimeoutError:
            wait = min(base_delay * (2 ** attempt), max_delay)
            logger.warning(f"Timeout (attempt {attempt + 1}/{max_retries}). Waiting {wait:.1f}s")
            time.sleep(wait)
    
    logger.error(f"All {max_retries} retry attempts exhausted")
    return None  # caller decides what to do with None

A few things worth calling out. The timeout=120.0 is explicit — the SDK default is 10 minutes, which sounds generous but means a single stuck request can occupy a worker or task slot for that long. Set it to something that matches your SLA. The 529 handling uses more jitter than the rate limit path because overload conditions tend to hit many agents simultaneously and persist for a while.

Also notice that 400, 401, and 403 errors raise immediately. Retrying a bad API key or a malformed request is a waste of time and budget. Not all errors are worth retrying — this is the mistake most people make when they write “retry on any exception.”
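That rule generalizes into a small predicate. The cutoffs below are my convention (throttling codes and server-side errors are transient, other 4xx errors are permanent), not something the SDK enforces:

```python
def is_retryable(status_code: int) -> bool:
    """True if a retry has a realistic chance of succeeding: rate limits,
    request timeouts, and server-side errors (including 529 Overloaded).
    Auth failures and malformed requests (other 4xx) are permanent."""
    return status_code in (408, 429) or 500 <= status_code < 600
```

This would slot into the APIStatusError branch of the wrapper in place of hardcoded status tuples.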

Fallback Models: When Claude Isn’t Available

For business-critical workflows, a single model dependency is a liability. My preferred pattern is a fallback chain: try your primary model, fall back to a lighter Claude model, then optionally fall back to a different provider entirely.

from typing import Callable
import openai  # fallback provider

FALLBACK_CHAIN = [
    {
        "provider": "anthropic",
        "model": "claude-3-5-sonnet-20241022",
        "client_key": "anthropic_client",
    },
    {
        "provider": "anthropic", 
        "model": "claude-3-haiku-20240307",  # cheaper, faster, usually available
        "client_key": "anthropic_client",
    },
    {
        "provider": "openai",
        "model": "gpt-4o-mini",  # cross-provider fallback
        "client_key": "openai_client",
    },
]

def call_with_fallback(
    prompt: str,
    system: str,
    clients: dict,  # {"anthropic_client": ..., "openai_client": ...}
    task_adapter: Callable,  # normalises responses to a common format
) -> dict:
    
    last_error = None
    
    for config in FALLBACK_CHAIN:
        try:
            logger.info(f"Attempting {config['provider']}/{config['model']}")
            client = clients[config["client_key"]]
            result = task_adapter(client, config, prompt, system)
            
            if config["model"] != FALLBACK_CHAIN[0]["model"]:
                logger.warning(f"Using fallback model: {config['model']}")
                # consider alerting here — fallback usage is a signal
            
            return {"result": result, "model_used": config["model"]}
            
        except Exception as e:
            last_error = e
            logger.warning(f"Model {config['model']} failed: {e}. Trying next.")
            continue
    
    raise RuntimeError(f"All models in fallback chain failed. Last error: {last_error}")

The task_adapter is important — it abstracts the response format differences between providers. Anthropic returns response.content[0].text, OpenAI returns response.choices[0].message.content. Your business logic shouldn’t care which model answered.
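A minimal task_adapter might look like this sketch: it dispatches on the provider field and returns plain text either way. The max_tokens value and the message shapes are assumptions about your task, not requirements:

```python
def simple_task_adapter(client, config, prompt, system):
    """Runs the prompt on whichever provider `config` names and
    returns plain text, hiding the response-shape differences."""
    if config["provider"] == "anthropic":
        response = client.messages.create(
            model=config["model"],
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
    if config["provider"] == "openai":
        response = client.chat.completions.create(
            model=config["model"],
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
        )
        return response.choices[0].message.content
    raise ValueError(f"Unknown provider: {config['provider']}")
```

For richer tasks (tool use, structured output) the adapter grows, but the contract stays the same: config in, normalized result out.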

Cost reality check: running Claude 3 Haiku as a fallback costs roughly $0.00025 per 1K input tokens vs Sonnet’s $0.003 — about 12x cheaper. For tasks where quality matters less than availability, Haiku is a legitimate primary option, not just a fallback.

Graceful Degradation for Agent Workflows

Handling Structured Output Failures

When you’re extracting structured data from Claude, the model will occasionally produce output that doesn’t parse. The naive approach is to crash. The production approach is to have a repair step:

import json
import re

def parse_json_response(raw_text: str) -> dict:
    """
    Attempts to extract valid JSON from a Claude response,
    even when the model wraps it in markdown or adds commentary.
    """
    # First attempt: direct parse
    try:
        return json.loads(raw_text.strip())
    except json.JSONDecodeError:
        pass
    
    # Second attempt: extract from markdown fences
    fence_match = re.search(r"```(?:json)?\s*([\s\S]*?)```", raw_text)
    if fence_match:
        try:
            return json.loads(fence_match.group(1).strip())
        except json.JSONDecodeError:
            pass
    
    # Third attempt: find first { ... } block
    brace_match = re.search(r"\{[\s\S]*\}", raw_text)
    if brace_match:
        try:
            return json.loads(brace_match.group(0))
        except json.JSONDecodeError:
            pass
    
    # If all parsing fails, raise with the raw text for debugging
    raise ValueError(f"Could not extract JSON from response: {raw_text[:200]}")

In practice, step two (the markdown fence extraction) catches about 90% of “unexpected” Claude output failures. The model has been trained on so much code that it reflexively adds fences to structured output, especially if your prompt mentions JSON in a way that implies a code context.

Circuit Breaker Pattern

Retrying indefinitely is worse than failing fast when the API is down for an extended period. A circuit breaker tracks consecutive failures and stops sending requests when a threshold is crossed, giving the upstream service time to recover:

from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: int = 60  # seconds before trying again
    _failures: int = field(default=0, init=False)
    _state: str = field(default="closed", init=False)  # closed = normal operation
    _opened_at: Optional[datetime] = field(default=None, init=False)
    
    def call(self, func, *args, **kwargs):
        if self._state == "open":
            if datetime.now() > self._opened_at + timedelta(seconds=self.recovery_timeout):
                self._state = "half-open"
                logger.info("Circuit breaker entering half-open state")
            else:
                raise RuntimeError("Circuit breaker open — skipping API call")
        
        try:
            result = func(*args, **kwargs)
            if self._state == "half-open":
                self._reset()
                logger.info("Circuit breaker closed — service recovered")
            return result
        except Exception:
            self._record_failure()
            raise
    
    def _record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._state = "open"
            self._opened_at = datetime.now()
            logger.error(f"Circuit breaker opened after {self._failures} failures")
    
    def _reset(self):
        self._failures = 0
        self._state = "closed"
        self._opened_at = None

Wire this in front of your retry wrapper. The circuit breaker sits at the service level; retry logic sits at the request level. They’re complementary, not alternatives.

Timeout Management and Streaming

For long-running generations, streaming is often the better answer than increasing your timeout. It gives you partial results, lets you detect stalls mid-stream, and keeps connections alive through proxies that would otherwise close idle connections:

def stream_with_timeout(client, messages, model, max_tokens, char_timeout=10.0):
    """
    Streams a Claude response and raises if no tokens arrive for `char_timeout` seconds.
    Useful for detecting mid-stream stalls without a global timeout.
    """
    import signal
    
    def timeout_handler(signum, frame):
        raise TimeoutError(f"No tokens received for {char_timeout}s — stream stalled")
    
    full_response = []
    
    # Install the handler once and arm the timer before waiting for the
    # first chunk; otherwise a stall before any output goes undetected.
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(int(char_timeout))
    
    with client.messages.stream(
        model=model,
        max_tokens=max_tokens,
        messages=messages,
    ) as stream:
        for text_chunk in stream.text_stream:
            signal.alarm(int(char_timeout))  # restart the timer on each chunk
            full_response.append(text_chunk)
    
    signal.alarm(0)  # clear the timer on completion
    return "".join(full_response)

Note: signal.SIGALRM is Unix-only. For Windows or async contexts, use asyncio.wait_for with streaming coroutines instead. The underlying principle — reset a timer on each chunk, raise if you go silent — translates to any runtime.
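For async code, the same reset-a-timer-per-chunk idea can be sketched with asyncio.wait_for. Here chunk_iter is assumed to be any async iterator of text chunks, such as the async SDK's text stream:

```python
import asyncio

async def consume_with_stall_detection(chunk_iter, chunk_timeout: float = 10.0) -> str:
    """Collects text chunks from an async iterator, raising TimeoutError
    if any single chunk takes longer than chunk_timeout to arrive."""
    parts = []
    iterator = chunk_iter.__aiter__()
    while True:
        try:
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=chunk_timeout)
        except StopAsyncIteration:
            break
        parts.append(chunk)
    return "".join(parts)
```

One caveat: asyncio.wait_for cancels the pending read on timeout, so treat the stream as dead after a stall rather than trying to resume it.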

Observability: You Can’t Debug What You Can’t See

All of these patterns produce signals you should be collecting. At minimum, track: model used (especially fallback activations), attempt count per request, latency percentiles, and error type distribution. A sudden spike in 529s is an early warning to implement queuing or reduce throughput. A rise in JSON parse failures is a signal your prompt has drifted.

I log a structured dict after every Claude call in production:

metrics = {
    "model_requested": "claude-3-5-sonnet-20241022",
    "model_used": response.model,  # may differ if fallback kicked in
    "attempts": attempt_count,
    "input_tokens": response.usage.input_tokens,
    "output_tokens": response.usage.output_tokens,
    "latency_ms": elapsed * 1000,
    "cost_usd": calculate_cost(response),  # your pricing lookup
    "error_types": error_log,  # list of errors encountered before success
}

Ship this to whatever observability stack you use — Datadog, Grafana, even a simple Postgres table works. Cost tracking at the call level will save you from billing surprises; at current Sonnet pricing, a 10K-token call costs roughly $0.042, and it’s easy to lose track of that at scale.
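The calculate_cost lookup is nothing exotic. A minimal version, with per-million-token prices hardcoded as a snapshot (verify current rates before relying on these numbers; note this sketch takes token counts rather than the response object):

```python
# Per-million-token prices in USD. Snapshot values at time of writing;
# verify current pricing before relying on these numbers.
PRICING = {
    "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
    "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Returns the USD cost of one call, or 0.0 for models not in the table."""
    rates = PRICING.get(model)
    if rates is None:
        return 0.0
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
```

At these rates, a Sonnet call with 10,000 input and 800 output tokens comes to the $0.042 figure above.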

When to Use Which Pattern

A quick decision guide based on your situation:

  • Solo founder, small volume (<1K calls/day): Implement retry with backoff and JSON repair. That covers 95% of real failures. Skip the circuit breaker until you need it.
  • Team with user-facing product: Add the fallback chain — users noticing downtime is expensive. Haiku as a fallback keeps you live at minimal cost.
  • High-volume pipeline (>10K calls/day): All of the above, plus circuit breaker and full observability. At that volume, a 1% failure rate without visibility is a serious problem.
  • n8n / Make automation users: Most of these patterns need custom code nodes. The retry and timeout settings in the built-in Claude nodes are often not configurable enough — wrap the API call in a custom function node using the patterns above.

Bottom line: Production Claude agent error handling patterns aren’t glamorous to build, but they’re what separates a system that runs for six months from one you’re babysitting every week. Start with the retry wrapper and JSON parser — both are copy-paste ready above. Add the fallback chain when uptime becomes a real business requirement. Layer in circuit breaking and observability as volume grows. The patterns compound; each one makes the next one more effective.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes.
