Saturday, March 21

Single-agent Claude setups break down fast in production. The moment you need to research a topic, validate the output, format it for multiple channels, and route it to the right destination — all in one coherent workflow — you’re either stuffing an absurd amount of context into one prompt or watching quality degrade as the model tries to juggle too many responsibilities. Multi-agent Claude orchestration solves this by distributing cognitive load across specialized agents that communicate through structured message passing. This article covers the architectural patterns that actually work: routing, delegation, consensus, and shared state management — with working Python code you can adapt today.

Why Single Agents Hit a Wall

The failure mode is predictable. You start with one Claude prompt that does everything. It works fine for demos. In production, you notice the model starts making tradeoffs — it abbreviates the research section to leave room for formatting, or it skips edge-case validation because the context window is already dense. Claude Sonnet has a 200K token context, which sounds enormous until you’re feeding it 50 pages of source material, a 2,000-token system prompt, tool definitions, and conversation history.

The smarter move is to decompose the problem. Each agent gets a narrow, well-defined job. A router agent classifies incoming tasks. A research agent fetches and summarizes information. A critic agent evaluates output against a rubric. A formatter agent handles output transformation. They communicate by passing structured messages — JSON objects with a defined schema — rather than free-form prose. This is not theoretical. Teams shipping Claude in production are doing this today, and the architecture decisions you make early determine how painful the maintenance becomes later.

The Core Architecture: Message-Passing Between Agents

Before writing any code, nail down your message schema. Every inter-agent message should carry: a task ID, the sender, the recipient, the message type, the payload, and metadata like timestamps and retry counts. Skipping this in early prototypes always comes back to bite you when you’re trying to debug why agent B is acting on stale data from agent A.

from dataclasses import dataclass, field
from typing import Any, Optional
import uuid
from datetime import datetime, timezone

@dataclass
class AgentMessage:
    task_id: str
    sender: str
    recipient: str
    message_type: str  # "task", "result", "error", "consensus_request"
    payload: dict[str, Any]
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    retry_count: int = 0
    parent_message_id: Optional[str] = None

The parent_message_id field is worth the extra few bytes: it lets you reconstruct the full conversation chain for debugging and billing attribution. At roughly $15 per million output tokens for Sonnet 4.5 at the time of writing, a runaway agent loop that makes 200 unnecessary calls will cost you real money. Message tracing catches this early.
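As a sketch of what that tracing buys you, here is a minimal helper that walks parent_message_id links from any message back to the root. It assumes messages have been serialized to dicts (e.g. via dataclasses.asdict) and stored somewhere queryable; the function name is illustrative.

```python
from typing import Optional

def trace_chain(messages: list[dict], leaf_id: str) -> list[dict]:
    """Follow parent_message_id links from a leaf message back to the root.

    Returns the chain in chronological order (root first), which is the
    order you want when reading a debug trace or attributing costs.
    """
    by_id = {m["message_id"]: m for m in messages}
    chain = []
    current: Optional[str] = leaf_id
    while current is not None:
        msg = by_id[current]
        chain.append(msg)
        current = msg.get("parent_message_id")
    return list(reversed(chain))
```

Run this over your stored messages when an agent produces a surprising result, and you can see exactly which upstream output it was reacting to.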

Building the Router Agent

The router is the entry point. It receives raw user requests, classifies them, and delegates to the right specialized agent. Keep the router’s system prompt tight — its only job is classification and routing, not answering the question itself.

import anthropic
import json

client = anthropic.Anthropic()

ROUTER_SYSTEM = """You are a routing agent. Classify the incoming task and return a JSON object with:
- "agent": one of ["research", "code_review", "summarizer", "critic"]
- "priority": one of ["high", "normal", "low"]
- "reasoning": one sentence explaining your choice

Return only valid JSON. No prose."""

def route_task(user_request: str) -> dict:
    response = client.messages.create(
        model="claude-haiku-4-5",  # Use Haiku for cheap, fast routing
        max_tokens=200,
        system=ROUTER_SYSTEM,
        messages=[{"role": "user", "content": user_request}]
    )
    return json.loads(response.content[0].text)

Notice the model choice: Haiku for routing, Sonnet for heavy lifting. Haiku input tokens currently run at roughly a third of Sonnet's price (about $1 per million at the time of writing), so you're paying very little to classify tasks. Doing this with Sonnet is a common waste. The router doesn't need deep reasoning; it needs to be fast and consistent.
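Claude usually honors the "JSON only" instruction, but not always; the raw json.loads in route_task will throw on a prose-wrapped answer. A defensive parser like this hypothetical one (strict parse first, then a best-effort extraction of the first {...} block) keeps the router resilient:

```python
import json
import re

def parse_router_json(raw: str) -> dict:
    """Best-effort parse of router output.

    Try strict JSON first; if the model wrapped the object in prose,
    pull out the first {...} block and parse that instead.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise ValueError(f"No JSON object found in router output: {raw[:200]}")
```

Swap this in for the bare json.loads call in route_task, and keep a retry with a more explicit prompt as the fallback when even extraction fails.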

Delegation Patterns That Hold Up in Production

There are three delegation patterns worth knowing. The one you pick depends on whether tasks can run in parallel, whether order matters, and how you handle failures.

Sequential Delegation (Pipeline)

Agent A completes its work, passes the result to Agent B, which passes to Agent C. Simple, predictable, easy to debug. The downside is latency — you wait for each agent in turn. Use this when each step genuinely depends on the previous output (e.g., research → draft → critique → final edit).

def run_pipeline(task: str) -> str:
    # Step 1: Research agent gathers context
    research_result = call_agent(
        agent_name="research",
        model="claude-sonnet-4-5",
        system="You are a research agent. Gather relevant facts and return structured findings.",
        task=task
    )

    # Step 2: Drafting agent uses research output
    draft_result = call_agent(
        agent_name="drafter",
        model="claude-sonnet-4-5",
        system="You are a drafting agent. Write a clear response using the provided research.",
        task=f"Research findings:\n{research_result}\n\nOriginal task: {task}"
    )

    # Step 3: Critic validates the draft
    final_result = call_agent(
        agent_name="critic",
        model="claude-haiku-4-5",  # Criticism doesn't need Sonnet
        system="Review this draft. Flag factual errors, logical gaps, or unclear sections. Return the corrected version.",
        task=draft_result
    )

    return final_result

def call_agent(agent_name: str, model: str, system: str, task: str) -> str:
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        system=system,
        messages=[{"role": "user", "content": task}]
    )
    return response.content[0].text

Parallel Fan-Out with Aggregation

When subtasks are independent, run them concurrently and aggregate results. This requires async execution and a merging step. The tricky part isn’t the parallelism — Python’s asyncio handles that — it’s writing a merge agent that reconciles potentially contradictory outputs without hallucinating a consensus that doesn’t exist.

import asyncio

async def run_parallel_analysis(topic: str) -> dict:
    async def analyze_aspect(aspect: str, system_prompt: str) -> tuple[str, str]:
        # Async wrapper around the synchronous Anthropic client
        loop = asyncio.get_running_loop()
        result = await loop.run_in_executor(
            None,
            lambda: call_agent("analyst", "claude-sonnet-4-5", system_prompt, topic)
        )
        return aspect, result

    tasks = [
        analyze_aspect("technical", "Analyze the technical feasibility of this topic."),
        analyze_aspect("business", "Analyze the business implications of this topic."),
        analyze_aspect("risks", "Identify the top 3 risks associated with this topic."),
    ]

    results = await asyncio.gather(*tasks)
    return dict(results)

# Usage: wall-clock time is roughly the slowest single call instead of the sum of all three
analysis = asyncio.run(run_parallel_analysis("building a RAG pipeline for legal documents"))
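One way to keep the merge step honest is to put the contradiction-surfacing instruction in the merge prompt itself, rather than hoping the model notices disagreements on its own. The helper below is a sketch (the function name and prompt wording are illustrative, not from any SDK); you would feed its output to a Sonnet call via the call_agent helper from the pipeline section.

```python
def build_merge_prompt(results: dict[str, str]) -> str:
    """Assemble independent analyst outputs into one merge-agent prompt.

    The instruction explicitly tells the model to list contradictions
    under a DISAGREEMENTS heading instead of inventing a consensus.
    """
    sections = "\n\n".join(
        f"## {aspect} analysis\n{text}" for aspect, text in results.items()
    )
    return (
        "Merge the following independent analyses into one summary.\n"
        "Where the analyses contradict each other, list those points under a "
        "'DISAGREEMENTS' heading rather than smoothing them over.\n\n"
        + sections
    )
```

The design choice here is deliberate: giving the model an explicit place to put disagreements makes it far less likely to fabricate agreement, and gives downstream code a known string to check for.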

Consensus Patterns: Getting Multiple Agents to Agree

Some decisions are too important to trust to a single agent call. Code security reviews, factual claims that will be published, financial projections — these benefit from a consensus round where multiple agents independently evaluate the same question and a judge agent resolves disagreements.

The pattern: send the same question to N agents (typically 3), collect their responses, then pass all three to a fourth “judge” agent that identifies points of agreement and explicitly surfaces disagreements for human review.

def get_consensus(question: str, n_agents: int = 3) -> dict:
    # Get independent assessments
    assessments = []
    for i in range(n_agents):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=500,
            system=f"You are analyst #{i+1}. Provide an independent assessment. Be direct. End with VERDICT: [APPROVE/REJECT/UNCERTAIN]",
            messages=[{"role": "user", "content": question}]
        )
        assessments.append(response.content[0].text)

    # Judge agent resolves
    judge_input = "\n\n---\n\n".join([f"Analyst {i+1}:\n{a}" for i, a in enumerate(assessments)])
    judge_response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=800,
        system="""You are a judge agent. Review multiple analyst assessments and return JSON with:
- "consensus": "APPROVE", "REJECT", or "SPLIT"
- "confidence": 0.0 to 1.0
- "key_agreements": list of points all analysts agreed on
- "key_disagreements": list of points where analysts diverged
- "recommendation": one clear sentence""",
        messages=[{"role": "user", "content": f"Assessments to review:\n\n{judge_input}"}]
    )

    return json.loads(judge_response.content[0].text)

This pattern costs roughly 4x a single agent call. On Sonnet 4.5, for a 500-token question with 500-token responses from each analyst, you're looking at around $0.04 per consensus round at current pricing. Expensive for high-volume workflows, worth it for high-stakes decisions. Don't use consensus for anything that runs more than a few hundred times a day without carefully modeling your costs first.
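A back-of-the-envelope estimator makes that modeling concrete. The rate defaults below are assumptions pinned to Sonnet-class pricing at the time of writing ($3 per million input tokens, $15 per million output) and will drift; the 300-token judge-prompt allowance is also an assumption to tune.

```python
def estimate_consensus_cost(input_tokens: int, output_tokens: int,
                            n_agents: int = 3,
                            input_rate: float = 3.0,
                            output_rate: float = 15.0) -> float:
    """Rough per-round consensus cost in USD. Rates are $/million tokens."""
    # N analyst calls, each reading the question and writing an assessment
    analyst_in = n_agents * input_tokens
    analyst_out = n_agents * output_tokens
    # One judge call that reads every assessment plus its own instructions
    judge_in = analyst_out + 300   # 300-token allowance for the judge prompt
    judge_out = 800                # matches the judge's max_tokens above
    total_in = analyst_in + judge_in
    total_out = analyst_out + judge_out
    return (total_in * input_rate + total_out * output_rate) / 1_000_000
```

For the 500-token scenario above this lands a little over four cents per round; multiply by your daily volume before committing to consensus anywhere.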

State Management Across Agent Calls

Agents are stateless by default — each API call knows nothing about the previous one. You have three options for maintaining shared state, each with real tradeoffs.

In-Memory State (Fast, Fragile)

A shared Python dict or a dataclass passed between function calls. Fine for short-lived workflows that complete in a single process. Dies the moment you add horizontal scaling or the process crashes. Don’t use this if your workflow spans more than a few seconds or crosses service boundaries.
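A minimal version looks like the sketch below (the class and field names are illustrative). The depth counter is included because you will want it later for loop protection.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class WorkflowState:
    """In-process shared state for a single workflow run.

    Fine for short-lived, single-process pipelines; lost on crash and
    invisible to other workers, which is exactly its limitation.
    """
    task_id: str
    results: dict[str, Any] = field(default_factory=dict)
    depth: int = 0

    def record(self, agent_name: str, result: Any) -> None:
        self.results[agent_name] = result
```

Pass one instance through your pipeline functions and each agent's output stays available to every later step without re-prompting.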

Redis as a Shared Blackboard

Each agent reads from and writes to a Redis key namespaced by task ID. Agents don’t need to know about each other — they just update the shared state. This is the pattern I’d recommend for most production setups. It’s fast (sub-millisecond reads), survives process restarts, and the TTL feature automatically cleans up completed workflows.

import redis
import json
from datetime import datetime, timezone

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def update_task_state(task_id: str, agent_name: str, result: dict):
    key = f"task:{task_id}:state"
    current = json.loads(r.get(key) or '{}')
    current[agent_name] = {
        "result": result,
        "completed_at": datetime.now(timezone.utc).isoformat()
    }
    r.setex(key, 3600, json.dumps(current))  # 1-hour TTL

def get_task_state(task_id: str) -> dict:
    key = f"task:{task_id}:state"
    return json.loads(r.get(key) or '{}')

Database-Backed State (Auditable, Slower)

Write every state transition to Postgres or similar. Necessary for compliance use cases where you need a full audit trail. Adds latency (typically 5-20ms per write) and complexity. Use when you need to answer “exactly what did agent X do at 14:32 on Tuesday” six months from now.
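The shape of the audit table matters more than the engine. The sketch below uses sqlite3 as a stand-in for Postgres so it runs anywhere; the append-only insert pattern (never update, never delete) is what makes the trail trustworthy.

```python
import sqlite3
from datetime import datetime, timezone

def init_audit_db(conn: sqlite3.Connection) -> None:
    """Create the transitions table if it doesn't exist yet."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS agent_transitions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            task_id TEXT NOT NULL,
            agent_name TEXT NOT NULL,
            state TEXT NOT NULL,
            recorded_at TEXT NOT NULL
        )
    """)

def record_transition(conn: sqlite3.Connection, task_id: str,
                      agent_name: str, state: str) -> None:
    # Append-only: rows are never updated, so the audit trail stays complete
    conn.execute(
        "INSERT INTO agent_transitions (task_id, agent_name, state, recorded_at) "
        "VALUES (?, ?, ?, ?)",
        (task_id, agent_name, state, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
```

Answering "what did agent X do at 14:32 on Tuesday" then becomes a single SELECT filtered on agent_name and a recorded_at range.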

What Breaks in Production (And How to Handle It)

The failure modes that catch people off guard:

  • Agent loops: Agent A delegates to Agent B, which delegates back to Agent A. Add a max-depth counter to every message and raise an exception at depth 5+.
  • JSON parsing failures: Claude occasionally returns prose instead of JSON, especially when the prompt is ambiguous. Always wrap JSON parsing in a try/except and implement a retry with a more explicit prompt.
  • Context bleed: When you’re summarizing previous agent outputs into new prompts, summaries can introduce errors or lose nuance. Keep raw outputs alongside summaries in your state object.
  • Rate limits under parallel load: Fan-out patterns that fire 10 simultaneous Sonnet calls will hit Anthropic’s rate limits. Implement a token bucket or use a library like tenacity for exponential backoff.
  • Cost overruns from retry storms: A bug that causes every agent to retry 3 times can triple your costs before you notice. Set hard token budgets per task ID and abort when exceeded.
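The first and last failure modes above can be guarded with a few lines of bookkeeping before each dispatch. This sketch uses a module-level dict as the budget store for brevity; in production you would keep the counters in the Redis blackboard so they survive restarts. The names and limits are illustrative.

```python
class BudgetExceeded(Exception):
    """Raised when a task trips the depth or token guard."""

MAX_DEPTH = 5           # abort delegation chains deeper than this
TOKEN_BUDGET = 50_000   # hard per-task token cap; tune per workload

_spent: dict[str, int] = {}  # task_id -> tokens consumed so far

def check_guards(task_id: str, depth: int, tokens_used: int) -> None:
    """Call before dispatching another agent; raises if either guard trips."""
    if depth >= MAX_DEPTH:
        raise BudgetExceeded(f"task {task_id}: delegation depth {depth} hit the cap")
    spent = _spent.get(task_id, 0) + tokens_used
    _spent[task_id] = spent
    if spent > TOKEN_BUDGET:
        raise BudgetExceeded(f"task {task_id}: token budget exceeded ({spent} tokens)")
```

Thread the depth value through your AgentMessage payload (incrementing on every delegation) and feed usage numbers from the API response into tokens_used, and both runaway loops and retry storms abort instead of silently burning money.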

When to Use Multi-Agent Claude Orchestration

Be honest with yourself before adding agents: does your workflow genuinely need multiple specialized agents, or are you over-engineering? A single well-prompted Claude call handles the majority of tasks that developers initially reach for multi-agent systems to solve.

Reach for multi-agent Claude orchestration when:

  • Your task naturally decomposes into sequential or parallel steps with different quality requirements
  • You need independent validation or consensus on high-stakes outputs
  • Different parts of the workflow have vastly different latency or cost tolerances (route cheap/fast tasks to Haiku, expensive reasoning to Sonnet)
  • You’re hitting context window limits because a single agent needs to carry too much state
  • You want to A/B test different agents on the same subtask without restructuring your whole system

Stick with a single agent when: the task can be completed in one focused prompt, you’re still in early prototyping, or the coordination overhead would cost more than the quality improvement is worth.

For solo founders and small teams: start with the pipeline pattern. It’s the easiest to debug and maintain. Move to fan-out parallelism only when you’ve proven the bottleneck is latency, not logic. Add consensus patterns only for truly high-stakes decisions — the cost adds up fast at scale. The goal of multi-agent Claude orchestration isn’t complexity; it’s matching the right model, at the right cost, to the right part of the problem.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
