Saturday, March 21

Most agent workflows fail not because the prompts are bad, but because the structure is wrong. You’ve probably seen both failure modes: a single monolithic prompt trying to do too much and hallucinating halfway through, or a chain of fifteen sequential API calls burning tokens and latency on tasks that could have been combined. Getting the split between prompt chaining and prompt composition right is the difference between an agent that actually ships and one that’s perpetually “almost working.” This article draws a hard line between the two patterns, shows you when each one wins, and gives you working code you can drop into your next build.

What We’re Actually Talking About

The terms get conflated constantly, so let’s nail the definitions before anything else.

Prompt chaining is sequential: the output of one prompt becomes the input of the next. Each step is a discrete call to the model. You pass context forward explicitly. The model doesn’t know anything about what came before unless you include it in the next prompt.

Prompt composition is structural: you build a single, well-formed prompt from multiple components — a system instruction, retrieved context, few-shot examples, user input, constraints — and fire it as one request. The model processes everything in one context window, in one inference pass.

These aren’t just stylistic choices. They have real differences in token cost, latency, coherence, and debuggability. Using the wrong one for a given task is a common source of both unnecessary expense and flaky outputs.

When Chaining Is the Right Call

Chaining shines when your task has true sequential dependencies — where step two genuinely cannot start without the output of step one, and that output needs to be evaluated or transformed before moving forward.

The Classic Use Case: Research → Draft → Edit

Say you’re building a content pipeline. You need to: pull key claims from a source document, then draft a summary, then rewrite it for tone. This is a natural chain because each step meaningfully transforms the data, and you may want to insert validation logic (or a human checkpoint) between steps.

import anthropic

client = anthropic.Anthropic()

def run_chain(source_text: str) -> dict:
    # Step 1: Extract key claims
    extract_response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Extract the 5 most important factual claims from this text. Return as a numbered list only.\n\n{source_text}"
        }]
    )
    claims = extract_response.content[0].text

    # Step 2: Draft a summary using those claims
    draft_response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Write a 3-sentence summary using only these claims:\n{claims}"
        }]
    )
    draft = draft_response.content[0].text

    # Step 3: Rewrite for a technical audience
    final_response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Rewrite this summary for a technical B2B audience. Be direct, no fluff:\n{draft}"
        }]
    )

    return {
        "claims": claims,
        "draft": draft,
        "final": final_response.content[0].text
    }

This costs roughly $0.003–$0.005 total at current Haiku pricing for typical article-length input. Three small calls instead of one big one. The key benefit here isn’t just cost — it’s that you can inspect and log claims and draft independently. When the output is wrong, you know exactly which step broke.
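That inspectability is easy to make systematic. Here’s a minimal sketch of the idea — the `run_steps` helper and the lambda step functions are illustrative stand-ins, not part of any API; in the real chain each step would wrap one of the `client.messages.create` calls above:

```python
def run_steps(steps, initial_input):
    """Run a sequence of (name, fn) steps, recording every intermediate output.

    Each fn takes the previous step's output and returns a string.
    The trace lets you pinpoint which step degraded the result.
    """
    trace = {}
    value = initial_input
    for name, fn in steps:
        value = fn(value)
        trace[name] = value
    return value, trace

# Stand-in step functions; in production each would call the model.
steps = [
    ("extract", lambda text: f"claims from: {text[:20]}"),
    ("draft", lambda claims: f"summary of ({claims})"),
    ("edit", lambda draft: draft.upper()),
]

final, trace = run_steps(steps, "Quarterly revenue grew 14% year over year.")
```

When the final output is wrong, you diff `trace["extract"]` against `trace["draft"]` and see exactly where the degradation happened — instead of re-running the whole chain blind.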

Other Cases Where Chaining Wins

  • Tool-use loops: The model calls a tool, you get a result, you feed it back. The loop is inherently sequential, and n8n and Make both model it natively as node chains.
  • Conditional routing: A classifier step determines which downstream prompt to call. You can’t compose that — you don’t know which branch to include until the first step runs.
  • Long document processing: Chunking a 50-page document and running extraction per chunk, then aggregating, avoids context window limits.
  • Multi-agent handoffs: Agent A produces a plan; Agent B executes it. The separation is deliberate — different roles, possibly different models.
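The conditional-routing case in particular can never be flattened into one prompt. A minimal sketch of the pattern — the intent labels and prompt templates here are invented for illustration; a cheap classifier call would produce `intent_label` in production:

```python
# Downstream prompts, one per branch. Only the chosen branch is ever sent.
ROUTES = {
    "billing": "You are a billing specialist. Resolve this issue:\n{query}",
    "technical": "You are a support engineer. Debug this report:\n{query}",
    "general": "You are a friendly assistant. Answer this question:\n{query}",
}

def route(intent_label: str, query: str) -> str:
    """Pick the downstream prompt based on a classifier step's output.

    Falls back to the general prompt on an unexpected label --
    never let a bad classification crash the chain.
    """
    template = ROUTES.get(intent_label.strip().lower(), ROUTES["general"])
    return template.format(query=query)

# In production, intent_label comes from a first, cheap model call:
#   intent_label = classify(user_query)
prompt = route("Billing", "Why was I charged twice?")
```

The key point: you can’t include the right downstream prompt in a composed request because you don’t know which one it is until the classifier runs. That’s a genuine chain dependency.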

When Composition Is the Right Call

Composition is underused because people default to chaining out of habit. But when your task doesn’t have true sequential dependencies, chaining wastes tokens and adds latency for no reason.

Composition wins when you need the model to hold multiple pieces of context in relation to each other simultaneously. Classification with context, answering a question given retrieved docs, structured extraction from a complex schema — these all benefit from the model seeing everything at once.

The Composition Pattern: Building a Prompt Programmatically

def build_extraction_prompt(
    schema: dict,
    examples: list[dict],
    user_input: str,
    retrieved_docs: list[str]
) -> tuple[str, str]:
    """
    Compose a single prompt from multiple components.
    No chaining needed — the model sees everything in one pass.
    """

    # Format few-shot examples
    example_block = "\n\n".join([
        f"Input: {ex['input']}\nOutput: {ex['output']}"
        for ex in examples
    ])

    # Format retrieved context
    context_block = "\n---\n".join(retrieved_docs[:3])  # cap at 3 to control tokens

    system_prompt = (
        "You are a structured data extraction engine. "
        "Return valid JSON matching the schema provided. "
        "Do not include any explanation — JSON only."
    )

    user_prompt = f"""Extract structured data from the user input below.

SCHEMA:
{schema}

REFERENCE CONTEXT:
{context_block}

EXAMPLES:
{example_block}

USER INPUT:
{user_input}"""

    return system_prompt, user_prompt

# Usage
system, user = build_extraction_prompt(
    schema={"name": "str", "date": "ISO8601", "amount": "float"},
    examples=[
        {"input": "Paid John $50 on March 3rd", "output": '{"name":"John","date":"2024-03-03","amount":50.0}'}
    ],
    retrieved_docs=["Company policy: all amounts in USD"],
    user_input="Sent invoice to Sarah for 120 dollars, April 12"
)

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=256,
    system=system,
    messages=[{"role": "user", "content": user}]
)

One API call. The model reasons over the schema, examples, and retrieved context together — which is exactly what you want for extraction tasks. If you chained this (extract schema understanding → fetch examples → run extraction), you’d pay for three calls and introduce the risk that context bleeds or gets lost between hops.

Where Composition Beats Chaining on Token Efficiency

Here’s the thing people miss: when you chain, you’re often repeating context. If your step-three prompt needs to know what happened in step one, you’re re-sending that content. With composition, you include it once. For a workflow processing 1,000 inputs per day with 800 tokens of shared context, that’s potentially 1.6M tokens/day in savings just from eliminating re-transmission.
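The arithmetic behind that number is worth making explicit. A quick back-of-the-envelope sketch — the call count and context size come from the example above; the per-token price is an assumption for illustration, not a quoted rate:

```python
inputs_per_day = 1_000
shared_context_tokens = 800
chain_calls = 3  # e.g. the extract -> draft -> edit chain

# Chaining re-sends the shared context on every call; composition sends it once.
chained_tokens = inputs_per_day * shared_context_tokens * chain_calls
composed_tokens = inputs_per_day * shared_context_tokens * 1
saved = chained_tokens - composed_tokens  # tokens/day saved

# At an assumed $0.80 per million input tokens, re-transmission alone
# costs over a dollar a day -- roughly $40/month of pure waste.
saved_cost_per_day = saved / 1_000_000 * 0.80
```

The absolute dollar figures are small at this scale, but they multiply linearly with input volume, context size, and chain length — a ten-step chain over 8K tokens of shared context at 10,000 inputs/day is a very different bill.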

Hybrid Patterns: The Real World Isn’t Clean

Production systems almost always use both. The architecture usually looks like this: a chain at the macro level (plan → execute → verify), with composition happening inside each step (each step prompt is assembled from dynamic components).

A Practical n8n Example

In n8n, you’d model this as a workflow where each node represents a chain step. Inside each “AI Agent” node or “HTTP Request” node calling the Claude API, your code builds the prompt via composition — injecting retrieved context, templates, and user data before sending. The chain is the node graph; the composition is the code inside each node.

The mistake I see in n8n builds is people creating a separate API call node for every logical sub-task, even when those sub-tasks have no intermediate dependencies. Three “Extract → Format → Classify” nodes where a single composed prompt would do the same job with one call, less latency, and no hand-off errors.

Using Composition to Reduce Chain Length

# BEFORE: Three-step chain where steps 2 and 3 have no real dependency on each other
# Total: ~3 API calls, ~1200 tokens

# Step 1: Classify intent
# Step 2: Format response style  
# Step 3: Apply tone

# AFTER: Compose steps 2 and 3 into the original prompt
# Total: 1 API call, ~600 tokens

composed_system = """You are a customer support assistant.
- Classify the user's intent as one of: [billing, technical, general]
- Respond in friendly but concise language
- Keep response under 3 sentences
Return a JSON object: {"intent": "...", "response": "..."}"""

# Intent classification, formatting, and tone are all handled in one pass.
# You only chain if the intent classification needs to route to a 
# completely different prompt or data source.

This is the most common optimization I make when reviewing agent code. People chain things that should be composed. The tell: if removing a step’s output doesn’t change what the next step does — it’s not a real chain dependency. Merge them.

Failure Modes Worth Knowing Before You Ship

Chaining failure: context drift. Each time you forward context, you’re summarizing or truncating. By step six of a chain, the model may be reasoning from a degraded version of the original data. Mitigate by passing the original inputs forward explicitly when they matter, not just the transformed output.

Chaining failure: error propagation. A wrong extraction in step one poisons every downstream step. Always validate intermediate outputs before feeding them forward. A simple JSON schema check or assertion is enough in most cases.
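A gate like that is only a few lines. A minimal sketch of validating a step’s output before forwarding it — the expected shape (a numbered list of five claims, matching the extraction step earlier) is just this example’s contract, not a general rule:

```python
import re

def validate_claims(raw: str, expected_count: int = 5) -> list[str]:
    """Check that a chain step returned a numbered list of claims.

    Raises ValueError instead of silently forwarding a malformed output,
    so the error surfaces at the step that produced it.
    """
    lines = [l.strip() for l in raw.strip().splitlines() if l.strip()]
    claims = [l for l in lines if re.match(r"^\d+[.)]\s+\S", l)]
    if len(claims) != expected_count:
        raise ValueError(
            f"expected {expected_count} numbered claims, got {len(claims)}"
        )
    return claims

good = "1. Revenue grew 14%\n2. Margins held\n3. Churn fell\n4. NRR rose\n5. Cash is up"
claims = validate_claims(good)  # passes: five numbered lines
```

On failure you can retry the step, fall back to a stricter prompt, or halt — all far cheaper than letting a malformed extraction poison three downstream calls.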

Composition failure: prompt bloat. Stuffing too many components into one prompt — full documents, long example sets, complex schemas — can cause the model to underweight parts of the prompt. The “lost in the middle” problem is real: Claude and GPT-4 both tend to underweight content in the middle of a long context. Keep composed prompts focused. If you’re exceeding 4K tokens in a composed prompt regularly, audit what’s actually being used.

Composition failure: rigid templates. If your prompt builder doesn’t handle missing optional components gracefully, you’ll hit issues when retrieved docs come back empty or examples aren’t available for a given category. Always build in fallbacks.

Decision Framework: Which Pattern to Use

Run through this checklist when designing a new workflow:

  1. Does step N genuinely require the output of step N-1? If yes, you have a real chain dependency. If no, consider composing them.
  2. Do you need to branch based on an intermediate result? Chain it — you can’t compose conditional logic.
  3. Are you re-sending the same context across multiple calls? That’s a signal to compose more aggressively.
  4. Do you need auditability at each step? Chaining gives you natural checkpoints. Composition is a black box from the outside.
  5. Is latency a user-facing concern? Composition is almost always faster for equivalent tasks: one round-trip instead of several.
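The checklist reduces to a small decision function. A sketch encoding it directly — the boolean inputs map to the questions above, and the precedence order is a judgment call, not gospel:

```python
def pick_pattern(
    real_dependency: bool,    # step N truly needs step N-1's output
    needs_branching: bool,    # an intermediate result selects the next prompt
    needs_step_audit: bool,   # each step must be logged/inspected separately
    resends_context: bool,    # same context repeated across calls today
) -> str:
    """Map the decision checklist to a recommendation.

    Branching, true dependencies, and audit requirements force a chain;
    otherwise composition's single round-trip wins on latency and tokens.
    """
    if real_dependency or needs_branching or needs_step_audit:
        return "chain"
    # No hard reason to chain; re-sent context makes composing urgent,
    # but composition is the default either way.
    return "compose"

choice = pick_pattern(
    real_dependency=False,
    needs_branching=False,
    needs_step_audit=False,
    resends_context=True,
)
```

Obviously real designs have more nuance than four booleans, but if you can’t answer “yes” to any of the chain-forcing questions, that’s strong evidence you should be composing.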

Bottom Line: Who Should Use What

If you’re a solo founder building an MVP: default to composition until you hit a real sequential dependency. You’ll ship faster, spend less on API calls, and have fewer moving parts to debug. Most simple agents — classifiers, extractors, responders — can be handled in one well-composed prompt.

If you’re building a multi-step autonomous agent: you need chaining at the macro level. But aggressively compose within each step. Treat each chain node as an opportunity to consolidate — if two adjacent steps could share a context window without losing anything, merge them.

If you’re on a tight token budget (processing high volumes, watching costs): audit every chain for unnecessary context retransmission. Optimizing the chaining-versus-composition split at this level can cut costs by 30–50% in pipelines I’ve reviewed, without changing output quality.

If you’re building in n8n or Make: use the node graph for true chain steps only. Pre-assemble your prompts in a Code node or Function node before hitting the model API. Don’t let the visual metaphor of “nodes = steps” push you into over-chaining.

The right structure is the one that matches your actual data dependencies — not the one that looks most organized in a diagram.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes.
