Most prompt engineering advice stops at “write a good prompt.” That’s fine for simple lookups, but prompt chaining for agents is where the real leverage lives — taking a complex, multi-step problem and breaking it into a sequence of focused prompts where each output feeds the next. Done right, you get more reliable results, cheaper runs, and chains you can actually debug when something breaks. Done wrong, you get cascading errors, bloated context windows, and Claude hallucinating state that was never passed in.
This article covers the mechanics of building production-grade prompt chains: how to structure state passing between steps, how to enforce output consistency so downstream prompts don’t choke, and how to build error recovery that doesn’t just silently swallow failures. Every pattern here has been tested against actual workloads — not toy examples.
Why Single Prompts Break Down on Complex Tasks
A single mega-prompt trying to do research, reasoning, formatting, and decision-making simultaneously runs into several concrete problems. First, Claude’s attention degrades across very long prompts — instructions buried 2,000 tokens in get followed less reliably than instructions at the top. Second, if any step of your logic fails, you have no visibility into which step went wrong. Third, you’re paying for tokens you don’t need — if step 3 only needs the summary from step 2, there’s no reason to feed it the raw data from step 1.
Chaining solves this by giving each prompt a single job with a well-defined input contract and a well-defined output contract. Think of it like function composition: format(analyze(extract(raw_data))). Each function is testable in isolation. The chain is predictable because the interfaces are strict.
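The composition analogy can be made concrete with plain functions. The step bodies below are placeholders standing in for prompt calls, and the names are illustrative (using `format_report` rather than `format` to avoid shadowing the builtin):

```python
# Each function stands in for one focused prompt with a strict interface.
def extract(raw_data: str) -> dict:
    return {"facts": raw_data.split(". ")}

def analyze(entities: dict) -> dict:
    return {"n_facts": len(entities["facts"]), "facts": entities["facts"]}

def format_report(analysis: dict) -> str:
    return f"{analysis['n_facts']} facts found"

# The chain is just composition; each stage is testable in isolation.
result = format_report(analyze(extract("Revenue grew. Costs fell")))  # → "2 facts found"
```

Because each interface is a plain dict or string, you can unit-test any stage by handing it a fixture instead of a real upstream output.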
The Three Failure Modes You’ll Actually Hit
- Schema drift: Step 2 expects a JSON object with a `summary` key, but step 1 returned markdown with a “Summary” heading. Your parser crashes or silently passes garbage forward.
- Context bleed: You stuff the full conversation history into every step “just in case,” and by step 5 you’re paying for 6,000 tokens of context that contribute nothing.
- Silent degradation: A step returns a plausible-looking but incorrect result. Downstream steps confidently build on bad data. You only notice when the final output is wrong in a way that’s hard to trace.
The patterns below address all three directly.
Structuring State: What to Pass, What to Drop
The most important architectural decision in any prompt chain is what state flows between steps. The temptation is to pass everything — append each step’s output to a growing context blob. Don’t. Be surgical.
Define a chain state object at the start. Each step reads from it and writes back to it. Steps declare explicitly which keys they consume and which they produce. This keeps prompts lean and makes debugging trivial — you can inspect state at any node.
from dataclasses import dataclass, field
from typing import Optional
import anthropic
import json

@dataclass
class ChainState:
    raw_input: str
    extracted_entities: Optional[dict] = None
    analysis: Optional[str] = None
    final_output: Optional[str] = None
    errors: list = field(default_factory=list)
client = anthropic.Anthropic()
def run_step(prompt: str, state_snapshot: dict, step_name: str) -> str:
    """
    Runs a single chain step. state_snapshot contains only the keys
    this step actually needs — not the full state object. It is passed
    in so failures can be logged alongside the exact inputs.
    """
    try:
        response = client.messages.create(
            model="claude-haiku-4-5",  # ~$0.00025 per 1K input tokens at time of writing
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text
    except Exception as e:
        raise RuntimeError(f"Step '{step_name}' failed: {e}") from e
Notice the model choice: for extraction and formatting steps, Haiku is usually sufficient at roughly a tenth of Sonnet’s price. Save Sonnet or Opus for steps that require genuine reasoning depth. A 5-step chain where only step 3 needs heavy reasoning might cost $0.003 total vs $0.015 if you used Sonnet throughout — at scale, that matters.
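The routing math is easy to sanity-check. The per-1K prices below are assumptions for illustration, and the 800-token step size is made up; verify current pricing before relying on the numbers:

```python
# Illustrative per-1K-input-token prices (assumptions; check current pricing)
HAIKU_PER_1K = 0.00025
SONNET_PER_1K = 0.003

def estimate_input_cost(steps: list[tuple[int, float]]) -> float:
    """steps: (input_tokens, price_per_1k_tokens) for each chain step."""
    return sum(tokens / 1000 * price for tokens, price in steps)

# 5-step chain, ~800 input tokens per step, only one step routed to Sonnet
mixed = estimate_input_cost([(800, HAIKU_PER_1K)] * 4 + [(800, SONNET_PER_1K)])
all_sonnet = estimate_input_cost([(800, SONNET_PER_1K)] * 5)
# mixed ≈ $0.0032 vs all_sonnet ≈ $0.012 on input tokens alone
```

Output tokens cost more than input tokens on every tier, so the real gap is wider than this input-only estimate suggests.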
Enforcing Output Consistency with Typed Schemas
The single most important reliability improvement you can make in a prompt chain is forcing structured output at every boundary. Free-form text between steps is where chains fall apart in production.
For Claude, the most reliable approach is asking for JSON output with an explicit schema in the prompt, then validating before passing forward. Claude models follow JSON schemas well when you give a concrete example rather than an abstract description.
EXTRACTION_PROMPT = """
You are an entity extractor. Given the text below, extract key entities.
Return ONLY valid JSON matching this exact schema — no prose, no markdown fences:

{{
  "company": "string or null",
  "people": ["list of name strings"],
  "dates": ["list of ISO date strings"],
  "action_items": ["list of action strings"]
}}

Text to process:
{raw_input}
"""
def step_extract(state: ChainState) -> ChainState:
    prompt = EXTRACTION_PROMPT.format(raw_input=state.raw_input)
    raw = run_step(prompt, {"raw_input": state.raw_input}, "extract")

    # Validate before storing — don't trust the model blindly
    try:
        parsed = json.loads(raw)
        required_keys = {"company", "people", "dates", "action_items"}
        if not required_keys.issubset(parsed.keys()):
            raise ValueError(f"Missing keys: {required_keys - parsed.keys()}")
        state.extracted_entities = parsed
    except (json.JSONDecodeError, ValueError) as e:
        state.errors.append({"step": "extract", "error": str(e), "raw": raw})
        # Decide: halt chain or continue with degraded state?
        # For extraction failures, halting is usually right.
        raise
    return state
Two things to note here. First, “no markdown fences” in the prompt is load-bearing — without it, Claude frequently wraps JSON in triple backticks, breaking your parser. Second, you need to decide at each step whether a validation failure is terminal or recoverable. For extraction steps that feed everything downstream, fail fast. For enrichment steps that add optional metadata, you can skip and continue.
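As a belt-and-braces defense, you can also strip fences before parsing instead of letting an occasional fenced reply kill the step. A minimal helper; the function name is mine, not from any library:

```python
import json
import re

def parse_json_reply(raw: str) -> dict:
    """Parse a model reply as JSON, tolerating optional markdown fences.
    The prompt forbids fences, but this guards against the occasional slip."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = re.sub(r"^```[a-zA-Z]*\s*", "", cleaned)  # opening fence, e.g. ```json
        cleaned = re.sub(r"\s*```$", "", cleaned)           # closing fence
    return json.loads(cleaned)

parse_json_reply('```json\n{"company": "Acme"}\n```')  # → {'company': 'Acme'}
```

Keep the “no markdown fences” instruction in the prompt regardless; the helper is a fallback, not a substitute for a well-specified output contract.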
Building Error Recovery That Actually Works
The naive approach to error handling in chains is try/except at the top level: run the whole chain, catch any exception, log it. This is nearly useless — you know something broke but not where or why, and you have no partial results to salvage.
A better pattern is checkpoint-based recovery. After each successful step, persist state. If the chain fails at step 4 of 6, you can resume from step 4 rather than rerunning everything. For long chains on paid APIs, this pays for itself immediately.
import json
from pathlib import Path
from typing import Optional

class CheckpointedChain:
    def __init__(self, chain_id: str, checkpoint_dir: str = "/tmp/chain_checkpoints"):
        self.chain_id = chain_id
        self.checkpoint_path = Path(checkpoint_dir) / f"{chain_id}.json"
        self.checkpoint_path.parent.mkdir(parents=True, exist_ok=True)

    def save(self, state: ChainState, completed_step: str):
        checkpoint = {
            "completed_step": completed_step,
            "state": {
                "raw_input": state.raw_input,
                "extracted_entities": state.extracted_entities,
                "analysis": state.analysis,
                "final_output": state.final_output,
                "errors": state.errors,
            },
        }
        self.checkpoint_path.write_text(json.dumps(checkpoint))

    def load(self) -> Optional[tuple[ChainState, str]]:
        if not self.checkpoint_path.exists():
            return None
        data = json.loads(self.checkpoint_path.read_text())
        state = ChainState(**data["state"])
        return state, data["completed_step"]

    def run(self, initial_state: ChainState, steps: list[tuple[str, callable]]) -> ChainState:
        # Check for an existing checkpoint; resume after the last completed step
        checkpoint = self.load()
        if checkpoint:
            state, last_completed = checkpoint
            resume_at = next(
                i for i, (name, _) in enumerate(steps) if name == last_completed
            ) + 1
            remaining_steps = steps[resume_at:]
            print(f"Resuming from checkpoint after '{last_completed}'")
        else:
            state = initial_state
            remaining_steps = steps

        for step_name, step_fn in remaining_steps:
            state = step_fn(state)  # Raises on unrecoverable error
            self.save(state, step_name)
            print(f"✓ Completed: {step_name}")
        return state
In production, swap the file-based checkpoint for Redis or your existing datastore. The pattern is the same — serialize state after each node, look up on retry. This approach also makes your chains idempotent, which matters when you’re running them inside n8n or Make workflows where retries can be triggered automatically.
Sequential Reasoning: When to Chain vs When to Use Tool Calls
A question that comes up constantly: should you use prompt chaining or should you use Claude’s tool/function calling to orchestrate multi-step work? The honest answer is that they solve slightly different problems.
Use prompt chaining when:
- Each step produces text that the next step reasons over (summarize → critique → revise)
- You want deterministic step ordering with explicit state management
- Steps have meaningfully different complexity and you want to route them to different models
- You’re building outside a Claude session (batch jobs, n8n automations, backend pipelines)
Use tool calls when:
- Steps involve external API calls, database lookups, or code execution
- The model needs to decide dynamically which steps to take based on intermediate results
- You want Claude to orchestrate the loop rather than your application code
In practice, many production agents combine both: your application code runs the outer chain loop with explicit state management, and individual chain steps use tool calls for external data fetching. Don’t treat them as mutually exclusive.
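In the hybrid setup, an individual chain step passes tool definitions through the Messages API’s `tools` parameter while your application code still owns step ordering. A hypothetical tool definition, with an illustrative name and schema:

```python
# Hypothetical tool a chain step could expose for CRM enrichment;
# passed via the `tools` parameter of a messages.create call.
# The tool name and schema are illustrative, not from a real integration.
CRM_LOOKUP_TOOL = {
    "name": "lookup_company",
    "description": "Fetch the CRM record for a company by exact name.",
    "input_schema": {
        "type": "object",
        "properties": {"name": {"type": "string"}},
        "required": ["name"],
    },
}
```

The outer chain stays deterministic; only the fetch inside that one step is delegated to the model’s tool-use loop.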
Putting It Together: A Real Pipeline
Here’s what a complete 4-step chain looks like assembled — this processes a raw meeting transcript into structured action items with assigned owners and deadlines:
def run_meeting_pipeline(transcript: str, chain_id: str) -> ChainState:
    initial_state = ChainState(raw_input=transcript)
    chain = CheckpointedChain(chain_id)

    # Define steps in order — each function signature: (ChainState) -> ChainState
    steps = [
        ("extract_entities", step_extract),       # Haiku — cheap structured extraction
        ("analyze_discussion", step_analyze),     # Sonnet — reasoning over content
        ("generate_action_items", step_actions),  # Haiku — structured output from analysis
        ("format_output", step_format),           # Haiku — final formatting
    ]

    final_state = chain.run(initial_state, steps)

    if final_state.errors:
        print(f"Completed with {len(final_state.errors)} non-fatal errors:")
        for err in final_state.errors:
            print(f"  - {err['step']}: {err['error']}")
    return final_state

# Usage
result = run_meeting_pipeline(transcript_text, chain_id="meeting_2024_01_15_standup")
print(result.final_output)
At current pricing, this 4-step chain on a 500-word transcript costs roughly $0.0008 using the Haiku/Sonnet split described above. Running it in pure Sonnet would cost around $0.006 — about 7x more, when most of the value comes from the one step that actually needs the stronger model.
When to Use This Approach
Solo founders building automation pipelines: Start with prompt chaining for agents immediately. The checkpoint pattern pays off even on day one — you won’t burn API credits re-running successful steps because step 5 had a schema bug.
Teams building production agents: Add the typed schema validation layer before anything else hits staging. Schema drift between chain steps is the #1 source of silent failures in multi-step LLM pipelines. Treat every chain boundary like an API contract.
Cost-sensitive workloads: The model routing pattern (Haiku for extraction/formatting, Sonnet for reasoning) typically cuts chain costs by 60-80% with minimal quality loss. Profile which steps actually need deeper reasoning rather than assuming all steps need the same model.
n8n / Make users: The checkpointing pattern maps directly to your existing workflow retry logic. Store chain state in your database node between LLM call nodes. Each LLM node becomes a typed transformer with explicit input/output contracts — easier to maintain than a single massive prompt node.
The core principle of prompt chaining for agents is treating your LLM calls like software components: defined interfaces, observable state, recoverable failures. Once you build that way, complex multi-step AI tasks stop being brittle and start being maintainable.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes.