Saturday, March 21

Most developers treat zero-shot vs few-shot as a coin flip — throw some examples in if the output looks bad, skip them if it seems fine. That’s leaving real quality on the table. When you’re building Claude agents, the decision of whether to include few-shot examples (and how many) has a measurable impact on output quality, latency, and token cost. Get it wrong and you’re either burning tokens on examples that add nothing, or shipping an agent that consistently misformats output because you were too frugal with context.

This article gives you an empirical framework for making that call — not vibes-based, but based on the actual signals Claude’s output gives you about whether it needs examples to do the job right.

What Zero-Shot and Few-Shot Actually Mean in an Agent Context

Zero-shot prompting means you describe the task and expected output in plain language, with no worked examples. Claude infers the pattern entirely from your instructions. Few-shot prompting means you include one or more input/output pairs that demonstrate exactly what you want before presenting the real task.

In a simple chatbot, this distinction is minor. In an agent pipeline, it matters a lot more. Agents typically need to produce structured outputs — JSON tool calls, formatted reports, classification labels, extraction schemas — and those outputs get parsed downstream. A slight deviation in format can break the whole chain.

Here’s what zero-shot looks like for a classification agent:

system_prompt = """
You are a support ticket classifier. 
Classify each ticket into one of: billing, technical, account, general.
Return only the label, nothing else.
"""

user_message = "My invoice shows a charge I don't recognize from last month."

And here’s the same task with two-shot prompting:

system_prompt = """
You are a support ticket classifier.
Classify each ticket into one of: billing, technical, account, general.
Return only the label, nothing else.

Examples:
User: "I can't log into my account after the password reset."
Label: account

User: "The API keeps returning 500 errors on POST requests."
Label: technical
"""

user_message = "My invoice shows a charge I don't recognize from last month."

Both will usually return “billing” here. So why add the examples at all? You wouldn’t — for this specific case. The question is knowing when that calculus flips.

When Zero-Shot Works Fine (Don’t Overcomplicate It)

Claude Sonnet and Haiku are genuinely good zero-shot models for well-defined tasks. You don’t always need examples. Zero-shot is the right default when:

  • The task maps cleanly to common patterns. Summarization, sentiment classification, simple extraction, translation — Claude has seen millions of examples of these during training. Your instructions alone are usually enough.
  • The output format is standard. If you’re asking for JSON that follows a schema, Claude handles it reliably with good schema documentation in the prompt. Same for markdown, numbered lists, CSV rows.
  • You’re working with Claude 3.5 Sonnet or Claude 3 Opus. These models have strong instruction-following out of the box. Haiku benefits more from examples on complex tasks.
  • Latency or cost is a real constraint. At Haiku pricing (~$0.00025 per 1K input tokens), adding 200 tokens of examples across 10,000 daily agent calls adds roughly $0.50/day — not catastrophic, but pointless if the examples don’t improve output.

A practical test: run your prompt zero-shot 20 times across varied inputs and check the output distribution. If format compliance is above 95% and semantic accuracy looks right, you’re done. Don’t add examples for the sake of it.
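That spot check doesn’t need tooling, but a small helper keeps it honest. Here’s a sketch, assuming you’ve already collected the raw outputs into a list — `VALID_LABELS` and `format_compliance` are illustrative names, not part of any SDK:

```python
# Minimal sketch of the 20-run spot check: collect raw zero-shot outputs,
# then measure what fraction are exactly one allowed label with no extra text.
# VALID_LABELS is the assumed label set from the classifier example above.
VALID_LABELS = {"billing", "technical", "account", "general"}

def format_compliance(outputs: list[str]) -> float:
    """Fraction of outputs that are exactly a valid label after trimming."""
    if not outputs:
        return 0.0
    ok = sum(1 for o in outputs if o.strip().lower() in VALID_LABELS)
    return ok / len(outputs)

# e.g. format_compliance(collected_outputs) >= 0.95 -> stay zero-shot
```

The same check generalizes to JSON tasks: swap the label-set test for a `json.loads` attempt and count parse failures.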

The Signals That Tell You Few-Shot Is Necessary

Zero-shot breaks down in predictable ways. Learning to recognize these patterns early saves you from shipping fragile agents.

Inconsistent Output Format

This is the most common failure mode. You ask for JSON and 90% of calls return valid JSON, but 10% include a leading “Here’s the JSON:” or wrap the output in markdown code fences. Downstream your parser chokes. This is exactly where one clean example fixes the problem — not more instructions, but a demonstrated example of what “correct” looks like.

# Without an example, Claude sometimes wraps output in ```json {...} ``` fences.
# One clean demonstration of the bare format usually stops this:

example = """
Example:
Input: "Customer called about delayed shipment, was frustrated, issue resolved."
Output: {"sentiment": "negative", "resolved": true, "category": "shipping"}
"""

Ambiguous Task Definitions

When the task involves judgment calls that reasonable people could interpret differently, examples act as calibration. “Write a professional email response” is ambiguous. Professional to a B2B SaaS company? A law firm? A startup? One example of the tone and register you want does more than three paragraphs of description.
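For instance, a single calibration example pins down register better than adjectives can. The company context and wording below are purely illustrative:

```python
# Hypothetical tone calibration for a B2B SaaS support reply. One worked
# example communicates register better than the word "professional" does.
email_system_prompt = """Write a professional email response to the customer.

Example of the tone we want:
Input: "Your dashboard has been down all morning and we have a board demo at 2pm."
Output: "Hi, thanks for flagging this, and sorry about the timing. We've found the
issue and expect the dashboard back within the hour. I'll confirm by email the
moment it's up so you're set for your 2pm demo."
"""
```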

Domain-Specific Patterns Claude Wasn’t Trained On

Your internal taxonomy, your company’s specific terminology, proprietary classification schemes — Claude will attempt to follow these from instructions alone, but it’ll drift. If you classify bugs as P0/P1/P2 but have internal definitions that don’t match standard severity conventions, show don’t tell.
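Here’s a sketch of what “show, don’t tell” looks like for an internal severity scheme — the definitions and examples below are invented for illustration:

```python
# Invented internal P0/P1/P2 definitions that deliberately deviate from
# common severity conventions, anchored with two worked examples.
severity_prompt = """Classify each bug report as P0, P1, or P2 using OUR definitions:
P0: any data loss or security exposure, regardless of how many users are affected
P1: a feature broken for paying customers, even if a workaround exists
P2: everything else

Examples:
Input: "Retrying a CSV export silently drops rows for one customer."
Output: P0

Input: "The dark mode toggle does nothing on the settings page."
Output: P2
"""
```

The examples carry the weight here: “one customer’s data loss is P0” contradicts most public severity schemes, so a demonstration is what keeps Claude from reverting to the convention it learned in training.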

Edge Cases That Keep Tripping the Agent

If you’re in production and a specific input pattern keeps producing wrong outputs, add it as a few-shot example. This is targeted debugging, not general prompting. One well-chosen counterexample can eliminate an entire failure class.
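In practice this can be as mechanical as folding the failure back into the prompt. A sketch, assuming your examples live in the system prompt as plain Input/Output pairs (`add_counterexample` is an illustrative helper, not an SDK call):

```python
def add_counterexample(system_prompt: str, failing_input: str, correct_output: str) -> str:
    """Append a production failure as one more worked example in the prompt."""
    example = f'\nInput: "{failing_input}"\nOutput: {correct_output}\n'
    return system_prompt.rstrip("\n") + "\n" + example
```

Keep these patches rare and pruned: each one should correspond to a failure class you actually observed, not a hypothetical.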

How to Structure Few-Shot Examples for Maximum Impact

Randomly throwing examples at a prompt rarely works as well as structured, deliberate example selection. Here’s what actually matters:

Quality Beats Quantity

Two perfect examples outperform six mediocre ones. Each example should be a canonical instance of correct behavior — unambiguous input, exactly the right output, no edge-case weirdness. If you’re unsure whether an example is clean enough to include, leave it out.

Research on few-shot prompting (including Anthropic’s own model cards) consistently shows diminishing returns after 3-5 examples for most classification and extraction tasks. Beyond that, you’re burning tokens without meaningful gains unless you’re covering genuinely distinct sub-cases.

Cover the Diversity, Not the Majority

A common mistake: using examples that all look similar. If 80% of your inputs are type A and 20% are type B, your instinct is to weight examples toward type A. Counterintuitively, you should weight toward type B — Claude already handles type A fine without examples. Use your example budget on the cases that are hard.
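One way to operationalize that is to rank candidate examples by label rarity and spend the budget rarest-first. A sketch with a hypothetical helper — your selection logic might weigh other signals too, such as observed failure rates per label:

```python
from collections import Counter

def pick_examples(labeled: list[tuple[str, str]], budget: int = 3) -> list[tuple[str, str]]:
    """Pick up to `budget` (text, label) pairs, rarest labels first,
    at most one example per label."""
    counts = Counter(label for _, label in labeled)
    ranked = sorted(labeled, key=lambda pair: counts[pair[1]])  # rarest first
    picked, seen = [], set()
    for text, label in ranked:
        if label not in seen:
            picked.append((text, label))
            seen.add(label)
        if len(picked) == budget:
            break
    return picked
```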

Keep Examples in the System Prompt, Not the User Turn

For agent pipelines where the user turn contains the actual task input, keep examples in the system prompt. This keeps your message structure clean and prevents Claude from confusing example inputs with real inputs.

import anthropic

client = anthropic.Anthropic()

def classify_ticket(ticket_text: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=50,
        system="""Classify support tickets into: billing, technical, account, general.
Return only the label.

Examples:
Input: "My card was charged twice for the same order."
Output: billing

Input: "I need to update the email address on my account."
Output: account

Input: "The webhook endpoint stopped receiving events after your last deployment."
Output: technical""",
        messages=[
            {"role": "user", "content": ticket_text}
        ]
    )
    return response.content[0].text.strip()

# Example usage
result = classify_ticket("My invoice shows a charge I don't recognize.")
print(result)  # billing

Format Examples Exactly Like Expected Output

This sounds obvious but gets skipped constantly. If your production code strips whitespace from Claude’s response before parsing, your examples don’t need to include whitespace. But if you expect a specific JSON key order, your examples should reflect that. Claude pattern-matches on the examples you give it — be precise.

Testing the Tradeoff: A Practical Evaluation Framework

Don’t guess. Run a quick eval before committing to a prompting strategy.

import anthropic

client = anthropic.Anthropic()

def evaluate_prompting_strategy(
    system_prompt: str,
    test_cases: list[dict],  # [{"input": "...", "expected": "..."}]
    model: str = "claude-3-5-haiku-20241022"
) -> dict:
    """
    Run a set of test cases and return accuracy + token usage.
    Compare zero-shot vs few-shot by passing different system prompts.
    """
    results = {"correct": 0, "total": len(test_cases), "total_input_tokens": 0}
    
    for case in test_cases:
        response = client.messages.create(
            model=model,
            max_tokens=100,
            system=system_prompt,
            messages=[{"role": "user", "content": case["input"]}]
        )
        
        output = response.content[0].text.strip().lower()
        expected = case["expected"].lower()
        
        if output == expected:
            results["correct"] += 1
        
        results["total_input_tokens"] += response.usage.input_tokens
    
    results["accuracy"] = results["correct"] / results["total"]
    results["avg_tokens_per_call"] = results["total_input_tokens"] / results["total"]
    return results

# Compare strategies. zero_shot_prompt and few_shot_prompt are system-prompt
# strings like the ones shown earlier; test_cases is your labeled eval set.
zero_shot_results = evaluate_prompting_strategy(zero_shot_prompt, test_cases)
few_shot_results = evaluate_prompting_strategy(few_shot_prompt, test_cases)

print(f"Zero-shot accuracy: {zero_shot_results['accuracy']:.1%}")
print(f"Zero-shot avg tokens: {zero_shot_results['avg_tokens_per_call']:.0f}")
print(f"Few-shot accuracy: {few_shot_results['accuracy']:.1%}")
print(f"Few-shot avg tokens: {few_shot_results['avg_tokens_per_call']:.0f}")

# If few-shot accuracy improvement < 5%, probably not worth the token cost
delta = few_shot_results['accuracy'] - zero_shot_results['accuracy']
print(f"Accuracy gain: {delta:.1%} — {'worth it' if delta > 0.05 else 'skip the examples'}")

Run this across 50-100 representative test cases. If few-shot improves accuracy by less than 5%, the token cost isn’t justified for most production use cases. If it’s 10%+, the examples are doing real work and you should keep them.

The Token Cost Reality Check

Let’s be concrete. Suppose you’re running a classification agent on Haiku processing 50,000 tickets/month. A zero-shot system prompt is 150 tokens. Adding three examples adds ~200 tokens.

  • Zero-shot: 50,000 × 150 = 7.5M input tokens → ~$1.88/month
  • Few-shot: 50,000 × 350 = 17.5M input tokens → ~$4.38/month

That’s $2.50/month extra. If your examples push accuracy from 91% to 97%, you’ve eliminated ~3,000 misclassifications per month. Almost certainly worth it. If accuracy goes from 96% to 97%, you’re paying $2.50 to fix 500 tickets. Maybe still worth it — depends entirely on the cost of a misclassification in your workflow.

The math changes dramatically on Sonnet (~10x the cost of Haiku) or for agents with longer context. Always run the numbers for your specific volume.
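If you want this as a reusable check, the arithmetic is one line. The $0.25-per-million figure below matches the Haiku input price assumed above — verify current pricing before relying on it:

```python
def monthly_input_cost(calls: int, prompt_tokens: int, usd_per_mtok: float) -> float:
    """Monthly input-token spend; output tokens cost the same either way."""
    return calls * prompt_tokens / 1_000_000 * usd_per_mtok

# Reproducing the numbers above at an assumed $0.25 per million input tokens:
zero_shot = monthly_input_cost(50_000, 150, 0.25)   # ~ $1.88
few_shot  = monthly_input_cost(50_000, 350, 0.25)   # ~ $4.38
delta = few_shot - zero_shot                        # ~ $2.50/month
```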

Bottom Line: When to Use Each Approach

Use zero-shot if: the task is well-defined, output format is standard, you’re using Sonnet or Opus, and your eval shows 95%+ accuracy without examples. This is the right default starting point for every new agent.

Add few-shot examples if: you’re seeing format inconsistency in production, the task involves ambiguous judgment calls, you’re using Haiku on complex tasks, or you have a specific failure pattern you need to eliminate. Two to three highly curated examples handle 90% of the cases where few-shot actually helps.

Don’t use few-shot as a crutch for bad instructions. If your task description is ambiguous, examples will paper over it temporarily but the underlying problem will resurface on inputs your examples didn’t cover. Fix the instructions first, then decide if examples are still needed.

For solo founders building internal tools: default to zero-shot and only add examples when you see actual failures. You’re usually running lower volumes where the difference in monthly cost is negligible but the time to curate good examples is real. For teams shipping production agents at scale, build the evaluation harness above into your CI pipeline — the few-shot vs zero-shot decision should be data-driven, not a judgment call made at 2am when something’s broken.

Few-shot prompting for Claude agents isn’t magic — it’s a targeted fix for specific failure modes. Use it that way, and you’ll build agents that are both cheaper and more reliable than ones built with cargo-culted “more examples = better” thinking.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
