Most developers set temperature once and forget it. They pick 0.7 because it “feels balanced,” or they hammer everything to 0 because “determinism is safer.” Both approaches cost you in production: one way lies subtle bugs and boring outputs, the other brittle agents that can’t generate variation when you actually need it. Getting temperature and top-p right for production LLM workloads is one of those unsexy optimizations that quietly improves output quality across every task type you’re running.
This article covers how these parameters actually work under the hood, the three most common misconceptions that will burn you, and exact recommended settings for code generation, creative writing, and reasoning-heavy agents. There’s also a reusable Python wrapper you can drop into any project today.
What Temperature and Top-p Actually Control (Not What the Docs Say)
The standard explanation is “temperature controls randomness.” That’s accurate but not useful. Here’s what’s actually happening:
When an LLM generates a token, it produces a probability distribution over its entire vocabulary — potentially 50,000+ tokens. Temperature is a divisor applied to the raw logits (pre-softmax scores) before sampling. A temperature of 1.0 leaves the distribution unchanged. Temperature 0.5 sharpens it — high-probability tokens get relatively higher, low-probability tokens get crushed. Temperature 2.0 flattens it — everything becomes more equally likely, including nonsense.
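To make the divisor concrete, here’s a toy sketch of temperature-scaled softmax over a three-token vocabulary. This is pure-Python illustration, not a real inference stack; the logit values are made up:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Apply temperature as a divisor on raw logits, then softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-token "vocabulary" with raw logits
logits = [2.0, 1.0, 0.1]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# Top-token probability: ~0.86 at T=0.5, ~0.66 at T=1.0, ~0.50 at T=2.0
```

Lowering temperature sharpens the distribution around the top token; raising it flattens the distribution toward uniform, exactly as described above.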
At temperature 0, you’re doing greedy decoding: always pick the single highest-probability token. This is not deterministic on most hosted APIs — the documentation is misleading here. Floating-point non-determinism across GPU runs means you can get different outputs at temperature 0 even with identical inputs. Plan accordingly.
Top-p (nucleus sampling) works differently. Instead of scaling the whole distribution, it dynamically restricts the candidate token pool. With top-p 0.9, the model samples only from the smallest set of tokens whose cumulative probability adds up to 90%. If three tokens account for 90% of the probability mass, only those three are candidates. If it takes 800 tokens to reach 90%, all 800 are in the pool.
The practical difference: temperature is a blunt instrument that shifts the entire distribution; top-p is a scalpel that adapts to context. When the model is highly confident (peaked distribution), top-p naturally restricts candidates. When it’s genuinely uncertain, top-p allows more options. This makes top-p more context-sensitive than temperature alone.
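The peaked-vs-flat behavior is easy to see in a sketch of the candidate-pool computation. `nucleus_candidates` is a hypothetical helper written for illustration, not a library API:

```python
def nucleus_candidates(probs, top_p):
    """Return the smallest set of token indices (by descending probability)
    whose cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda x: x[1], reverse=True)
    pool, cumulative = [], 0.0
    for idx, p in ranked:
        pool.append(idx)
        cumulative += p
        if cumulative >= top_p:
            break
    return pool

# Peaked distribution (confident model): the pool collapses to 2 tokens
print(nucleus_candidates([0.8, 0.1, 0.05, 0.03, 0.02], 0.9))  # -> [0, 1]
# Flat distribution (uncertain model): all 5 tokens stay in the pool
print(nucleus_candidates([0.2, 0.2, 0.2, 0.2, 0.2], 0.9))
```

Same top-p value, radically different pool sizes: that adaptivity is what temperature alone can’t give you.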
Top-k: The Third Parameter You’re Probably Ignoring
Top-k caps the candidate pool at a fixed number regardless of probability distribution. Top-k 50 means sample from only the top 50 tokens by probability. It’s cruder than top-p and mostly useful for open-source inference where you want a hard ceiling. For hosted APIs like Claude or GPT-4, you rarely need to touch it — but if you’re running Mistral or Llama locally via Ollama, it’s worth combining: top-k 40, top-p 0.9 is a reasonable starting point.
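If you are combining the two locally, the usual ordering is top-k first as a hard cap, then top-p shrinking the pool further. This sketch assumes that ordering (common in llama.cpp-style samplers, but check your inference stack’s docs):

```python
def top_k_then_top_p(probs, k, top_p):
    """Hard-cap the candidate pool at k tokens, then keep only the
    smallest subset whose cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda x: x[1], reverse=True)[:k]
    pool, cumulative = [], 0.0
    for idx, p in ranked:
        pool.append(idx)
        cumulative += p
        if cumulative >= top_p:
            break
    return pool

# k=3 caps the pool before top-p is ever reached
print(top_k_then_top_p([0.4, 0.3, 0.15, 0.1, 0.05], 3, 0.9))  # -> [0, 1, 2]
# With a looser k, top-p does the pruning instead
print(top_k_then_top_p([0.4, 0.3, 0.15, 0.1, 0.05], 5, 0.9))  # -> [0, 1, 2, 3]
```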
The Three Misconceptions That Break Production Agents
Misconception 1: Temperature 0 = Deterministic
Already touched on this, but it’s worth being explicit: if you’re relying on temperature 0 for exact reproducibility in tests or pipelines, you will eventually be surprised. OpenAI acknowledges non-determinism in its docs, and Anthropic’s API can produce different outputs at temperature 0 across runs. The variance is small but real.
If you need true reproducibility (for evals, regression testing, caching), use a hash of the prompt as a cache key and store the first output — don’t rely on identical sampling. This is especially important if you’re building fallback and retry logic where you need to predict what a retry will produce.
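A minimal sketch of that hash-keyed approach, using an in-memory dict as the store (a real pipeline would back this with Redis or disk; `PromptCache` and `get_or_call` are illustrative names, not a library API):

```python
import hashlib
import json

class PromptCache:
    """Cache keyed on a hash of (model, prompt, settings) so retries and
    regression tests see the stored first output, not a fresh resample."""

    def __init__(self):
        self._store = {}

    def key(self, model, prompt, settings):
        # sort_keys makes the hash stable across dict orderings
        payload = json.dumps(
            {"model": model, "prompt": prompt, "settings": settings},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, settings, call_fn):
        k = self.key(model, prompt, settings)
        if k not in self._store:
            self._store[k] = call_fn()  # only the first call hits the API
        return self._store[k]
```

Every identical (model, prompt, settings) triple returns the stored first output, which is the only reliable way to get byte-identical results across runs.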
Misconception 2: Lower Temperature Always Means Higher Quality
This is the one that kills creative and writing agents. When temperature is too low, the model locks into its most probable completion paths — which often means generic, repetitive, or formulaic output. For code, that’s usually fine. For any task involving variation, tone, or originality, temperature 0–0.3 will produce outputs that feel like they were written by committee.
I ran a test generating 20 product descriptions at temperature 0.2 vs 0.8 for an e-commerce agent. At 0.2, 14 of 20 descriptions opened with “Introducing the…” or “Discover the…”. At 0.8, that dropped to 3 of 20. The higher-temperature outputs weren’t worse — they were more varied, which is the actual requirement for this use case. Don’t conflate “consistent” with “high quality.”
Misconception 3: You Should Set Top-p and Temperature Together
Most API documentation shows both parameters in examples. Most practitioners set both. This is almost always wrong. Temperature and top-p interact in non-obvious ways, and using both makes your outputs harder to reason about and tune. Pick one as your primary lever and leave the other at its default.
My recommendation: use temperature as your primary control (it’s more intuitive), and set top-p to 1.0 (effectively off). The exception is when you’re running high-temperature creative tasks and want to prevent genuinely degenerate outputs — in that case, cap top-p at 0.95 as a safety net.
Exact Settings for Each Agent Type
These are tuned settings based on running production workloads. They’re starting points, not gospel — run evals on your specific task before locking them in.
Code Generation Agents
- Temperature: 0.1–0.2
- Top-p: 1.0
- Reasoning: Code tasks have near-objective correct answers, so you want the model’s highest-confidence output. At temperature 0.1, you still get slight variation that avoids the greedy-decoding trap, but outputs stay tight and consistent.
- Watch out for: Unit test generation at temperature 0 — tests end up structurally identical and miss edge cases. Bump to 0.3 specifically for test generation.
See our Claude vs GPT-4 code generation benchmark for model-specific behavior — these temperature settings interact differently depending on which model you’re using.
Reasoning and Analysis Agents
- Temperature: 0.0–0.3
- Top-p: 1.0
- Reasoning: For classification, extraction, fact-checking, and structured analysis, you want the model’s best single answer. Higher temperature here introduces noise without benefit. If you’re worried about hallucinations, temperature is not your fix — structured output constraints and verification layers are. See the patterns in reducing LLM hallucinations in production for the right approach.
Creative Writing and Content Agents
- Temperature: 0.7–1.0
- Top-p: 0.95 (safety cap)
- Reasoning: You want variation. If you’re generating marketing copy, email subjects, or creative content at scale, temperature under 0.6 will produce repetitive outputs that users immediately notice feel “AI-generated.” The 0.95 top-p cap prevents truly degenerate token choices at the high end.
- Watch out for: Anything above 1.2 on most models produces noticeably incoherent text. Don’t chase entropy; 1.0 is the practical ceiling.
Conversational and Customer Support Agents
- Temperature: 0.4–0.6
- Top-p: 1.0
- Reasoning: You want consistent tone and reliable information, but responses shouldn’t feel robotic. The middle range gives you enough variation to feel natural while keeping factual claims stable. Pair this with good role prompting for consistent agent personality.
A Reusable Python Wrapper for Per-Task Temperature
Rather than hardcoding temperature into every API call, route it through a task-type registry. This makes tuning and testing significantly easier:
```python
import anthropic
from enum import Enum
from typing import Optional


class TaskType(Enum):
    CODE = "code"
    REASONING = "reasoning"
    CREATIVE = "creative"
    CONVERSATIONAL = "conversational"
    ANALYSIS = "analysis"


# Tuned defaults per task type
TASK_SETTINGS = {
    TaskType.CODE:           {"temperature": 0.1, "top_p": 1.0},
    TaskType.REASONING:      {"temperature": 0.2, "top_p": 1.0},
    TaskType.CREATIVE:       {"temperature": 0.85, "top_p": 0.95},
    TaskType.CONVERSATIONAL: {"temperature": 0.5, "top_p": 1.0},
    TaskType.ANALYSIS:       {"temperature": 0.1, "top_p": 1.0},
}

client = anthropic.Anthropic()


def call_with_task_settings(
    prompt: str,
    task_type: TaskType,
    system: Optional[str] = None,
    model: str = "claude-3-5-haiku-20241022",
    override_temperature: Optional[float] = None,
) -> str:
    settings = TASK_SETTINGS[task_type].copy()
    # Allow per-call overrides without touching defaults
    if override_temperature is not None:
        settings["temperature"] = override_temperature
    kwargs = {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
        **settings,
    }
    if system:
        kwargs["system"] = system
    response = client.messages.create(**kwargs)
    return response.content[0].text


# Usage: code task, no override needed
result = call_with_task_settings(
    prompt="Write a Python function to parse ISO 8601 dates",
    task_type=TaskType.CODE,
)

# Creative task with explicit override for experimentation
result = call_with_task_settings(
    prompt="Write 5 subject lines for a Black Friday email",
    task_type=TaskType.CREATIVE,
    override_temperature=1.0,  # Push variation higher for A/B testing
)
```
Running this at scale with Claude 3.5 Haiku costs roughly $0.0008 per 1K input tokens and $0.004 per 1K output tokens (as of mid-2025 — verify current pricing). For a typical 500-token prompt + 300-token response, that’s about $0.0016 per call. The task registry pattern adds zero latency overhead since it’s just a dict lookup before the API call.
Testing Temperature Settings: How to Actually Eval This
Don’t trust your gut on this. Run actual evals. Here’s a minimal setup:
```python
import json


def eval_temperature_variance(
    prompt: str,
    task_type: TaskType,
    temperatures: list[float],
    n_samples: int = 10,
) -> dict:
    """
    For each temperature, generate n_samples outputs and measure:
    - Unique output ratio (diversity)
    - Average output length
    """
    results = {}
    for temp in temperatures:
        outputs = []
        for _ in range(n_samples):
            output = call_with_task_settings(
                prompt=prompt,
                task_type=task_type,
                override_temperature=temp,
            )
            outputs.append(output)
        unique_ratio = len(set(outputs)) / len(outputs)
        avg_length = sum(len(o) for o in outputs) / len(outputs)
        results[temp] = {
            "unique_ratio": unique_ratio,  # 1.0 = all unique; lower = more repetition
            "avg_length": avg_length,
            "sample": outputs[0],  # First output for manual review
        }
    return results


# Run for creative task
eval_results = eval_temperature_variance(
    prompt="Write a one-sentence product tagline for a password manager",
    task_type=TaskType.CREATIVE,
    temperatures=[0.2, 0.5, 0.7, 0.9, 1.1],
    n_samples=10,
)

print(json.dumps({str(k): {**v, "sample": v["sample"][:80]}
                  for k, v in eval_results.items()}, indent=2))
```
Run this against your actual prompts and look at unique_ratio. For creative tasks, you want 0.8+. For code tasks, 0.3–0.5 is appropriate (some variation in style/naming is fine; identical logic is expected). If your “creative” agent is scoring 0.2 unique ratio, your temperature is too low regardless of what feels right intuitively.
When to Dynamically Adjust Temperature at Runtime
Static settings per task type get you 80% of the way there. For the remaining 20%, consider runtime adjustment:
- Retry on failure: If an LLM call returns an unusable output (parse error, refusal, hallucinated format), retry with temperature +0.1. Sometimes the model is “stuck” in a bad completion path and a small temperature bump breaks it out. This pairs well with error handling patterns for production Claude agents.
- Diversity-on-demand: If you’re generating N options for a user to choose from (email subjects, blog titles, ad copy variants), bump temperature relative to N. Generating 3 options? 0.7 works. Generating 10? Push to 1.0 or you’ll get near-duplicates.
- Confidence-based routing: Some pipelines use a first pass at temperature 0 for structured extraction, then a second pass at 0.5 for natural language explanation of the extracted data. Different task types, different settings, same pipeline run.
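The retry-on-failure pattern above can be sketched as a small wrapper around any call function. The `validate` hook, the 0.1 bump, and the 1.0 cap are illustrative choices from this article, not a standard:

```python
from typing import Callable, Optional

def call_with_retry(
    call_fn: Callable[..., str],
    base_temperature: float,
    max_retries: int = 3,
    bump: float = 0.1,
    validate: Optional[Callable[[str], bool]] = None,
) -> str:
    """Retry a model call, raising temperature by `bump` on each failed
    attempt to break the model out of a bad completion path."""
    temp = base_temperature
    last = ""
    for _ in range(max_retries + 1):
        last = call_fn(temperature=temp)
        if validate is None or validate(last):
            return last
        temp = min(temp + bump, 1.0)  # cap so retries never go degenerate
    return last  # or raise, depending on your pipeline's failure policy
```

Because `call_fn` is injected, the same wrapper works with the task-registry function above or any other client, and it is trivially testable with a stub.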
Bottom Line: Which Settings for Which Situation
Solo founder running automation workflows: Use the task registry pattern above. Start with the defaults I’ve provided, run the eval script on your top 3 prompt types, adjust once. You’ll gain more from this than from any prompt rewording.
Team building a multi-agent system: Treat temperature as a first-class configuration parameter, not a magic number buried in code. Store it alongside your system prompts, version it, and include it in your eval runs. When outputs degrade after a model update, temperature drift is often the culprit.
High-volume batch processing: Temperature 0.1 across the board unless you have a specific reason otherwise. Consistency and cost predictability matter more than output variety at scale. At 10,000 calls per day, even a 5% increase in output length from higher temperature adds up fast on token costs.
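The arithmetic behind that claim, as a back-of-envelope sketch using the assumed mid-2025 Haiku prices quoted earlier (verify current pricing before budgeting):

```python
# Assumed prices from the Haiku figures above, per 1K tokens
in_price_per_1k, out_price_per_1k = 0.0008, 0.004
calls_per_day, in_tokens, out_tokens = 10_000, 500, 300

daily_cost = calls_per_day * (
    in_tokens / 1000 * in_price_per_1k + out_tokens / 1000 * out_price_per_1k
)
# A 5% bump in output length only moves the output side of the bill
extra = calls_per_day * (out_tokens * 0.05) / 1000 * out_price_per_1k

print(round(daily_cost, 2), round(extra, 2))  # roughly $16/day base, ~$0.60/day extra
```

Sixty cents a day sounds trivial until you multiply across dozens of workflows; the point is that output-length drift from temperature changes is a measurable cost line, not noise.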
Getting temperature and top-p dialed in for production is a one-time investment that pays out continuously. The task registry pattern above takes about 30 minutes to implement, and the eval script will tell you within an hour whether your current settings are actually serving your use case — or just inherited from someone’s blog post they read two years ago.
Frequently Asked Questions
What temperature should I use for Claude agents in production?
It depends on the task. For code generation and structured extraction, use 0.1–0.2. For conversational agents, 0.4–0.6. For creative writing and content variation tasks, 0.7–0.9. Don’t use a single temperature across all agent tasks — the performance difference is measurable and significant.
What is the difference between temperature and top-p in LLMs?
Temperature scales the entire probability distribution over the vocabulary — lower values sharpen it (more predictable), higher values flatten it (more random). Top-p dynamically restricts the candidate pool to the smallest set of tokens summing to probability p. Temperature is a blunt instrument; top-p is context-sensitive. For most use cases, use temperature as your primary lever and leave top-p at 1.0.
Does temperature 0 guarantee the same output every time?
No. Both OpenAI and Anthropic APIs can produce different outputs at temperature 0 due to floating-point non-determinism across GPU runs. For reproducibility in testing or caching, use prompt hashing to store and retrieve outputs rather than relying on identical sampling behavior.
Should I set both temperature and top-p at the same time?
Generally no. Setting both parameters simultaneously makes outputs harder to reason about and tune because they interact non-linearly. Pick one as your primary control. The one reasonable exception is using top-p as a ceiling (e.g., 0.95) alongside high temperature settings to prevent degenerate outputs in creative tasks.
Why does my LLM agent produce repetitive outputs even at moderate temperature?
Repetitive outputs at temperature 0.5–0.7 usually indicate the prompt is over-constraining the model rather than a temperature problem. Highly specific formatting instructions, very short max_tokens limits, or system prompts that narrow the output space too aggressively can override temperature effects. Try removing constraints first before raising temperature further.
How does temperature affect LLM hallucinations?
Higher temperature increases the chance the model samples low-probability tokens, which can include plausible-sounding but incorrect facts. However, lowering temperature to 0 does not eliminate hallucinations — the model can still confidently produce the highest-probability incorrect output. Structural verification, retrieval grounding, and output validation are the real solutions; temperature is a weak lever for hallucination control.
Put this into practice
Try the Prompt Engineer agent — ready to use, no setup required.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

