If you’ve ever hit Claude’s context limit mid-conversation and watched your carefully assembled prompt get truncated, you already understand the problem. The question isn’t whether context management matters — it’s whether you’re doing it systematically or just hoping your prompts fit. Learning to optimize Claude’s context window is one of the highest-leverage skills you can develop when building production AI systems, and most developers are leaving significant capacity on the table.
Claude 3.5 Sonnet and Haiku both support 200K token context windows. That sounds enormous until you’re running a RAG pipeline, injecting tool outputs, maintaining conversation history, and trying to give the model enough background to actually be useful. Suddenly 200K tokens feels tight. This article covers the concrete techniques I use to fit dramatically more relevant information into Claude’s context without hitting limits or degrading response quality.
Why Naive Context Management Fails in Production
The most common mistake is treating the context window like a buffer — just keep appending until it’s full. This fails for three reasons. First, Claude’s attention isn’t uniformly distributed. The model attends more strongly to content at the beginning and end of the context, a pattern sometimes called the “lost in the middle” problem. Content buried in the middle of a 150K token prompt gets systematically underweighted. Second, raw token count doesn’t equal relevant information density. A 50K token context stuffed with irrelevant chunks is worse than a 10K context with precisely selected content. Third, cost scales linearly with tokens. At Claude 3.5 Sonnet pricing (~$3 per million input tokens), a bloated 100K token prompt costs $0.30 per call — that adds up fast if you’re running hundreds of queries.
The goal isn’t to fill the context window. The goal is to maximize the signal-to-noise ratio within whatever token budget you’re working with.
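To make the budgeting concrete, a back-of-the-envelope cost helper (a sketch; the constant uses the ~$3 per million input tokens figure for Claude 3.5 Sonnet cited above, so verify current pricing before relying on it):

```python
# Assumed pricing: ~$3 per million input tokens for Claude 3.5 Sonnet.
SONNET_INPUT_COST_PER_MTOK = 3.00

def estimate_input_cost(prompt_tokens: int, queries_per_day: int) -> float:
    """Estimated daily input-token spend in USD for a fixed prompt size."""
    per_call = prompt_tokens / 1_000_000 * SONNET_INPUT_COST_PER_MTOK
    return per_call * queries_per_day
```

Running this for a bloated 100K-token prompt at 500 queries/day puts daily input spend around $150, which is usually the number that convinces a team to invest in compression.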
Hierarchical Summarization: Compressing Without Losing Signal
For long documents or conversation histories, hierarchical summarization is the most reliable compression technique I’ve found. The idea: summarize aggressively at multiple levels of granularity, then inject the right level based on query type.
Building a Three-Tier Summary Structure
For each document or conversation thread, pre-generate three summaries: a one-sentence abstract, a three-paragraph overview, and a detailed section-by-section breakdown. Store all three. At query time, decide which tier to inject based on how relevant the document is to the current query.
import anthropic

client = anthropic.Anthropic()

def generate_summary_tiers(document: str, doc_id: str) -> dict:
    """
    Generate three tiers of summary for a document.
    Run this offline during ingestion, not at query time.
    """
    # Tier 1: One-sentence abstract (~20 tokens)
    abstract_response = client.messages.create(
        model="claude-3-haiku-20240307",  # Use Haiku for bulk summarization — ~10x cheaper
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"Summarize this document in one sentence:\n\n{document}"
        }]
    )
    # Tier 2: Three-paragraph overview (~300 tokens)
    overview_response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Write a 3-paragraph summary covering the main topics, key findings, and important details:\n\n{document}"
        }]
    )
    # Tier 3: Section-by-section breakdown (~1000 tokens)
    detailed_response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": f"Create a detailed structured summary with sections and bullet points:\n\n{document}"
        }]
    )
    return {
        "doc_id": doc_id,
        "abstract": abstract_response.content[0].text,  # ~20 tokens
        "overview": overview_response.content[0].text,  # ~300 tokens
        "detailed": detailed_response.content[0].text,  # ~1000 tokens
        "full": document  # original
    }
def select_summary_tier(relevance_score: float, summaries: dict) -> str:
    """
    Select which tier to inject based on relevance score (0-1).
    High relevance = more detail. Low relevance = just the abstract.
    """
    if relevance_score > 0.85:
        return summaries["detailed"]
    elif relevance_score > 0.6:
        return summaries["overview"]
    elif relevance_score > 0.3:
        return summaries["abstract"]
    else:
        return ""  # Don't inject at all
Using Haiku for bulk summarization during ingestion costs roughly $0.00025 per page of text — negligible compared to the query-time savings you get from injecting 300 tokens instead of 5,000. The key insight is that summarization is a one-time offline cost, not a per-query cost.
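At query time, the tiers assemble into a single knowledge block. A minimal sketch (pure Python; it inlines the same thresholds as select_summary_tier above so it runs standalone, and the input format of (score, tiers) pairs is an illustrative choice):

```python
def build_knowledge_block(scored_docs: list[tuple[float, dict]]) -> str:
    """Assemble a context block from (relevance_score, summary_tiers) pairs.

    Highest-relevance documents come first and get the detailed tier;
    marginal documents contribute only their one-sentence abstract.
    """
    parts = []
    for score, tiers in sorted(scored_docs, key=lambda p: p[0], reverse=True):
        if score > 0.85:
            parts.append(tiers["detailed"])
        elif score > 0.6:
            parts.append(tiers["overview"])
        elif score > 0.3:
            parts.append(tiers["abstract"])
        # Below 0.3: skip the document entirely
    return "\n\n".join(parts)
```

The payoff of pre-generating tiers is exactly this: tier selection at query time is a dictionary lookup, not a model call.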
Semantic Chunking and Retrieval-Augmented Generation
For knowledge bases larger than a few hundred documents, injecting everything is off the table — you need retrieval. But naive RAG (split into fixed-size chunks, embed, retrieve top-K) has a well-known failure mode: it retrieves semantically similar but contextually incomplete chunks. You get the paragraph that mentions the answer, but not the surrounding context that makes the answer interpretable.
Sentence-Window Retrieval
The fix I use most often is sentence-window retrieval. Embed at the sentence level for precision, but retrieve at the paragraph or section level for context. The embedding finds the right location; the larger window gives Claude enough surrounding text to reason correctly.
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class Chunk:
    text: str        # The full paragraph/section (injected into context)
    embed_text: str  # The sentence window (used for embedding)
    doc_id: str
    section: str
    token_count: int

def build_sentence_window_chunks(document: str, window_size: int = 3) -> List[Chunk]:
    """
    Split document into paragraphs, but create embeddings for sentence windows.
    At retrieval time, return the full paragraph, not just the matching window.

    window_size: number of sentences per embedding target.
    """
    paragraphs = [p.strip() for p in document.split('\n\n') if p.strip()]
    chunks = []
    for para in paragraphs:
        sentences = para.split('. ')
        # Create overlapping sentence windows for embedding
        for i in range(0, len(sentences), max(1, window_size // 2)):
            window = '. '.join(sentences[i:i + window_size])
            chunks.append(Chunk(
                text=para,          # Full paragraph goes into context
                embed_text=window,  # Sentence window used for retrieval
                doc_id="doc_1",
                section=f"para_{len(chunks)}",
                token_count=int(len(para.split()) * 1.3)  # rough token estimate
            ))
    return chunks
def retrieve_with_budget(query_embedding: np.ndarray,
                         chunks: List[Chunk],
                         embeddings: np.ndarray,
                         token_budget: int = 8000) -> List[Chunk]:
    """
    Retrieve chunks by relevance, but stop when we hit the token budget.
    Prevents context overflow from retrieval systems.
    """
    # Cosine similarity
    similarities = np.dot(embeddings, query_embedding) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
    )
    ranked_indices = np.argsort(similarities)[::-1]
    selected = []
    used_tokens = 0
    for idx in ranked_indices:
        chunk = chunks[idx]
        if used_tokens + chunk.token_count > token_budget:
            break  # Stop adding chunks once budget is exhausted
        selected.append(chunk)
        used_tokens += chunk.token_count
    return selected
The retrieve_with_budget function is something I add to every RAG system now. Without explicit token budgeting at retrieval time, you’ll eventually hit a query that retrieves a cluster of large chunks, blows past your context limit, and crashes your pipeline at 2am. Budget enforcement at retrieval time is non-negotiable in production.
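One design choice worth flagging: retrieve_with_budget stops at the first chunk that does not fit, which is predictable but can leave budget on the table. A skip-and-continue variant packs the budget more tightly, at the cost of occasionally promoting a lower-ranked chunk. A standalone sketch on plain token counts (already sorted by relevance), rather than the Chunk type above:

```python
def pack_by_relevance(ranked_token_counts: list[int], token_budget: int) -> list[int]:
    """Return indices (in relevance order) that fit under the budget.

    Unlike a hard break at the first oversized chunk, this skips chunks
    that don't fit and keeps scanning for smaller ones that do.
    """
    selected, used = [], 0
    for i, cost in enumerate(ranked_token_counts):
        if used + cost <= token_budget:
            selected.append(i)
            used += cost
    return selected
```

Which variant you want depends on whether strict relevance ordering or budget utilization matters more for your workload; the break version is easier to reason about in debugging.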
Conversation History Compression
Long-running agents and chatbots accumulate conversation history that eventually exceeds your context budget. The naive solution — truncating old messages — destroys context that might be critical. The right approach is progressive compression.
Rolling Compression with a Summary Buffer
Keep the last N turns verbatim (recent context is highest value), and maintain a running compressed summary of everything older. When the verbatim history grows too large, compress the oldest turns into the summary buffer.
class ConversationManager:
    def __init__(self, verbatim_turns: int = 6, max_summary_tokens: int = 2000):
        self.verbatim_turns = verbatim_turns  # Keep last 6 exchanges as-is
        self.max_summary_tokens = max_summary_tokens
        self.recent_history = []
        self.compressed_summary = ""

    def add_turn(self, role: str, content: str):
        self.recent_history.append({"role": role, "content": content})
        # When we exceed verbatim_turns exchanges, compress the oldest ones
        if len(self.recent_history) > self.verbatim_turns * 2:  # *2 for user+assistant
            self._compress_oldest_turns()

    def _compress_oldest_turns(self):
        """Compress the oldest two messages (one exchange) into the summary buffer."""
        to_compress = self.recent_history[:2]
        self.recent_history = self.recent_history[2:]
        exchange_text = "\n".join([
            f"{t['role'].upper()}: {t['content']}" for t in to_compress
        ])
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"""Update this conversation summary to include the new exchange.
Keep only the most important information. Max 200 words.

EXISTING SUMMARY:
{self.compressed_summary}

NEW EXCHANGE:
{exchange_text}

UPDATED SUMMARY:"""
            }]
        )
        self.compressed_summary = response.content[0].text

    def build_context(self) -> list:
        """Build the messages array to send to Claude."""
        messages = []
        # Inject compressed history as a leading user/assistant exchange
        if self.compressed_summary:
            messages.append({
                "role": "user",
                "content": f"[Earlier conversation summary: {self.compressed_summary}]"
            })
            messages.append({
                "role": "assistant",
                "content": "Understood, I have context from our earlier conversation."
            })
        # Append verbatim recent history
        messages.extend(self.recent_history)
        return messages
This approach keeps your conversation context bounded at roughly (verbatim_turns × 2 × avg_message_tokens) + max_summary_tokens, regardless of conversation length. For a typical support agent where messages average 200 tokens, that’s about 4,400 tokens for history — leaving over 195K tokens for everything else.
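The bound itself is simple arithmetic. A small helper (a sketch matching the manager's bookkeeping, where each exchange stores a user message and an assistant message) makes it explicit:

```python
def history_token_bound(verbatim_turns: int, avg_message_tokens: int,
                        max_summary_tokens: int) -> int:
    """Worst-case token footprint of the rolling-compression scheme.

    The manager keeps up to verbatim_turns * 2 messages verbatim
    (user + assistant per exchange), plus the bounded summary buffer.
    """
    return verbatim_turns * 2 * avg_message_tokens + max_summary_tokens
```

With the defaults above (6 exchanges, 200-token messages, a 2,000-token summary budget), the worst case is 4,400 tokens of history.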
Structural Prompt Organization for Attention Optimization
Beyond compression, the structure of what you inject matters. Since Claude attends more strongly to content at the beginning and end of the context, put your most critical instructions and the most relevant retrieved content in those positions.
A structure I’ve settled on for RAG agents looks like this:
- System prompt (top): Core behavioral instructions, persona, output format requirements — ~500 tokens
- Primary retrieved context: The highest-relevance chunks from your retrieval step — placed immediately after the system prompt
- Secondary context: Lower-relevance supporting documents, conversation summaries — middle of context
- Recent conversation history: Last few turns verbatim — placed near the end
- Current user query (bottom): Always last — this is what Claude is actively responding to
Don’t bury your most important retrieved document in position 47 of 60 chunks. Put your top-3 highest-relevance chunks at the start of the knowledge block, and your next-best chunks at the end. Sacrifice the middle positions for lower-value content.
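This reordering is mechanical enough to automate. A minimal sketch (the chunks arrive already sorted by relevance; the head size of 3 is an illustrative default, not a tuned value):

```python
def order_for_attention(chunks_by_relevance: list[str], head: int = 3) -> list[str]:
    """Place the top `head` chunks first, the next-best at the end, and
    the remainder in the middle, exploiting primacy/recency attention."""
    top = chunks_by_relevance[:head]
    rest = chunks_by_relevance[head:]
    tail = rest[:head]    # next-best chunks go at the end
    middle = rest[head:]  # lowest-value content sits in the middle
    return top + middle + tail
```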
Token Counting Before You Send
Claude’s API now exposes a count_tokens endpoint. Use it. Counting tokens before sending catches budget overruns before they become API errors, and lets you implement graceful degradation — drop the lowest-relevance chunks if you’re over budget.
def safe_send_with_fallback(messages: list,
                            model: str = "claude-3-5-sonnet-20241022",
                            max_context_tokens: int = 180000) -> str:
    """
    Count tokens first. If over budget, drop chunks until we fit.
    """
    # Claude's token counting endpoint — saves you from mid-flight errors
    token_count = client.messages.count_tokens(
        model=model,
        messages=messages
    )
    if token_count.input_tokens > max_context_tokens:
        print(f"Warning: {token_count.input_tokens} tokens exceeds budget. Trimming.")
        # Implement your own trim_to_budget — drop middle chunks first
        messages = trim_to_budget(messages, max_context_tokens)
    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=messages
    )
    return response.content[0].text
The count_tokens call doesn’t bill any tokens — it’s free to use, subject only to rate limits. There’s no excuse not to use it in production pipelines where token overruns would otherwise kill a job silently or throw a cryptic error.
When to Use Each Technique
These techniques aren’t mutually exclusive — production systems typically combine several. Here’s how I think about which to apply:
- Single long document + specific query: Hierarchical summarization + inject the appropriate tier
- Large knowledge base (100+ docs): Sentence-window RAG with token-budgeted retrieval
- Long-running agent or chatbot: Rolling compression with summary buffer
- Any production system: Pre-send token counting, always
- Everything: Structure your prompt with high-value content at top and bottom
For solo founders running tight cost budgets: Use Haiku for summarization/compression steps and Sonnet only for the final generation step. The quality difference on summarization tasks is minimal; the cost difference is roughly an order of magnitude. A typical RAG pipeline with Haiku-compressed context and Sonnet generation runs around $0.004-0.008 per query at current pricing — sustainable even at scale.
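As a sanity check on that per-query figure, a back-of-the-envelope calculator (a sketch; the pricing constants are the figures this article assumes, roughly $0.25/MTok Haiku input, $3/MTok Sonnet input, $15/MTok Sonnet output, with Haiku's small output cost ignored, so verify current rates before budgeting):

```python
# Assumed per-million-token rates; check the vendor's pricing page.
HAIKU_IN_PER_MTOK = 0.25
SONNET_IN_PER_MTOK = 3.00
SONNET_OUT_PER_MTOK = 15.00

def two_stage_query_cost(compress_in_tokens: int,
                         gen_in_tokens: int,
                         gen_out_tokens: int) -> float:
    """USD cost of one query: Haiku compression pass + Sonnet generation."""
    haiku_cost = compress_in_tokens / 1e6 * HAIKU_IN_PER_MTOK
    sonnet_cost = (gen_in_tokens / 1e6 * SONNET_IN_PER_MTOK
                   + gen_out_tokens / 1e6 * SONNET_OUT_PER_MTOK)
    return haiku_cost + sonnet_cost
```

For a query that compresses 5K tokens through Haiku and then generates from a 1K-token Sonnet prompt with a 100-token answer, this lands comfortably inside the $0.004-0.008 range quoted above.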
For teams building production agents: Invest in the sentence-window chunking infrastructure upfront. Fixed-size chunking is a false economy — you’ll spend more debugging retrieval failures than you saved on implementation time. And add token counting as a middleware layer that wraps every Claude call, not an afterthought.
The ability to optimize Claude’s context window effectively is what separates prototypes that demo well from systems that hold up under real workloads. Get the signal-to-noise ratio right, enforce budget constraints at every layer, and you’ll find that 200K tokens is more than enough for most applications — when you’re not wasting half of it on irrelevant content.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

