By the end of this tutorial, you’ll have a working contract review Claude agent that ingests PDF or plain-text contracts, flags potentially risky clauses (indemnification, auto-renewal, limitation of liability, and more), extracts key terms into structured JSON, and produces a plain-English executive summary — all in under 30 seconds per document.
Legal review is one of the highest-leverage places to deploy an LLM. A founder reviewing a vendor agreement, an ops team processing dozens of SaaS contracts, or a legal team triaging NDAs before sending them to counsel — all of them share the same bottleneck: reading the same clause patterns over and over. This agent doesn’t replace a lawyer. It removes the repetitive part so the lawyer (or you) can focus on the three clauses that actually need judgment.
- Install dependencies — Set up the Python environment with the Anthropic SDK and PDF parsing tools.
- Parse and chunk the contract — Extract clean text from PDFs and split intelligently.
- Design the system prompt — Give the agent clear extraction and risk-flagging instructions.
- Build the extraction function — Call Claude with structured output requirements.
- Add risk scoring logic — Score and rank flagged clauses by severity.
- Generate the executive summary — Produce a human-readable summary from structured data.
- Wire it together as a CLI tool — Combine all steps into a usable script.
Step 1: Install Dependencies
You need three things: the Anthropic SDK, a PDF parser, and Pydantic for structured output validation. I’m using pdfplumber over PyPDF2 because it handles multi-column layouts and tables significantly better, which matters for contracts with schedule tables.
```bash
pip install anthropic pdfplumber pydantic python-dotenv
```
Create a .env file with your key:
```
ANTHROPIC_API_KEY=sk-ant-...
```
Step 2: Parse and Chunk the Contract
Most contracts are 10–40 pages. Claude’s 200K context window handles that easily, but chunking by section still gives you better attribution — you want to know which clause is risky, not just that something somewhere is risky.
```python
import pdfplumber
import re


def extract_text_from_pdf(path: str) -> str:
    """Extract clean text from a PDF contract."""
    text = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text.append(page_text)
    return "\n\n".join(text)


def split_into_sections(text: str) -> list[dict]:
    """
    Split contract text into named sections.
    Looks for numbered headings like '1.', '2.1', 'SECTION 3', etc.
    Returns a list of {"heading": str, "content": str}.
    """
    # Pattern matches common contract section headers. The heading body
    # allows only spaces/tabs (not newlines) so a heading can't swallow
    # a capitalized word at the start of the next line.
    section_pattern = re.compile(
        r'(?:^|\n)((?:SECTION\s+)?\d+(?:\.\d+)*\.?\s+[A-Z][A-Z \t]{3,})',
        re.MULTILINE
    )
    matches = list(section_pattern.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        heading = match.group(1).strip()
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        content = text[start:end].strip()
        if content:
            sections.append({"heading": heading, "content": content})
    # If no sections were found, treat the entire document as one block
    if not sections:
        sections = [{"heading": "Full Contract", "content": text}]
    return sections
```
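To sanity-check the splitter before spending API calls, here is a standalone run of the heading pattern on a short synthetic snippet (the contract text below is invented for illustration):

```python
import re

# Same heading pattern as split_into_sections; the heading body is kept to
# spaces/tabs so a heading can't run past its own line.
section_pattern = re.compile(
    r'(?:^|\n)((?:SECTION\s+)?\d+(?:\.\d+)*\.?\s+[A-Z][A-Z \t]{3,})',
    re.MULTILINE
)

sample = (
    "1. PAYMENT TERMS\nall fees are due net 30.\n"
    "2.1 TERMINATION FOR CONVENIENCE\neither party may exit on 60 days notice.\n"
    "SECTION 3 GOVERNING LAW\nthis agreement is governed by Delaware law.\n"
)

headings = [m.group(1).strip() for m in section_pattern.finditer(sample)]
print(headings)
# → ['1. PAYMENT TERMS', '2.1 TERMINATION FOR CONVENIENCE', 'SECTION 3 GOVERNING LAW']
```

All three heading styles (plain numbered, dotted sub-section, `SECTION n`) resolve to clean headings, and the body text lands in `content` rather than being absorbed into the match.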
Step 3: Design the System Prompt
This is where most contract review agents fall down. Vague prompts like “review this contract for risks” produce vague outputs. You need to tell Claude exactly what risk categories to look for, what to extract, and what output format to return. If you want consistent agent behavior, role prompting with precise instructions makes a measurable difference here.
```python
SYSTEM_PROMPT = """You are a contract analysis specialist with expertise in commercial law.
Your job is to analyze contract sections and return structured JSON — nothing else.

For each section provided, you must:

1. EXTRACT these key terms if present (return null if absent):
   - payment_terms: payment schedule, amounts, late fees
   - termination_notice: required notice period for termination
   - auto_renewal: whether the contract auto-renews and under what conditions
   - governing_law: jurisdiction and governing law
   - liability_cap: maximum liability amount or formula
   - ip_ownership: who owns IP created under the contract

2. FLAG risks in these categories (severity: high/medium/low):
   - INDEMNIFICATION: broad or one-sided indemnification clauses
   - AUTO_RENEWAL: short notice windows or unfavorable renewal terms
   - LIABILITY_WAIVER: complete waiver or very low liability caps
   - IP_ASSIGNMENT: assignment of pre-existing IP or overly broad work-for-hire
   - TERMINATION: asymmetric termination rights or punitive exit clauses
   - EXCLUSIVITY: exclusivity that limits your business options
   - DATA_RIGHTS: vendor claiming rights to your data
   - AUDIT_RIGHTS: right for the other party to audit your systems

Return ONLY valid JSON in this exact schema:
{
  "section": "string",
  "key_terms": {
    "payment_terms": "string or null",
    "termination_notice": "string or null",
    "auto_renewal": "string or null",
    "governing_law": "string or null",
    "liability_cap": "string or null",
    "ip_ownership": "string or null"
  },
  "risk_flags": [
    {
      "category": "string",
      "severity": "high|medium|low",
      "clause_excerpt": "exact quote from contract (max 150 chars)",
      "explanation": "why this is risky in plain English",
      "recommendation": "what to negotiate or change"
    }
  ]
}"""
```
Step 4: Build the Extraction Function
I’m using claude-3-5-sonnet-20241022 here rather than Haiku. For contract review, the accuracy difference on subtle risk patterns (indirect indemnification, for instance) is worth the cost difference. At current Sonnet pricing (~$3/$15 per million input/output tokens), a 20-page contract costs roughly $0.04–$0.08 per full analysis. Haiku would cost ~$0.003 but missed edge cases in my testing. For high-volume batch work, see our guide on batch processing with the Claude API.
````python
import anthropic
import json
import os
import re

from dotenv import load_dotenv

load_dotenv()
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


def analyze_section(section: dict) -> dict | None:
    """
    Send a single contract section to Claude for risk analysis.
    Returns parsed JSON or None on failure.
    """
    content = section["content"][:4000]  # trim very long sections
    user_message = f"""Section: {section['heading']}

Content:
{content}

Analyze this section and return JSON as specified."""
    try:
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1500,
            system=SYSTEM_PROMPT,
            messages=[{"role": "user", "content": user_message}]
        )
        raw = response.content[0].text.strip()
        # Strip markdown code fences if Claude adds them despite instructions
        if raw.startswith("```"):
            raw = re.sub(r'^```(?:json)?\n?', '', raw)
            raw = re.sub(r'\n?```$', '', raw)
        return json.loads(raw)
    except json.JSONDecodeError as e:
        print(f"JSON parse error in section '{section['heading']}': {e}")
        return None
    except anthropic.APIError as e:
        print(f"API error: {e}")
        return None
````

Note the content trim happens before the f-string is built — if you slice inside the template, a trailing comment would become part of the message sent to Claude.
The JSON stripping regex is not optional — Claude will occasionally wrap JSON in code fences even with explicit instructions not to. This happens more often than the docs suggest. For production deployments, pairing this with structured output verification patterns will save you debugging time.
Step 5: Add Risk Scoring Logic
Raw flags aren’t enough — you want a prioritized list. This aggregates all section results and ranks by severity so reviewers see the most critical issues first.
```python
SEVERITY_SCORE = {"high": 3, "medium": 2, "low": 1}


def aggregate_results(section_results: list[dict]) -> dict:
    """Combine section-level results into a contract-level report."""
    all_risks = []
    all_terms = {
        "payment_terms": None,
        "termination_notice": None,
        "auto_renewal": None,
        "governing_law": None,
        "liability_cap": None,
        "ip_ownership": None
    }
    for result in section_results:
        if not result:
            continue
        # Merge key terms — first non-null value wins
        for key in all_terms:
            if all_terms[key] is None and result.get("key_terms", {}).get(key):
                all_terms[key] = result["key_terms"][key]
        # Collect all risk flags with section attribution
        for flag in result.get("risk_flags", []):
            flag["section"] = result.get("section", "Unknown")
            all_risks.append(flag)
    # Sort risks: high first, then medium, then low
    all_risks.sort(
        key=lambda x: SEVERITY_SCORE.get(x["severity"], 0),
        reverse=True
    )
    risk_summary = {
        "high": sum(1 for r in all_risks if r["severity"] == "high"),
        "medium": sum(1 for r in all_risks if r["severity"] == "medium"),
        "low": sum(1 for r in all_risks if r["severity"] == "low")
    }
    return {
        "key_terms": all_terms,
        "risk_flags": all_risks,
        "risk_summary": risk_summary
    }
```
Step 6: Generate the Executive Summary
The structured data is useful for programmatic processing, but most stakeholders want a plain-English paragraph they can forward. This step makes a second Claude call using only the aggregated data, not the raw contract text, so it’s fast and cheap.
```python
def generate_executive_summary(aggregated: dict, contract_name: str) -> str:
    """Generate a plain-English summary from structured analysis data."""
    prompt = f"""Based on this contract analysis data, write a concise executive summary
(3-4 paragraphs) for a business stakeholder. Be direct about risks.
Do not add legal disclaimers — the reader knows to consult a lawyer.

Contract: {contract_name}

Key Terms Found:
{json.dumps(aggregated['key_terms'], indent=2)}

Risk Summary: {aggregated['risk_summary']}

Top Risks:
{json.dumps(aggregated['risk_flags'][:5], indent=2)}

Write the summary now:"""
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Haiku is fine for summarization
        max_tokens=600,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
Step 7: Wire It Together as a CLI Tool
```python
import sys
import json
from pathlib import Path


def review_contract(file_path: str) -> None:
    path = Path(file_path)
    print(f"\n📄 Analyzing: {path.name}")

    # Extract text
    if path.suffix.lower() == ".pdf":
        text = extract_text_from_pdf(str(path))
    else:
        text = path.read_text(encoding="utf-8")

    # Split into sections
    sections = split_into_sections(text)
    print(f"  Found {len(sections)} sections to analyze...")

    # Analyze each section
    results = []
    for i, section in enumerate(sections):
        print(f"  [{i+1}/{len(sections)}] {section['heading'][:60]}")
        result = analyze_section(section)
        if result:
            results.append(result)

    # Aggregate
    aggregated = aggregate_results(results)

    # Executive summary
    summary = generate_executive_summary(aggregated, path.name)

    # Output
    print("\n" + "=" * 60)
    print("EXECUTIVE SUMMARY")
    print("=" * 60)
    print(summary)

    print("\n" + "=" * 60)
    print(f"RISK FLAGS: {aggregated['risk_summary']}")
    print("=" * 60)
    for flag in aggregated["risk_flags"]:
        icon = "🔴" if flag["severity"] == "high" else "🟡" if flag["severity"] == "medium" else "🟢"
        print(f"\n{icon} [{flag['severity'].upper()}] {flag['category']} — {flag['section']}")
        print(f"   Excerpt: \"{flag['clause_excerpt']}\"")
        print(f"   Risk: {flag['explanation']}")
        print(f"   Action: {flag['recommendation']}")

    print("\n" + "=" * 60)
    print("KEY TERMS EXTRACTED")
    print("=" * 60)
    for k, v in aggregated["key_terms"].items():
        print(f"  {k}: {v or 'Not found'}")

    # Save JSON output
    output_path = path.with_suffix(".review.json")
    with open(output_path, "w") as f:
        json.dump(aggregated, f, indent=2)
    print(f"\n✅ Full JSON saved to: {output_path}")


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python contract_review.py <contract.pdf>")
        sys.exit(1)
    review_contract(sys.argv[1])
```
Run it:

```bash
python contract_review.py vendor_agreement.pdf
```
Common Errors and How to Fix Them
JSON parsing fails intermittently
Even with explicit instructions, Claude occasionally returns JSON wrapped in code fences or with a leading sentence like “Here is the analysis:”. The regex stripping in Step 4 handles most cases. If you’re still seeing failures, define your schema as a tool and force the call with tool_choice; a forced tool call returns already-parsed JSON matching your input_schema, which is more reliable than prompt instructions alone. See the Claude tool use implementation guide for the exact pattern.
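A sketch of that pattern, reusing the Step 3 schema (the tool name `record_contract_analysis` is my own choice, and `key_terms` is left loosely typed for brevity):

```python
# Hypothetical tool whose input_schema mirrors the Step 3 JSON schema.
RISK_ANALYSIS_TOOL = {
    "name": "record_contract_analysis",
    "description": "Record the structured analysis of one contract section.",
    "input_schema": {
        "type": "object",
        "properties": {
            "section": {"type": "string"},
            "key_terms": {"type": "object"},
            "risk_flags": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "category": {"type": "string"},
                        "severity": {"type": "string", "enum": ["high", "medium", "low"]},
                        "clause_excerpt": {"type": "string"},
                        "explanation": {"type": "string"},
                        "recommendation": {"type": "string"},
                    },
                    "required": ["category", "severity", "explanation"],
                },
            },
        },
        "required": ["section", "risk_flags"],
    },
}


def analyze_section_forced(client, system_prompt: str, user_message: str) -> dict:
    """Force a tool call so the response is pre-parsed JSON (no fence stripping)."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1500,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}],
        tools=[RISK_ANALYSIS_TOOL],
        tool_choice={"type": "tool", "name": "record_contract_analysis"},
    )
    # With a forced tool_choice, the first content block is the tool call,
    # and its .input is a dict already shaped by input_schema.
    return response.content[0].input
```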
Section splitting misses large blocks of text
The regex in Step 2 targets numbered headings. Some contracts use ALL-CAPS headings without numbers (common in NDAs). Add this fallback pattern to the regex: r'(?:^|\n)([A-Z][A-Z\s]{4,}(?:\s*\n))'. If the contract is a single unsplit wall of text, skip splitting entirely and send the whole document in one call — 200K context handles 150 pages comfortably.
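For example, a combined pattern with the ALL-CAPS fallback looks like this (the sample text is invented; tune the minimum heading length for your corpus):

```python
import re

# Numbered headings from Step 2, plus an ALL-CAPS fallback for
# unnumbered headings like "CONFIDENTIALITY" (common in NDAs).
heading_pattern = re.compile(
    r'(?:^|\n)('
    r'(?:SECTION\s+)?\d+(?:\.\d+)*\.?\s+[A-Z][A-Z \t]{3,}'  # "2.1 TERMINATION"
    r'|[A-Z][A-Z \t]{4,}(?=\n)'                             # "CONFIDENTIALITY"
    r')',
    re.MULTILINE,
)

sample = (
    "CONFIDENTIALITY\n"
    "each party agrees to protect the other's information.\n"
    "2. INDEMNIFICATION\n"
    "vendor shall indemnify customer against third-party claims.\n"
)

headings = [m.group(1).strip() for m in heading_pattern.finditer(sample)]
print(headings)
# → ['CONFIDENTIALITY', '2. INDEMNIFICATION']
```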
API timeout on very large contracts
The client.messages.create call has a default timeout. For contracts over 50 pages analyzed section-by-section, add a retry wrapper. The pattern I use is exponential backoff with jitter — exactly what’s covered in our LLM fallback and retry logic guide. Don’t skip this in production; Anthropic’s API does occasionally return 529 errors under load.
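A minimal sketch of that wrapper, retrying on any exception by default (in production, narrow `retryable` to rate-limit and overload errors such as `anthropic.RateLimitError` rather than all exceptions):

```python
import random
import time


def with_retries(fn, max_attempts=5, base_delay=1.0, retryable=(Exception,)):
    """Call fn(); on a retryable error, back off exponentially with jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller see the error
            # 1s, 2s, 4s, 8s... plus up to base_delay of random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)


# Usage: result = with_retries(lambda: analyze_section(section))
```

The jitter matters when you process sections concurrently: without it, every worker that hit the same 529 retries at the same instant and hits it again.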
What to Build Next
Add a comparison mode: Take two contract versions (original and redlined) and diff the risk profiles. This is extremely useful for negotiation — you want to know if the new version actually fixed the liability cap issue or just reworded it. The implementation is straightforward: run review_contract() on both files, then write a third Claude call that compares the two aggregated JSON objects and highlights what changed, what improved, and what got worse. Pair this with a simple web frontend (FastAPI + HTMX works well) and you have something a legal team would actually pay for.
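As a sketch, the third call's prompt could be assembled like this (`build_comparison_prompt` is a hypothetical helper; feed its output to `client.messages.create` exactly as in Step 6):

```python
import json


def build_comparison_prompt(original: dict, redlined: dict) -> str:
    """Prompt a third Claude call to diff two aggregated review JSON objects."""
    return (
        "Compare these two analyses of the same contract (original vs. redlined). "
        "For each risk flag, state whether the redline fixed it, merely reworded "
        "it, or made it worse, and call out any new risks.\n\n"
        f"ORIGINAL ANALYSIS:\n{json.dumps(original, indent=2)}\n\n"
        f"REDLINED ANALYSIS:\n{json.dumps(redlined, indent=2)}"
    )
```

Because the inputs are the compact aggregated JSON objects rather than two full contracts, the comparison call stays cheap even for long documents.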
Frequently Asked Questions
Is a contract review Claude agent legally reliable enough to use without a lawyer?
No — and don’t pretend otherwise to stakeholders. This agent is excellent at flagging patterns and extracting terms you’d otherwise spend hours finding manually. It misses jurisdiction-specific nuance, won’t catch issues that require reading between the lines across multiple clauses, and can hallucinate on ambiguous language. Use it as a first-pass triage tool that tells a lawyer exactly where to focus, not as a replacement for legal review.
Which Claude model should I use for contract analysis — Sonnet or Haiku?
Use Sonnet for the main clause-by-clause analysis and Haiku for the executive summary generation. Sonnet catches subtle risk patterns (cross-referenced indemnification clauses, for instance) that Haiku misses in testing. The cost difference is roughly $0.04 vs $0.003 per 20-page contract — worth paying for accuracy on documents that could create legal liability. For bulk processing of low-stakes documents like simple NDAs, Haiku is fine.
How do I handle contracts that are scanned PDFs (images, not text)?
Add an OCR step before extract_text_from_pdf(). The most reliable Python option is pytesseract with pdf2image to convert pages to images first. Alternatively, use Claude’s vision capability directly by passing each page as a base64-encoded image — this is slower and more expensive but handles complex layouts better than Tesseract on messy scans.
Can I run this as part of an n8n or Make workflow?
Yes. Wrap the core functions as a FastAPI endpoint that accepts a file upload and returns the JSON analysis. Then call that endpoint from n8n’s HTTP Request node or Make’s HTTP module. For email-triggered workflows (contracts arriving via Gmail, for example), n8n’s Gmail trigger + HTTP Request node combination handles this cleanly end-to-end.
What’s the best way to improve accuracy on industry-specific contracts?
Add few-shot examples to the system prompt with 2–3 representative clause excerpts and their correct analysis. For example, if you process a lot of SaaS vendor agreements, include an example of a typical DPA clause and its correct risk assessment. This technique — covered in detail in zero-shot vs few-shot prompting benchmarks — consistently improves accuracy on domain-specific terminology without any fine-tuning.
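Concretely, you might append something like this to SYSTEM_PROMPT (the clause and its assessment below are invented illustrations, not real contract text):

```python
# Invented few-shot example appended to the Step 3 system prompt.
FEW_SHOT_SUFFIX = """

EXAMPLE INPUT:
"This Agreement renews automatically for successive one-year terms unless
either party provides written notice at least five (5) days before renewal."

EXAMPLE OUTPUT:
{"section": "Renewal",
 "key_terms": {"auto_renewal": "auto-renews annually; 5-day notice window"},
 "risk_flags": [{"category": "AUTO_RENEWAL", "severity": "high",
   "clause_excerpt": "notice at least five (5) days before renewal",
   "explanation": "A 5-day window makes cancellation easy to miss.",
   "recommendation": "Negotiate a 30-60 day notice window."}]}
"""

# Then pass system=SYSTEM_PROMPT + FEW_SHOT_SUFFIX in the messages.create call.
```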
Bottom Line: Who Should Build This
Solo founders and small teams — this is your highest-ROI AI build. You’re reviewing vendor contracts, employment agreements, and partnership deals with no legal budget. Even a rough version of this contract review Claude agent catches the auto-renewal traps and liability waivers that routinely bite early-stage companies. Build the CLI version first, use it for 30 days, then decide if you need a proper UI.
Developer agencies and consultancies — wrap this as a client-facing tool with a simple file upload interface. Legal teams at mid-market companies are actively looking for this. The structured JSON output makes it easy to integrate into existing document management systems.
Enterprise legal teams — the extraction and risk-flagging logic here is solid, but you’ll want to add PII handling, audit logging, and role-based output filtering before deploying broadly. Consider hosting on-premise or using Anthropic’s API with a data processing agreement in place if your contracts contain sensitive commercial terms.
Put this into practice
Try the Review Agent — ready to use, no setup required.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

