Sunday, April 5

By the end of this tutorial, you’ll have a working GitHub webhook handler that sends pull request diffs to Claude, parses structured feedback on security issues, code style, and logic bugs, and posts that feedback as a PR comment — automatically, on every push. Automated code review with Claude is one of the highest-ROI automations you can wire up for a dev team, and the full implementation runs in under 200 lines of Python.

  1. Install dependencies — Set up PyGithub, Anthropic SDK, and Flask for the webhook server
  2. Configure GitHub webhook — Register the endpoint and extract PR diff data
  3. Build the Claude review prompt — Structure the system prompt for consistent, actionable output
  4. Parse and categorise feedback — Extract security, style, and logic issues into structured JSON
  5. Post feedback as a PR comment — Write formatted Markdown back to GitHub via the API
  6. Deploy and test end-to-end — Run locally with ngrok, then move to production

Why Claude for Automated Code Review?

Static analysis tools (ESLint, Bandit, Semgrep) are great at pattern matching but terrible at understanding intent. They’ll catch an unused import but miss “this SQL query is technically parameterised but the parameter is constructed from unsanitised user input three function calls up the stack.” Claude catches that.

I’ve compared Claude against GPT-4 for this specific use case — Claude 3.5 Sonnet consistently produces more specific, actionable feedback with fewer false positives on logic issues. If you want the full breakdown, the Claude vs GPT-4 code generation benchmark covers where each model wins. For code review, Claude’s longer attention span across large diffs and its tendency to explain why something is a problem (not just flag it) makes it the better fit here.

Cost at current pricing: Claude 3.5 Sonnet runs roughly $3 per million input tokens and $15 per million output tokens. A typical 300-line PR diff with a detailed system prompt is around 2,000 tokens in and 800 tokens out — call it $0.018 per review. For a team doing 50 PRs a week, that’s under $1/week. You can drop this further by using Haiku for the initial triage pass.
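The arithmetic above is worth a sanity check you can rerun when pricing changes. The rates below are hardcoded from the figures quoted here, so update them if Anthropic's pricing moves:

```python
# Back-of-envelope cost per review at the pricing quoted above.
SONNET_INPUT_PER_MTOK = 3.00    # USD per million input tokens
SONNET_OUTPUT_PER_MTOK = 15.00  # USD per million output tokens

def review_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one review call."""
    return (input_tokens / 1e6) * SONNET_INPUT_PER_MTOK + \
           (output_tokens / 1e6) * SONNET_OUTPUT_PER_MTOK

print(round(review_cost(2_000, 800), 3))       # typical PR: about $0.018
print(round(50 * review_cost(2_000, 800), 2))  # 50 PRs/week: under $1
```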

Step 1: Install Dependencies

You need four things: the Anthropic Python SDK, PyGithub for the API, Flask for the webhook receiver, and python-dotenv for config management.

pip install anthropic PyGithub flask python-dotenv

Create a .env file at your project root:

ANTHROPIC_API_KEY=sk-ant-...
GITHUB_TOKEN=ghp_...
GITHUB_WEBHOOK_SECRET=your_webhook_secret_here

Step 2: Configure the GitHub Webhook

In your GitHub repo settings, go to Webhooks → Add webhook. Set the payload URL to your server endpoint (we’ll use ngrok locally), content type to application/json, and select the Pull requests event only. Generate a random secret string and put it in your .env.
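One quick way to generate that secret is Python's standard-library `secrets` module. Run this once and paste the output into both the GitHub webhook form and your .env:

```python
# Generate a random webhook secret for GITHUB_WEBHOOK_SECRET.
import secrets

print(secrets.token_hex(32))  # 64 hex characters, e.g. for your .env
```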

Here’s the Flask receiver with HMAC signature verification — skip this and you’re accepting webhook payloads from anyone:

import hmac
import hashlib
import os
from flask import Flask, request, jsonify
from dotenv import load_dotenv

load_dotenv()
app = Flask(__name__)

def verify_signature(payload_body: bytes, signature_header: str) -> bool:
    """Verify the GitHub webhook HMAC-SHA256 signature."""
    if not signature_header:
        return False
    secret = os.getenv("GITHUB_WEBHOOK_SECRET").encode()
    expected = "sha256=" + hmac.new(secret, payload_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)

@app.route("/webhook", methods=["POST"])
def github_webhook():
    signature = request.headers.get("X-Hub-Signature-256", "")
    if not verify_signature(request.data, signature):
        return jsonify({"error": "Invalid signature"}), 401

    payload = request.json
    action = payload.get("action")
    
    # Only trigger on PR open or new commits pushed to an existing PR
    if action not in ("opened", "synchronize"):
        return jsonify({"status": "skipped"}), 200

    pr = payload["pull_request"]
    repo_full_name = payload["repository"]["full_name"]
    pr_number = pr["number"]
    
    # Kick off the review (in production, push this to a task queue)
    review_pull_request(repo_full_name, pr_number)
    return jsonify({"status": "review triggered"}), 200

if __name__ == "__main__":
    app.run(port=5000)
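You can exercise verify_signature before wiring up ngrok by computing GitHub's signature yourself. The secret and payload below are placeholders, and the commented `requests.post` line assumes the server from this step is running locally:

```python
# Simulate GitHub's webhook signature locally to test verify_signature.
import hashlib
import hmac
import json

secret = b"your_webhook_secret_here"  # must match GITHUB_WEBHOOK_SECRET
body = json.dumps({"action": "opened"}).encode()
signature = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

# POST it to the running server, e.g. with requests:
# requests.post("http://localhost:5000/webhook", data=body,
#               headers={"X-Hub-Signature-256": signature,
#                        "Content-Type": "application/json"})
```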

Step 3: Build the Claude Review Prompt

The system prompt is where this lives or dies. Vague instructions produce vague feedback. You want Claude to output structured JSON so you can parse, filter, and format it reliably. Always ask for structured output when you need to do something programmatic with the response — this is one of the core patterns for reducing hallucinations in production systems, which we covered in depth in our guide on structured outputs and verification patterns.

SYSTEM_PROMPT = """You are a senior software engineer performing a pull request code review.
Analyse the provided diff and return a JSON object with this exact structure:

{
  "summary": "2-3 sentence overview of what this PR does",
  "security_issues": [
    {
      "severity": "critical|high|medium|low",
      "file": "path/to/file.py",
      "line_hint": "approximate line or function name",
      "issue": "what the problem is",
      "recommendation": "specific fix"
    }
  ],
  "logic_issues": [...same structure...],
  "style_issues": [...same structure, severity always 'low'...],
  "positive_observations": ["thing done well", "..."],
  "overall_verdict": "approve|request_changes|comment"
}

Rules:
- Only flag real issues. No false positives on intentional patterns.
- Security issues: SQL injection, hardcoded secrets, missing input validation, insecure deserialization, path traversal, auth bypasses.
- Logic issues: off-by-one errors, missing null checks, race conditions, incorrect error handling, broken edge cases.
- Style: only flag things that affect readability or maintainability, not personal preference.
- Return ONLY the JSON object. No markdown fences, no preamble."""
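Because everything downstream trusts this schema, it's worth a cheap sanity check on the parsed response before formatting it. This validator is a sketch (the helper name is mine, not part of the tutorial's pipeline), but it catches the most common failure of a response that parses as JSON yet doesn't match the shape the prompt demands:

```python
# Sanity-check a parsed review dict against the schema in the system prompt.
REQUIRED_KEYS = {"summary", "security_issues", "logic_issues",
                 "style_issues", "positive_observations", "overall_verdict"}
VALID_VERDICTS = {"approve", "request_changes", "comment"}

def validate_review(review: dict) -> bool:
    """Return True only if all required keys exist and the verdict is valid."""
    if not REQUIRED_KEYS <= review.keys():
        return False
    return review.get("overall_verdict") in VALID_VERDICTS
```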

Good role prompting is half the battle here. If you’re building multiple Claude agents with consistent behaviour, the role prompting best practices guide is worth reading before you scale this pattern out.

Step 4: Fetch the Diff and Call Claude

import anthropic
from github import Github

def get_pr_diff(repo_full_name: str, pr_number: int) -> str:
    """Fetch the unified diff for a pull request via PyGithub."""
    g = Github(os.getenv("GITHUB_TOKEN"))
    repo = g.get_repo(repo_full_name)
    pr = repo.get_pull(pr_number)
    
    diff_parts = []
    for file in pr.get_files():
        if file.patch:  # Some files have no patch (binary, too large)
            diff_parts.append(f"### File: {file.filename}\n{file.patch}")
    
    return "\n\n".join(diff_parts)

def review_with_claude(diff: str) -> dict:
    """Send the diff to Claude and return parsed JSON feedback."""
    client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
    
    # Cap very large diffs to control cost. Claude 3.5 Sonnet has a
    # 200k-token context window, but ~60k characters (roughly 15k tokens)
    # is plenty for a useful review and keeps spend predictable
    if len(diff) > 60_000:
        diff = diff[:60_000] + "\n\n[Diff truncated — only first 60k chars reviewed]"
    
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        system=SYSTEM_PROMPT,
        messages=[
            {
                "role": "user",
                "content": f"Please review this pull request diff:\n\n{diff}"
            }
        ]
    )
    
    import json
    return json.loads(message.content[0].text)

Step 5: Post Feedback as a PR Comment

Turn the structured JSON into readable Markdown. GitHub PR comments support tables and emoji, which makes severity levels scannable at a glance.

def format_issues(issues: list, category: str) -> str:
    """Format a list of issue dicts into a Markdown table."""
    if not issues:
        return f"*No {category} issues found.*\n"
    
    lines = ["| Severity | Location | Issue | Recommendation |",
             "|----------|----------|-------|----------------|"]
    
    severity_emoji = {"critical": "🔴", "high": "🟠", "medium": "🟡", "low": "🔵"}
    
    for issue in issues:
        severity = issue.get("severity", "low")
        emoji = severity_emoji.get(severity, "⚪")
        location = f"`{issue.get('file', 'unknown')}` ({issue.get('line_hint', '')})"
        lines.append(
            f"| {emoji} {severity} | {location} "
            f"| {issue.get('issue', '')} | {issue.get('recommendation', '')} |"
        )
    return "\n".join(lines) + "\n"

def post_review_comment(repo_full_name: str, pr_number: int, review: dict):
    """Post the formatted review as a PR comment."""
    g = Github(os.getenv("GITHUB_TOKEN"))
    repo = g.get_repo(repo_full_name)
    pr = repo.get_pull(pr_number)
    
    verdict_emoji = {"approve": "✅", "request_changes": "❌", "comment": "💬"}
    verdict = review.get("overall_verdict", "comment")
    
    body = f"""## 🤖 Claude Code Review

**Verdict:** {verdict_emoji.get(verdict, '💬')} `{verdict.upper()}`

**Summary:** {review.get('summary', 'No summary available.')}

---

### 🔒 Security Issues
{format_issues(review.get('security_issues', []), 'security')}

### 🐛 Logic Issues
{format_issues(review.get('logic_issues', []), 'logic')}

### 🎨 Style Issues
{format_issues(review.get('style_issues', []), 'style')}

### ✨ What's Done Well
{chr(10).join('- ' + obs for obs in review.get('positive_observations', []))}

---
*Generated by Claude 3.5 Sonnet — treat as advisory, not authoritative.*"""
    
    pr.create_issue_comment(body)

def review_pull_request(repo_full_name: str, pr_number: int):
    """Orchestrate the full review pipeline."""
    diff = get_pr_diff(repo_full_name, pr_number)
    if not diff.strip():
        return  # Nothing to review (e.g., PR is only doc changes)
    
    review = review_with_claude(diff)
    post_review_comment(repo_full_name, pr_number, review)

Step 6: Deploy and Test End-to-End

Run the Flask server locally and expose it with ngrok for testing:

# Terminal 1
python app.py

# Terminal 2
ngrok http 5000
# Copy the https URL, e.g. https://abc123.ngrok.io
# Set webhook URL to: https://abc123.ngrok.io/webhook

Open a test PR with a deliberate security issue — something like a hardcoded API key in a config file. Within 10-15 seconds you should see the comment appear on the PR.
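Something like this, with an obviously fake key, reliably trips the security checks. The filename and contents are illustrative; never commit a real credential even in a test:

```python
# test_vuln.py — deliberately bad code for the test PR.
API_KEY = "sk-ant-not-a-real-key-for-testing"  # hardcoded secret: should be flagged

def get_user(cursor, user_id: str):
    # String-concatenated SQL: classic injection, should also be flagged
    cursor.execute("SELECT * FROM users WHERE id = " + user_id)
```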

For production, move the review_pull_request() call out of the request handler and into a task queue (Celery + Redis, or even a simple thread). GitHub webhook deliveries time out after 10 seconds; Claude takes 3-8 seconds to respond, plus network time. You don’t want missed webhooks because the review ran long on a big diff. For patterns on handling this gracefully, see our coverage of LLM fallback and retry logic in production.
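The simplest version of that offload is a background thread. A minimal sketch, with the review function passed in as a parameter so it can run standalone:

```python
import threading

def dispatch_review(repo_full_name: str, pr_number: int, worker) -> threading.Thread:
    """Run the review off the request thread so the webhook can return
    within GitHub's 10-second delivery deadline. `worker` stands in for
    review_pull_request (injected here so the sketch is testable)."""
    thread = threading.Thread(
        target=worker, args=(repo_full_name, pr_number), daemon=True
    )
    thread.start()
    return thread
```

In the webhook handler, replace the direct call with dispatch_review(repo_full_name, pr_number, review_pull_request) and return immediately. A daemon thread dies with the process, so a real queue is still the right answer once volume matters.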

Common Errors

JSON parsing fails on Claude’s response

Claude occasionally prepends or appends text outside the JSON, especially when the diff itself contains JSON or other content that invites commentary. Fix: wrap the json.loads() call with a regex extraction fallback.

import re, json

def safe_parse_json(text: str) -> dict:
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Extract JSON object between first { and last }
        match = re.search(r'\{.*\}', text, re.DOTALL)
        if match:
            return json.loads(match.group())
        raise ValueError(f"No valid JSON found in response: {text[:200]}")

Diff is too large and gets truncated unexpectedly

The 60k character truncation I added is conservative. The real issue is PRs with generated files (lockfiles, compiled assets) flooding the diff. Filter these before sending:

SKIP_EXTENSIONS = {'.lock', '.min.js', '.min.css', '.map', '.svg', '.png'}

for file in pr.get_files():
    ext = os.path.splitext(file.filename)[1]
    if file.patch and ext not in SKIP_EXTENSIONS:
        diff_parts.append(f"### File: {file.filename}\n{file.patch}")

GitHub token scope errors

If you see 403s when posting comments, your token needs repo scope (not just public_repo for private repos). Fine-grained tokens need explicit Pull Requests: Read and Write permission on the target repo. The error message from PyGithub is usually unhelpfully generic — check token scopes first.

What to Build Next

The natural extension is a tiered review pipeline: run Haiku first for a cheap triage pass (is this PR worth a full review? Does it touch security-sensitive files?), then invoke Sonnet only when the triage flags something meaningful. This cuts costs by 60-70% on teams with high PR volume where most changes are trivial. You can also wire this into a GitHub Actions workflow instead of a standalone webhook server — the Claude tool use guide covers the patterns for giving Claude access to additional context like test results or coverage reports, which you’d want to pull in for a more complete review.
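A heuristic pre-filter can even decide the routing before any API call. This chooser is a sketch; the sensitive-path markers, size threshold, and Haiku model ID are assumptions to adjust for your team:

```python
# Route low-risk PRs to Haiku, everything else to Sonnet.
SENSITIVE_MARKERS = ("auth", "crypto", "payment", "secret", ".env")

def choose_model(diff: str, changed_files: list[str]) -> str:
    """Pick a model per PR. Markers, the 10k-char threshold, and the
    model IDs are illustrative assumptions, not fixed recommendations."""
    touches_sensitive = any(
        marker in filename.lower()
        for filename in changed_files
        for marker in SENSITIVE_MARKERS
    )
    if touches_sensitive or len(diff) > 10_000:
        return "claude-3-5-sonnet-20241022"
    return "claude-3-5-haiku-20241022"
```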

Frequently Asked Questions

How accurate is Claude at finding real security vulnerabilities in code?

Claude is good at identifying common vulnerability classes (injection, hardcoded credentials, missing auth checks) but it’s not a replacement for a dedicated SAST tool like Semgrep or Snyk. The highest-value use is combining both: static analysis for known patterns, Claude for context-dependent logic that pattern matchers miss. Expect some false positives — tune your system prompt to reduce them over time.

Can I run automated Claude code review on private repositories?

Yes. The GitHub token just needs repo-level access. Your code diffs are sent to Anthropic’s API, so check your data handling requirements — if you’re under strict compliance rules, review Anthropic’s data processing agreement. You can also use the Anthropic API’s zero-data-retention option if available on your plan.

What’s the difference between this approach and using GitHub Copilot’s code review feature?

Copilot’s built-in review is tightly integrated into the GitHub UI but less customisable. The Claude approach lets you define exactly what you care about in the system prompt, post-process the JSON output to filter by severity, integrate with Slack/JIRA, or trigger different workflows based on verdict. You also control which model you’re using and can switch models without changing your tooling.

How do I prevent Claude from blocking PRs with too many false positives?

Don’t wire Claude’s verdict directly to GitHub’s branch protection rules at first. Run it in advisory mode (comments only) for 2-4 weeks, track false positive rate, and tune the system prompt before using it to block merges. Add a line to your prompt like “Only flag security_issues as ‘critical’ if you are highly confident — when in doubt, use ‘medium’.”

Can this handle multiple programming languages in the same PR?

Yes, Claude handles polyglot diffs well — a PR touching Python, TypeScript, and a Dockerfile in the same review will get coherent feedback on all three. Include the filename in each diff section (as the code above does) so Claude has language context. You may want language-specific system prompt sections for teams with very strict style guides in specific languages.

Put this into practice

Try the Unused Code Cleaner agent — ready to use, no setup required.

Browse Agents →

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

