Monitoring Specialist: The Claude Code Agent That Builds Your Entire Observability Stack
Every senior developer has been there: it’s 2 AM, an alert fires, and you’re staring at a black hole of missing metrics, misconfigured log pipelines, and dashboards that show everything except what you actually need. Observability debt is brutal, and the worst part is that setting up a proper monitoring stack is genuinely time-consuming — not intellectually hard, but tedious. Writing Prometheus rules, wiring up OpenTelemetry, configuring Loki pipelines, creating Grafana dashboards, and documenting runbooks can eat a full sprint if you’re doing it right.
That’s the exact problem the Monitoring Specialist agent solves. This Claude Code agent is purpose-built for observability infrastructure. Instead of spending hours searching documentation, assembling configuration fragments from Stack Overflow, and manually cross-referencing alert thresholds with your SLOs, you describe your system and get back production-ready monitoring configuration — complete with retention policies, cost optimization notes, and runbooks.
The agent doesn’t just generate config files. It reasons about your stack through established frameworks: the Four Golden Signals, the RED method, the USE method. It knows when to alert on symptoms rather than causes. It groups alerts to minimize noise. If you’ve ever inherited a monitoring setup that pages on CPU usage instead of user-facing latency, you understand why that distinction matters at 2 AM.
When to Use the Monitoring Specialist
This agent is marked as PROACTIVE — meaning you should reach for it early in infrastructure work, not as a last resort when things break. Here are the scenarios where it pays the biggest dividends:
Greenfield Service Instrumentation
You’re building a new microservice and need to instrument it from scratch. Rather than treating observability as an afterthought, use this agent at the start to generate OpenTelemetry setup, define SLOs, and create the Prometheus metrics your service should expose before you write a single handler.
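As a sketch of what that starting point can look like, here is a minimal Prometheus scrape config for a hypothetical new service. The job name, port, and labels are illustrative assumptions, not something the agent prescribes:

```yaml
# Hypothetical scrape config for a brand-new service.
# "orders-api", the port, and the labels are placeholders.
scrape_configs:
  - job_name: "orders-api"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["orders-api:8000"]
        labels:
          team: backend
          environment: staging
```

Defining this alongside the first handler — rather than after launch — is exactly the habit the agent encourages.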
Alert Fatigue Remediation
Your on-call rotation is miserable because the alerting rules are a mess — too many low-signal pages, alerts that fire on causes rather than symptoms, no grouping logic. Hand the agent your current alert configuration and ask it to rationalize it against the Four Golden Signals framework. You’ll get restructured rules with proper severity levels and smart grouping.
Log Pipeline Design
You need to aggregate logs from a heterogeneous environment — Kubernetes pods, VM-based services, cloud functions — and route them into something queryable. The agent handles Fluentd, Loki, and ELK configurations, including log parsing rules and index lifecycle policies that won’t destroy your storage budget.
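To give a flavor of the output, here is a hedged sketch of a Promtail pipeline stage for Loki. The JSON log shape and label names are assumptions about your application:

```yaml
# Illustrative Promtail pipeline for Kubernetes pod logs.
# Assumes the app emits JSON logs with "level" and "request_id" fields.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - json:
          expressions:
            level: level
            request_id: request_id
      # Promote only the low-cardinality log level to a label.
      - labels:
          level:
      # Deliberately do NOT promote request_id to a label:
      # high-cardinality labels inflate Loki's index and storage costs.
```

The cardinality comment is the kind of storage-budget guardrail the agent bakes into its pipeline configs.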
Distributed Tracing Adoption
Your team wants to move from printf debugging to proper distributed tracing but nobody has set up Jaeger or Zipkin before. The agent generates OpenTelemetry instrumentation boilerplate, collector configurations, and sampling strategies appropriate for your traffic volume.
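A minimal sketch of what such a collector config might look like, assuming an OTLP-speaking Jaeger backend and a 10% head-based sampling ratio (both are illustrative choices, not recommendations for your traffic):

```yaml
# Sketch of an OpenTelemetry Collector pipeline with head-based sampling.
# The 10% ratio and the Jaeger endpoint are assumptions to adapt.
receivers:
  otlp:
    protocols:
      grpc:
processors:
  probabilistic_sampler:
    sampling_percentage: 10   # keep ~10% of traces; tune to traffic volume
  batch: {}
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp/jaeger]
```

For high-value, low-volume paths (checkout, payments), the agent may instead suggest tail-based sampling so that errored traces are always kept.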
SLA/SLO Reporting Automation
Your engineering manager wants weekly SLA reports and you’re currently generating them manually. The agent can build Prometheus recording rules, error budget calculations, and Grafana dashboard panels that automate this entirely.
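One possible shape for such a recording rule, assuming a generic `http_requests_total` counter (the metric name and 30-day window are illustrative):

```yaml
# Precompute a 30-day availability ratio for reporting dashboards.
# "http_requests_total" is an assumed metric name.
groups:
  - name: sla.reporting
    interval: 1m
    rules:
      - record: job:availability:ratio_rate30d
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[30d]))
            /
            sum(rate(http_requests_total[30d]))
          )
```

In practice a raw 30-day `rate()` is expensive to evaluate, so the agent will often chain shorter-window recording rules (for example, `avg_over_time` of a 5-minute ratio) to get the same figure cheaply.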
Pre-Production Readiness Reviews
Before a major release, use the agent to audit your observability coverage. It can identify gaps — services with no health endpoints, missing latency histograms, untracked background jobs — and generate the configuration to close them.
Real-World Examples
Scenario: Building a Prometheus + Grafana Stack for a Django API
A backend engineer is preparing a Django REST API for production. They have basic Django metrics exposed but no alerting, no dashboards, and no SLO definitions.
Developer: I'm deploying a Django REST API with Celery workers to Kubernetes.
I have django-prometheus installed and metrics are being scraped, but I have
no alerting rules, no dashboards, and no SLO defined. Traffic is roughly
500 req/s at peak. Help me build out the full monitoring layer.
Agent: I'll build you a complete monitoring stack based on the Four Golden
Signals and RED method for your API layer, with USE method coverage for
the Celery workers.
Here's your Prometheus alerting rules file:
groups:
  - name: django-api.rules
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(django_http_responses_total_by_status_total{status=~"5.."}[5m]))
          /
          sum(rate(django_http_responses_total_by_status_total[5m])) > 0.01
        for: 2m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "API error rate above 1% SLO threshold"
          runbook: "https://wiki.internal/runbooks/django-high-error-rate"
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(django_http_responses_latency_seconds_bucket[5m])) by (le, view)
          ) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s on {{ $labels.view }}"
  - name: celery-workers.rules
    rules:
      - alert: CeleryQueueDepthHigh
        expr: celery_queue_length > 1000
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Celery queue depth exceeding 1000 tasks — worker saturation likely"
For your SLO error budget, I'll add recording rules:
groups:
  - name: slo.recording
    rules:
      - record: job:django_request_error_rate:ratio_rate5m
        expr: |
          sum(rate(django_http_responses_total_by_status_total{status=~"5.."}[5m]))
          /
          sum(rate(django_http_responses_total_by_status_total[5m]))
      - record: job:error_budget_remaining:ratio
        expr: 1 - (job:django_request_error_rate:ratio_rate5m / 0.01)
I'll also generate a Grafana dashboard JSON with panels for request rate,
error rate, P50/P95/P99 latency distribution, Celery queue depth, and
worker utilization. Retention recommendation: 15-day raw retention in
Prometheus with 1-year downsampled in Thanos or Cortex to control storage costs.
Want me to generate the full Grafana dashboard JSON and the OpenTelemetry
collector config for distributed tracing on your critical API views?
Scenario: Rationalizing an Overloaded Alert Configuration
A platform engineer is dealing with alert fatigue. Their team is being paged dozens of times per week on low-signal alerts.
Developer: Our on-call is getting destroyed. We have 47 active alert rules,
most of them firing on infrastructure metrics like CPU and memory. Our
error budget is basically undefined and nobody knows what actually requires
a 3 AM page vs. a Slack notification. Can you help me restructure this?
Agent: This is a classic alert-on-causes problem. Let me apply a symptom-based
model. First, the principle: page humans only for conditions that are
user-facing and require immediate human judgment. Everything else should
be a ticket or a Slack message.
I'll restructure your alerts into three tiers:
TIER 1 — PAGE (requires immediate human response):
- User-facing error rate > SLO threshold for > 2 minutes
- P99 latency > SLO threshold for > 5 minutes
- Complete service unavailability (no successful requests for 1 minute)
TIER 2 — SLACK WARNING (investigate during business hours):
- Error budget burn rate exceeding 5x for 1 hour
- Dependency saturation approaching limits (disk > 85%, connection pool > 80%)
- Deployment anomaly detected (error rate spike within 10 min of deploy)
TIER 3 — TICKET / DASHBOARD ONLY:
- CPU, memory, individual pod restarts (unless causing Tier 1/2 symptoms)
- Cron job duration drift
- Log volume anomalies
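The Tier 2 "burn rate exceeding 5x" item can be expressed as a Prometheus rule. This sketch assumes a 1% error SLO and that hourly and 5-minute error-rate recording rules exist, analogous to the `job:django_request_error_rate:ratio_rate5m` rule from the earlier scenario (the `ratio_rate1h` rule is an assumed counterpart):

```yaml
# Multi-window burn-rate alert: fires only when both the long and short
# windows show >5x budget consumption, which filters out brief blips.
groups:
  - name: slo.burnrate
    rules:
      - alert: ErrorBudgetBurnRateHigh
        expr: |
          job:django_request_error_rate:ratio_rate1h > (5 * 0.01)
          and
          job:django_request_error_rate:ratio_rate5m > (5 * 0.01)
        labels:
          severity: warning
        annotations:
          summary: "Burning error budget at >5x the sustainable rate"
```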
For your CPU and memory alerts specifically: CPU usage is almost never
worth a page. High CPU is a cause, not a symptom. Replace with:
- alert: ServiceSaturationWarning
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m])) > 0
      and
      sum(rate(container_cpu_usage_seconds_total[5m])) /
      sum(container_spec_cpu_quota / container_spec_cpu_period) > 0.9
    )
  annotations:
    summary: "High CPU coinciding with errors — likely saturation causing failures"
This fires only when CPU is high AND errors are happening simultaneously,
making it actionable. I'll generate a full restructured alert file and
an AlertManager grouping configuration that suppresses child alerts
when a parent service-level alert is already firing.
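The grouping and suppression the agent describes might look roughly like this in AlertManager. Receiver names and the `service` label are placeholders for your own setup:

```yaml
# Sketch of AlertManager routing and inhibition.
# "slack-default", "pagerduty", and the label values are assumptions.
route:
  receiver: slack-default
  group_by: ["alertname", "service"]
  group_wait: 30s        # batch alerts that fire together into one page
  group_interval: 5m
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty
inhibit_rules:
  # Suppress cause-level warnings while a service-level critical fires.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ["service"]
```

The `inhibit_rules` block is what keeps a single outage from generating a flood of secondary pages.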
What Makes This Agent Powerful
Framework-Driven Reasoning
The agent doesn’t guess at what to monitor. It applies established methodologies — Four Golden Signals for service health, RED for request-driven services, USE for infrastructure resources — and selects the right lens for the context you provide. This means its output reflects how experienced SREs actually think about observability, not a generic checklist.
Full-Stack Output
Most documentation covers one layer at a time. This agent generates complete, integrated configurations: Prometheus scrape configs, alerting rules, recording rules for SLOs, Grafana dashboard JSON, Fluentd or Loki pipeline configs, OpenTelemetry collector YAML, and AlertManager routing trees. Everything you need to close the loop from instrumentation to incident response.
Cost Awareness
Observability infrastructure has real costs — storage, cardinality, ingestion pricing. The agent includes retention policies and cost optimization strategies in its output. High-cardinality metrics, over-broad label sets, and infinite retention are common ways teams blow their observability budget; this agent flags and mitigates those patterns by default.
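One common mitigation the agent applies is dropping high-cardinality labels and debug-only metrics at scrape time. The label and metric names below are illustrative:

```yaml
# Scrape-time cardinality control via metric relabeling.
# "request_id" and "myapp_debug_*" are hypothetical examples.
scrape_configs:
  - job_name: "api"
    static_configs:
      - targets: ["api:8000"]
    metric_relabel_configs:
      # Drop a per-request ID label that would explode series cardinality.
      - action: labeldrop
        regex: request_id
      # Drop an entire debug-only metric family before ingestion.
      - source_labels: [__name__]
        regex: "myapp_debug_.*"
        action: drop
```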
Runbook Generation
Every alert it produces can be paired with a runbook. Rather than leaving on-call engineers to figure out remediation steps under pressure, the agent generates structured runbook templates covering diagnosis steps, common causes, and resolution procedures — directly linked from alert annotations.
Alert Fatigue Prevention
The agent’s core heuristic — alert on symptoms, not causes — is enforced throughout. It applies smart grouping, burn rate calculations, and severity tiers to keep your on-call rotation sane. This isn’t just a configuration detail; it’s the difference between a monitoring setup that helps you and one that trains your team to ignore pages.
How to Install the Monitoring Specialist
Installation is straightforward. Claude Code automatically discovers agent files placed in the .claude/agents/ directory of your project or home folder.
Create the agent file at the following path:
.claude/agents/monitoring-specialist.md
Paste the following system prompt as the file contents:
---
name: Monitoring Specialist
description: Monitoring and observability infrastructure specialist. Use PROACTIVELY for metrics collection, alerting systems, log aggregation, distributed tracing, SLA monitoring, and performance dashboards.
---
You are a monitoring specialist focused on observability infrastructure and performance analytics.
## Focus Areas
- Metrics collection (Prometheus, InfluxDB, DataDog)
- Log aggregation and analysis (ELK, Fluentd, Loki)
- Distributed tracing (Jaeger, Zipkin, OpenTelemetry)
- Alerting and notification systems
- Dashboard creation and visualization
- SLA/SLO monitoring and incident response
## Approach
1. Four Golden Signals: latency, traffic, errors, saturation
2. RED method: Rate, Errors, Duration
3. USE method: Utilization, Saturation, Errors
4. Alert on symptoms, not causes
5. Minimize alert fatigue with smart grouping
## Output
- Complete monitoring stack configuration
- Prometheus rules and Grafana dashboards
- Log parsing and alerting rules
- OpenTelemetry instrumentation setup
- SLA monitoring and reporting automation
- Runbooks for common alert scenarios
Include retention policies and cost optimization strategies. Focus on actionable alerts only.
Once the file is saved, Claude Code loads it automatically. You can invoke the agent directly in any session by referencing it: use the monitoring-specialist agent to... — or let Claude Code select it automatically when your request matches observability and infrastructure topics.
The agent file can live at the project level (inside a specific repository’s .claude/agents/ folder) or at the global level in your home directory’s ~/.claude/agents/ folder, making it available across all projects.
Practical Next Steps
Install the agent today and put it to work on something concrete. If you have a service going to production in the next two weeks, run the agent against it now — ask it to audit your observability coverage and generate the missing configuration. If your on-call rotation is painful, paste your current alert rules and ask the agent to restructure them using the symptom-based tiering approach. If you’ve been procrastinating on SLO definitions, let the agent generate error budget recording rules and a burn rate dashboard as a starting point.
Observability is one of those disciplines where doing it right early saves enormous pain later. The Monitoring Specialist agent removes the friction that makes teams defer it. The configuration it generates isn’t a prototype — it’s production-grade, framework-grounded, and ready to adapt to your specific stack. Ship it, tune the thresholds for your actual traffic patterns, and stop treating monitoring as something you’ll get to eventually.
Agent template sourced from the claude-code-templates open source project (MIT License).
