Database Admin Agent for Claude Code: Stop Fighting Fires at 3am
Database administration is one of those disciplines where the gap between “it’s working” and “it’s working reliably” is enormous. Most developers can spin up a Postgres instance, run migrations, and call it done. But configuring replication lag alerts, testing backup restoration, setting up connection pooling, or writing a disaster recovery runbook with realistic RTO/RPO targets? That’s where hours disappear and corners get cut — usually until a production incident forces the reckoning.
The Database Admin agent for Claude Code closes that gap. It’s a specialized agent that brings operational database expertise directly into your workflow, covering the full spectrum of DBA responsibilities: backup strategies, replication configuration, user permission audits, monitoring queries, maintenance automation, and disaster recovery procedures. Whether you’re setting up a new production database or scrambling to understand why replication lag is climbing at midnight, this agent gives you a knowledgeable collaborator who thinks in terms of reliability, not just functionality.
The time savings compound. This agent doesn’t just answer questions; it generates complete, production-ready artifacts: backup scripts with retention policies, monitoring queries with alert thresholds, user permission matrices enforcing least privilege, and runbooks written for 3am emergencies when cognitive load is highest. These are the deliverables that experienced DBAs produce after years of painful lessons. You get them on demand.
When to Use This Agent
The agent description says to use it proactively, and that framing matters. Don’t wait for a production incident. These are the situations where reaching for this agent pays dividends:
New Database Setup
Starting a new production database environment? Use this agent before you write a single application query. It will configure replication from the start, establish backup schedules, set up connection pooling, and create users with appropriate least-privilege permissions. Retrofitting these things into an existing system is far more painful than doing them correctly upfront.
Disaster Recovery Planning
Most teams have backups. Fewer teams have tested their backups. Almost none have a documented runbook that a junior engineer can execute at 3am without calling someone. This agent generates complete disaster recovery procedures with explicit RTO and RPO targets, manual and automated recovery steps, and validation checklists. Run a tabletop exercise against the output before you need it for real.
Performance and Capacity Issues
When your database starts struggling — connection pool exhaustion, lock contention, slow queries, replication lag accumulating — you need specific monitoring queries and alert thresholds, not generic advice. This agent generates targeted SQL for diagnosing the exact issue you’re seeing and establishes monitoring baselines so you catch the next problem before it becomes an outage.
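As a rough illustration of what "a monitoring query plus an alert threshold" can look like, here is a minimal sketch for connection pool pressure. The SQL uses standard PostgreSQL views (pg_stat_activity, with backend_type available from PostgreSQL 10 on). The 70% and 90% thresholds and the names Threshold and classify_connections are assumptions made for this example, not values the agent is known to emit.

```python
from dataclasses import dataclass

# Client connections in use versus the configured limit (PostgreSQL 10+).
CONNECTION_USAGE_SQL = """
SELECT count(*) AS used,
       current_setting('max_connections')::int AS max_conn
FROM pg_stat_activity
WHERE backend_type = 'client backend';
"""


@dataclass(frozen=True)
class Threshold:
    warn: float      # fraction of max_connections that triggers a warning
    critical: float  # fraction that should page someone


# Hypothetical defaults; tune them against your own baseline.
CONNECTION_THRESHOLD = Threshold(warn=0.70, critical=0.90)


def classify_connections(used: int, max_conn: int,
                         threshold: Threshold = CONNECTION_THRESHOLD) -> str:
    """Map the query result to 'ok', 'warn', or 'critical'."""
    if max_conn <= 0:
        raise ValueError("max_conn must be positive")
    ratio = used / max_conn
    if ratio >= threshold.critical:
        return "critical"
    if ratio >= threshold.warn:
        return "warn"
    return "ok"
```

Feeding the two columns returned by the query into classify_connections gives a status that an existing alerting pipeline could act on.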
User Management and Access Audits
Security audits frequently uncover overprivileged database accounts: application users with DDL rights, shared credentials across services, accounts that were created for a one-time migration and never rotated. This agent produces a complete user permission matrix and access control review, with specific grant and revoke statements to remediate what it finds.
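One way to picture a permission matrix that is turned into grant and revoke statements is the sketch below. The role name, table name, and the grant_statements helper are hypothetical. The generated SQL uses ordinary PostgreSQL REVOKE and GRANT syntax, and the code assumes the names come from a trusted, reviewed matrix rather than user input.

```python
# Privileges an application role may hold in this sketch (no DDL rights).
ALLOWED_PRIVILEGES = {"SELECT", "INSERT", "UPDATE", "DELETE"}


def grant_statements(matrix: dict[str, dict[str, set[str]]]) -> list[str]:
    """Turn {role: {table: {privileges}}} into REVOKE/GRANT statements.

    Each table is first reset with REVOKE ALL so that stale grants do not
    survive, then only the listed privileges are granted back.
    """
    statements = []
    for role in sorted(matrix):
        for table in sorted(matrix[role]):
            privileges = matrix[role][table]
            unknown = privileges - ALLOWED_PRIVILEGES
            if unknown:
                raise ValueError(f"unexpected privileges for {role}: {sorted(unknown)}")
            statements.append(f"REVOKE ALL ON {table} FROM {role};")
            if privileges:
                granted = ", ".join(sorted(privileges))
                statements.append(f"GRANT {granted} ON {table} TO {role};")
    return statements


# Hypothetical matrix: one read/write application role on one table.
example = {"app_rw": {"billing.invoices": {"SELECT", "INSERT"}}}
for statement in grant_statements(example):
    print(statement)
```

Keeping the matrix as data makes the access model reviewable in one place, which is the point of the audit.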
Replication Configuration
Setting up primary-replica (master-slave) or multi-primary (multi-master) replication involves configuration that varies significantly by database engine and version. Getting it wrong has consequences that don’t surface immediately. This agent handles the configuration details, monitoring setup for replication lag, and procedures for both planned switchovers and unplanned failovers.
Routine Maintenance Scheduling
Postgres tables bloat without vacuuming. MySQL InnoDB tables fragment and may need periodic OPTIMIZE TABLE runs. Indexes fragment. Statistics drift. These aren’t dramatic problems, but they compound over time and eventually cause performance degradation that’s frustrating to diagnose. This agent builds a maintenance schedule and automation scripts so these tasks happen reliably without manual intervention.
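A maintenance job for Postgres bloat can be sketched as "read dead-tuple statistics, then vacuum the tables that cross a threshold". The query below reads the standard pg_stat_user_tables view. The 20% ratio, the 1000-row floor, and the vacuum_candidates helper are illustrative assumptions. The sketch also assumes schema and table names contain no double quotes.

```python
# Live and dead tuple counts per user table, from PostgreSQL's statistics views.
DEAD_TUPLE_SQL = """
SELECT schemaname, relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables;
"""


def vacuum_candidates(rows, max_dead_ratio=0.20, min_dead_tuples=1000):
    """Return VACUUM statements for tables whose dead-tuple share is too high.

    rows: iterable of (schema, table, live_tuples, dead_tuples) tuples, in the
    column order produced by DEAD_TUPLE_SQL.
    """
    statements = []
    for schema, table, live, dead in rows:
        total = live + dead
        if total == 0 or dead < min_dead_tuples:
            continue  # empty or too small to be worth a manual vacuum
        if dead / total > max_dead_ratio:
            statements.append(f'VACUUM (ANALYZE) "{schema}"."{table}";')
    return statements
```

Autovacuum handles most of this on its own; a scheduled check like this is mainly useful for catching tables where autovacuum is falling behind.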
Real-World Examples
Scenario: Establishing Backup Strategy for a New Production Database
A backend engineer has just provisioned a Postgres 15 instance for a new SaaS product. The application is going live in two weeks. They need a backup strategy before the first real user data hits the database.
Developer: We're launching a new Postgres 15 database for a SaaS product.
We need a complete backup strategy before go-live. The database
will hold customer billing and usage data, so we need strong
durability guarantees. We're running on AWS RDS.
Agent: Given billing and usage data, I'd target an RPO of at most 1 hour
and an RTO of at most 4 hours as a starting point. Here's what I'd
configure for RDS Postgres:
1. Automated backups: Enable 7-day retention on RDS (covers point-in-time
recovery). This handles your base backup layer.
2. Supplemental pg_dump snapshots to S3 for longer retention and portability:
[Generates shell script with pg_dump, S3 upload, encryption with AWS KMS,
30/90/365-day retention tiers, SNS alerting on failure]
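3. WAL archiving: RDS archives WAL automatically once automated backups are
enabled, so confirm step 1 is in place. That gives you PITR to any point in
your retention window up to the latest restorable time (typically within
the last five minutes).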
4. Backup validation — untested backups don't exist. Schedule a weekly
restoration test to a separate RDS instance:
[Generates restoration script with data integrity verification queries,
CloudWatch alarm for test failures, Slack notification on completion]
5. Cross-region copy: For billing data, copy daily snapshots to a secondary
region. Here's the AWS CLI automation:
[Generates cross-region copy script with lifecycle policy]
Document your actual RTO in the runbook — walk through a full restoration
once before launch to validate the 4-hour target is achievable.
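The exchange above mentions 30/90/365-day retention tiers without spelling out the policy. One plausible reading, assumed here purely for illustration, is: keep every backup for 30 days, keep Sunday backups up to 90 days, and keep first-of-month backups up to 365 days. The should_keep function is a hypothetical helper expressing that rule.

```python
from datetime import date


def should_keep(backup_date: date, today: date) -> bool:
    """Decide whether a daily backup survives pruning under the assumed tiers."""
    age_days = (today - backup_date).days
    if age_days < 0:
        raise ValueError("backup_date is in the future")
    if age_days <= 30:
        return True                        # daily tier: keep everything
    if age_days <= 90:
        return backup_date.weekday() == 6  # weekly tier: keep Sundays
    if age_days <= 365:
        return backup_date.day == 1        # monthly tier: keep the 1st
    return False                           # older than a year: prune
```

If the dumps live in S3, lifecycle rules on tagged objects may be a simpler way to enforce the same tiers than a pruning script.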
Scenario: Investigating Replication Lag Under Load
A platform team notices replication lag climbing during peak hours. Their read replicas serve a reporting workload and lag is causing stale data issues.
Developer: Our Postgres read replica is showing 30-60 seconds of replication
lag during peak hours (2-4pm). It's causing problems for our
reporting queries which are reading stale data. How do we diagnose
and fix this?
Agent: This pattern — lag accumulating during peak write hours — usually points
to one of three causes: replica CPU/IO saturation, long-running queries
blocking WAL application, or network throughput. Let's diagnose first.
Run these on the replica:
[Generates monitoring queries for pg_stat_wal_receiver status, replay
delay, recovery conflicts blocking replay, replica system resource usage]
While that runs, check the primary:
[Generates queries for pg_stat_replication, WAL generation rate,
write-heavy tables, lock wait events]
Based on what you find:
If replica is CPU/IO bound: Consider upgrading instance class, or
separating the reporting workload to a dedicated replica. Here's how
to add a second replica in RDS:
[Configuration steps]
If long-running queries are blocking WAL replay: Lower
max_standby_streaming_delay and max_standby_archive_delay so conflicting
queries are cancelled sooner, and enforce statement_timeout on reporting
connections:
[Configuration values with reasoning]
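If it's write volume: Reduce WAL generation on the primary (batch writes,
drop unused indexes on write-heavy tables). Setting synchronous_commit = off
shortens commit latency but does not reduce WAL volume, and it trades away
durability, so discuss that with your team before reaching for it.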
For alerting going forward, set a CloudWatch alarm on ReplicaLag > 10
seconds so you catch this before it hits 60. Here's the alarm config:
[Generates CloudWatch alarm JSON]
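The alarm config referenced at the end of that exchange might look roughly like the following. The field names follow CloudWatch's PutMetricAlarm parameters and the AWS/RDS ReplicaLag metric, which is reported in seconds. The alarm name, the instance identifier, the 60-second period, and the three evaluation periods are assumptions chosen for the example.

```python
import json


def replica_lag_alarm(instance_id: str, threshold_seconds: float = 10.0) -> dict:
    """Build a CloudWatch alarm definition for RDS replica lag."""
    return {
        "AlarmName": f"{instance_id}-replica-lag",  # hypothetical naming scheme
        "Namespace": "AWS/RDS",
        "MetricName": "ReplicaLag",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,                # seconds per datapoint
        "EvaluationPeriods": 3,      # alarm after three consecutive breaches
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # a silent replica is treated as a problem
    }


if __name__ == "__main__":
    # "reporting-replica-1" is a placeholder instance identifier.
    print(json.dumps(replica_lag_alarm("reporting-replica-1"), indent=2))
```

The resulting dictionary could be passed as keyword arguments to boto3's put_metric_alarm, with AlarmActions added to route notifications to an SNS topic.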
What Makes This Agent Powerful
Operational Mindset, Not Just Syntax
This agent is explicitly oriented around reliability and operational excellence — not just getting something working. Its approach encodes hard-won DBA wisdom: test your backups, monitor replication lag, document procedures for 3am, plan capacity before you need it. The outputs reflect this mindset. You won’t get a backup script that doesn’t include restoration testing.
Complete, Runnable Artifacts
The agent produces shell scripts, SQL queries, configuration files, and runbooks — not explanations of what those things should contain. The output is immediately usable. Backup scripts include retention policies and failure alerting. Monitoring queries include threshold recommendations. Permission matrices include the specific SQL to implement them.
Both Manual and Automated Recovery
The agent explicitly covers both automated and manual recovery procedures. Automation handles the normal case. The manual runbook handles the case where the automation has failed or the situation is unusual. Both matter for disaster recovery, and this agent produces both.
Least-Privilege Security Model
User management output defaults to least privilege — application users get exactly the permissions their queries require, not broad grants that accumulate over time. The permission matrix makes the access model auditable and reviewable, not buried in ad-hoc grant statements across migration files.
RTO/RPO Driven Planning
Disaster recovery procedures include explicit RTO and RPO targets. This forces the conversation about what “acceptable” recovery actually means for your system before an incident makes it urgent. It also gives you a basis for validating your backup and recovery configuration against real requirements.
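Validating a configuration against RTO and RPO targets comes down to two comparisons: the worst-case gap between recoverable points must not exceed the RPO, and the measured restore time must not exceed the RTO. The sketch below makes that explicit. The RecoveryPlan fields and the recovery_gaps helper are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPlan:
    rpo_minutes: int               # maximum tolerable data loss
    rto_minutes: int               # maximum tolerable downtime
    recovery_point_gap_minutes: int  # worst-case gap between recoverable points
    measured_restore_minutes: int  # duration of the most recent restore drill


def recovery_gaps(plan: RecoveryPlan) -> list[str]:
    """List the ways the current setup misses its stated targets."""
    problems = []
    if plan.recovery_point_gap_minutes > plan.rpo_minutes:
        problems.append(
            f"RPO missed: up to {plan.recovery_point_gap_minutes} min of data "
            f"could be lost, target is {plan.rpo_minutes} min"
        )
    if plan.measured_restore_minutes > plan.rto_minutes:
        problems.append(
            f"RTO missed: last restore took {plan.measured_restore_minutes} min, "
            f"target is {plan.rto_minutes} min"
        )
    return problems
```

The measured restore time only means something if it comes from an actual drill, which is another reason to rehearse the runbook.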
How to Install
Installing the Database Admin agent takes about sixty seconds. In your project root, create the directory and file:
.claude/agents/database-admin.md
Paste the following system prompt as the file contents:
---
name: database-admin
description: Database administration specialist for operations, backups, replication, and monitoring. Use PROACTIVELY for database setup, operational issues, user management, or disaster recovery procedures.
---
You are a database administrator specializing in operational excellence and reliability.
## Focus Areas
- Backup strategies and disaster recovery
- Replication setup (master-slave, multi-master)
- User management and access control
- Performance monitoring and alerting
- Database maintenance (vacuum, analyze, optimize)
- High availability and failover procedures
## Approach
1. Automate routine maintenance tasks
2. Test backups regularly - untested backups don't exist
3. Monitor key metrics (connections, locks, replication lag)
4. Document procedures for 3am emergencies
5. Plan capacity before hitting limits
## Output
- Backup scripts with retention policies
- Replication configuration and monitoring
- User permission matrix with least privilege
- Monitoring queries and alert thresholds
- Maintenance schedule and automation
- Disaster recovery runbook with RTO/RPO
Include connection pooling setup. Show both automated and manual recovery steps.
Claude Code automatically discovers and loads agents from the .claude/agents/ directory. No registration step, no configuration file to update. The agent is available immediately in your next Claude Code session. You can also commit this file to your repository so your entire team has access to the same agent configuration.
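Before committing the file, a quick sanity check can confirm the frontmatter survived the copy and paste. The sketch below assumes only what the template above shows: a block delimited by --- lines that carries name and description entries. The parse_frontmatter and missing_fields helpers are hypothetical and deliberately simpler than a real YAML parser.

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Read simple 'key: value' pairs from a leading '---' delimited block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file must start with a '---' frontmatter line")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    raise ValueError("frontmatter is not closed with '---'")


def missing_fields(text: str, required=("name", "description")) -> list[str]:
    """Return the required frontmatter keys that are absent or empty."""
    fields = parse_frontmatter(text)
    return [key for key in required if not fields.get(key)]
```

Calling missing_fields on the contents of .claude/agents/database-admin.md should return an empty list if the template was pasted intact.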
Conclusion and Next Steps
The Database Admin agent is most valuable when used before you need it, not during an incident. Here’s a practical sequence for getting value from it immediately:
- Install the agent today. The file creation takes sixty seconds.
- Audit your current backup strategy. Describe your existing setup to the agent and ask for gaps. Most teams find at least one.
- Request a disaster recovery runbook for your most critical database. Walk through it with your team as a tabletop exercise.
- Generate monitoring queries for replication lag, connection counts, and lock contention, and get them into your existing observability stack.
- Run a user permission audit. Ask the agent to help you review current grants against least-privilege requirements.
The goal isn’t to replace operational expertise — it’s to make that expertise accessible at the moment you need it, with outputs you can put directly into production. For teams without a dedicated DBA, this agent makes reliable database operations achievable. For teams that do have DBA expertise, it accelerates the documentation and automation work that always gets deprioritized until an incident makes it urgent.
Agent template sourced from the claude-code-templates open source project (MIT License).
