On-Call Skills for Claude Code
It's 3am. Your monitoring fires an alert. Traditionally, a human wakes up, reads the alert, checks logs, forms a hypothesis, and starts debugging. With on-call skills, Claude Code can do much of that first-pass work: triaging the alert, gathering diagnostics, and presenting a clear picture to the human who makes the final call.
What on-call skills do
An on-call skill is a Claude Code skill designed for incident response. When triggered (by an alert, a webhook, or a human), it:
- Reads the alert details and identifies the affected service
- Checks relevant logs, metrics, and recent deployments
- Identifies potential root causes
- Suggests remediation steps
- Notifies the on-call engineer with a structured summary
Triage automation
Most on-call alerts fall into known categories. Claude Code can classify and respond to common patterns:
- High error rate — Check recent deployments, review error logs, identify the failing endpoint or function.
- Elevated latency — Check database query times, external API response times, and resource utilisation.
- Service down — Check deployment status, container health, and dependency availability.
- Disk/memory pressure — Identify the consuming process, check for log rotation issues, suggest cleanup.
For each category, the skill runs a diagnostic playbook and produces a structured report.
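As a rough illustration of the classification step, the sketch below maps an alert to one of the four categories above and returns the matching playbook. The `Alert` fields, the keyword lists, and the category names are assumptions made for this example; a real setup would key off whatever labels your monitoring tool attaches to alerts.

```python
from dataclasses import dataclass

# Playbook steps taken from the categories listed above.
PLAYBOOKS = {
    "high_error_rate": [
        "check recent deployments",
        "review error logs",
        "identify the failing endpoint or function",
    ],
    "elevated_latency": [
        "check database query times",
        "check external API response times",
        "check resource utilisation",
    ],
    "service_down": [
        "check deployment status",
        "check container health",
        "check dependency availability",
    ],
    "resource_pressure": [
        "identify the consuming process",
        "check for log rotation issues",
        "suggest cleanup",
    ],
}

# Assumed keyword mapping from metric names to categories (hypothetical).
KEYWORDS = {
    "high_error_rate": ("error", "5xx", "exception"),
    "elevated_latency": ("latency", "duration", "response_time"),
    "service_down": ("uptime", "health", "availability"),
    "resource_pressure": ("disk", "memory", "oom"),
}


@dataclass
class Alert:
    service: str
    severity: str
    metric: str
    threshold: float


def classify(alert: Alert) -> str:
    """Return the first category whose keywords appear in the metric name."""
    metric = alert.metric.lower()
    for category, words in KEYWORDS.items():
        if any(word in metric for word in words):
            return category
    return "unknown"


def playbook_for(alert: Alert) -> list[str]:
    """Look up the diagnostic steps, falling back to a human for unknown alerts."""
    return PLAYBOOKS.get(classify(alert), ["escalate to a human: no known playbook"])
```

Keyword matching is deliberately simple here. Alerts that match no category fall through to a human rather than being forced into the nearest playbook.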
Building an on-call skill
Structure
# On-Call Triage
When triggered with an alert:
1. Parse the alert: service name, severity, metric, threshold.
2. Check recent deployments: `git log --since="6 hours ago" --oneline`
3. Check application logs for errors in the last 30 minutes.
4. Check resource utilisation (CPU, memory, disk).
5. If a recent deployment exists, compare the deployment time with the alert start time.
6. Produce a triage report with:
- Alert summary
- Probable causes (ranked by likelihood)
- Suggested immediate actions
- Whether rollback is recommended
7. Send the report to the #incidents channel via webhook.
Access requirements
On-call skills need read access to:
- Application logs (Cloud Logging, ELK, etc.)
- Deployment history (git log, CI/CD platform)
- Monitoring data (metrics, dashboards)
- Service configuration (environment variables, feature flags)
They should not have write access to production systems. The skill's job is diagnosis and recommendation, not remediation. The human decides whether to act on the recommendation.
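Read-only access is best enforced through credentials and permissions on the systems themselves. As a secondary check, a wrapper around the skill could also refuse commands that are not on a read-only allowlist. The sketch below is one hypothetical way to do that; the allowlist entries and the `is_read_only` helper are assumptions for illustration, not a security boundary on their own.

```python
import shlex

# Hypothetical allowlist of read-only command prefixes; adapt to your tooling.
READ_ONLY_PREFIXES = [
    ("git", "log"),
    ("git", "show"),
    ("kubectl", "get"),
    ("kubectl", "describe"),
    ("kubectl", "logs"),
    ("df",),
    ("free",),
]

# Shell metacharacters that could chain, substitute, or redirect commands.
FORBIDDEN_CHARS = set(";|&><$`\n")


def is_read_only(command: str) -> bool:
    """Return True only if the command starts with an allowlisted prefix
    and contains no shell metacharacters."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return any(tuple(tokens[: len(prefix)]) == prefix for prefix in READ_ONLY_PREFIXES)
```

For example, `is_read_only('git log --since="6 hours ago" --oneline')` passes, while `kubectl delete pod web-1` and `git log && rm -rf /tmp/x` are rejected.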
Alert-triggered execution
For fully automated triage, connect your alerting system to Claude Code:
- Your monitoring tool (PagerDuty, OpsGenie, Cloud Monitoring) fires a webhook
- A lightweight service receives the webhook and starts a Claude Code session with the on-call skill
- Claude Code runs the diagnostic playbook
- The triage report is sent to your team's communication channel and the on-call engineer
The on-call engineer wakes up to a diagnosis, not a raw alert. Because the first diagnostic steps are already done when they start, this can shorten mean time to resolution (MTTR).
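The "lightweight service" in the flow above could look roughly like the following sketch, which uses only the Python standard library. It assumes the `claude` CLI is installed on the host and that `-p` runs a single prompt non-interactively; the payload shape, the port, and the prompt wording are hypothetical, so check them against your monitoring tool and your Claude Code version.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_prompt(alert: dict) -> str:
    """Turn the webhook payload into a prompt that asks for the on-call skill."""
    return (
        "Use the on-call triage skill for this alert and produce a triage report.\n"
        + json.dumps(alert, indent=2, sort_keys=True)
    )


def build_command(alert: dict) -> list[str]:
    # Assumption: `claude -p <prompt>` runs one prompt without an interactive session.
    return ["claude", "-p", build_prompt(alert)]


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            alert = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        # Start triage in the background so the webhook call returns quickly.
        # A production service would also track and reap this process.
        subprocess.Popen(build_command(alert))
        self.send_response(202)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Listen for alert webhooks until interrupted."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the listener. Passing the command as a list (rather than a shell string) keeps alert contents from being interpreted by a shell.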
Escalation patterns
- Severity-based — Low-severity alerts get automated triage only. High-severity alerts trigger both automated triage and immediate human notification.
- Confidence-based — If the on-call skill identifies a clear root cause with high confidence, it suggests a specific fix. If it's uncertain, it escalates to a human with all the data it collected.
- Time-based — If the issue isn't acknowledged within 15 minutes, escalate to the next on-call tier with the full diagnostic report.
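The three patterns can be combined in a single decision function. The following is a minimal sketch: the `TriageResult` fields, the 0.8 confidence cut-off, and the action names are assumptions for illustration, while the 15-minute timeout comes from the time-based pattern above.

```python
from dataclasses import dataclass


@dataclass
class TriageResult:
    severity: str                  # "low" or "high"
    confidence: float              # 0.0 to 1.0, as reported by the skill
    minutes_unacknowledged: float  # time since the alert fired without an ack


CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off for "high confidence"
ACK_TIMEOUT_MINUTES = 15    # from the time-based pattern


def escalation_actions(result: TriageResult) -> list[str]:
    """Return the actions to take, in order, for one triage result."""
    actions = ["post_triage_report"]  # every alert gets automated triage
    if result.severity == "high":
        actions.append("notify_on_call_now")
    if result.confidence >= CONFIDENCE_THRESHOLD:
        actions.append("suggest_specific_fix")
    else:
        actions.append("escalate_with_collected_data")
    if result.minutes_unacknowledged > ACK_TIMEOUT_MINUTES:
        actions.append("escalate_to_next_tier")
    return actions
```

Keeping the rules in one pure function makes them easy to review and to test against past incidents.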
Post-incident analysis
After an incident is resolved, Claude Code can help with the post-mortem:
- Compile a timeline from logs, alerts, and team communications
- Identify contributing factors and the root cause
- Draft action items for preventing recurrence
- Update the on-call skill with new diagnostic patterns learned from the incident
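Compiling the timeline is mostly a matter of merging events from several sources and sorting them by time. The sketch below shows one way to do it; the `Event` type and the source labels are hypothetical, and it assumes all timestamps share the same timezone handling (mixing naive and timezone-aware datetimes would raise an error when sorting).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    at: datetime
    source: str   # e.g. "logs", "alerts", or "chat"
    message: str


def build_timeline(*streams: list[Event]) -> list[Event]:
    """Merge any number of event lists into one list ordered by timestamp."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda event: event.at)


def render(timeline: list[Event]) -> str:
    """Format the timeline as one line per event for the post-mortem draft."""
    return "\n".join(
        f"{event.at.isoformat()} [{event.source}] {event.message}"
        for event in timeline
    )
```

Python's sort is stable, so events with identical timestamps keep the order in which their sources were passed in.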
Safety considerations
- Read-only by default — On-call skills should diagnose, not fix. Automated remediation (like rollbacks) should require explicit human approval.
- Rate limiting — Prevent alert storms from spawning dozens of simultaneous Claude Code sessions.
- Cost caps — Set per-session token limits for automated triage. A runaway diagnostic session shouldn't consume your entire API budget.
- Fallback — If Claude Code can't complete the triage (API outage, rate limit, timeout), the alert must still reach a human through traditional channels.
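The rate-limiting and fallback points can be sketched together: a sliding-window limiter caps how many triage sessions start in a given period, and any alert that is rate limited or whose triage fails is paged to a human instead. The class and function names below are hypothetical, and `run_triage` and `page_human` stand in for your own integration code.

```python
import time
from collections import deque


class SessionLimiter:
    """Allow at most `max_sessions` triage sessions per sliding window."""

    def __init__(self, max_sessions: int, window_seconds: float, clock=time.monotonic):
        self.max_sessions = max_sessions
        self.window_seconds = window_seconds
        self.clock = clock
        self.started = deque()  # start times of recent sessions

    def try_acquire(self) -> bool:
        now = self.clock()
        # Drop session starts that have aged out of the window.
        while self.started and now - self.started[0] >= self.window_seconds:
            self.started.popleft()
        if len(self.started) >= self.max_sessions:
            return False
        self.started.append(now)
        return True


def handle_alert(alert, limiter, run_triage, page_human):
    """Run automated triage when allowed; otherwise page a human directly.

    Returns the triage report, or None if the alert went to the fallback path.
    """
    if not limiter.try_acquire():
        page_human(alert, reason="rate limited")
        return None
    try:
        return run_triage(alert)
    except Exception as exc:  # API outage, rate limit, timeout, and so on
        page_human(alert, reason=f"triage failed: {exc}")
        return None
```

The key design choice is that every path which does not produce a report ends in `page_human`, so an alert is never silently dropped.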
Next steps
On-call skills work best with centralised logging (the data source for diagnostics), shared communication (delivering triage reports), and guardrails (preventing automated agents from taking production actions).