On-Call Skills for Claude Code

It's 3am. Your monitoring fires an alert. Traditionally, a human wakes up, reads the alert, checks logs, forms a hypothesis, and starts debugging. With on-call skills, Claude Code can take on the early stages of that work: triaging the alert, gathering diagnostics, and presenting a clear picture to the human who makes the final call.

What on-call skills do

An on-call skill is a Claude Code skill designed for incident response. When triggered (by an alert, a webhook, or a human), it:

Reads the alert details and identifies the affected service
Checks relevant logs, metrics, and recent deployments
Identifies potential root causes
Suggests remediation steps
Notifies the on-call engineer with a structured summary

Triage automation

Most on-call alerts fall into known categories. Claude Code can classify and respond to common patterns:

High error rate — Check recent deployments, review error logs, identify the failing endpoint or function.
Elevated latency — Check database query times, external API response times, and resource utilisation.
Service down — Check deployment status, container health, and dependency availability.
Disk/memory pressure — Identify the consuming process, check for log rotation issues, suggest cleanup.

For each category, the skill runs a diagnostic playbook and produces a structured report.
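
A minimal sketch of how alerts could be mapped to playbooks, assuming alerts arrive as free text. The category keys and keyword lists are hypothetical; in practice you would more likely classify on structured fields such as the metric name or alert labels.

```python
# Playbook steps are taken from the categories above.
# Category keys and keyword rules are illustrative assumptions.
PLAYBOOKS = {
    "high_error_rate": [
        "Check recent deployments",
        "Review error logs",
        "Identify the failing endpoint or function",
    ],
    "elevated_latency": [
        "Check database query times",
        "Check external API response times",
        "Check resource utilisation",
    ],
    "service_down": [
        "Check deployment status",
        "Check container health",
        "Check dependency availability",
    ],
    "resource_pressure": [
        "Identify the process consuming disk or memory",
        "Check for log rotation issues",
        "Suggest cleanup",
    ],
}

KEYWORDS = {
    "high_error_rate": ("error rate", "5xx", "exceptions"),
    "elevated_latency": ("latency", "p99", "slow"),
    "service_down": ("down", "unreachable", "health check"),
    "resource_pressure": ("disk", "memory", "oom"),
}

UNKNOWN_PLAYBOOK = ["Escalate to a human: no known playbook matches"]


def classify(alert_text: str) -> str:
    """Return the first category whose keywords appear in the alert text."""
    text = alert_text.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "unknown"


def playbook_for(alert_text: str) -> list:
    """Return the diagnostic steps for an alert, or an escalation step."""
    return PLAYBOOKS.get(classify(alert_text), UNKNOWN_PLAYBOOK)
```

Alerts that match no category fall through to an explicit escalation step rather than a guess, which keeps unfamiliar failures in front of a human.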

Building an on-call skill

Structure

# On-Call Triage

When triggered with an alert:

1. Parse the alert: service name, severity, metric, threshold.
2. Check recent deployments: `git log --since="6 hours ago" --oneline`
3. Check application logs for errors in the last 30 minutes.
4. Check resource utilisation (CPU, memory, disk).
5. If a recent deployment exists, compare the deployment time with the alert start time.
6. Produce a triage report with:
   - Alert summary
   - Probable causes (ranked by likelihood)
   - Suggested immediate actions
   - Whether rollback is recommended
7. Send the report to the #incidents channel via webhook.
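
Step 5 of the skill above can be sketched as a simple time comparison. This is an illustration only: the 60-minute window and the function names are assumptions, and the timestamps are assumed to be timezone-aware ISO 8601 strings such as those printed by `git log --format=%cI`.

```python
from datetime import datetime, timedelta, timezone

# Assumed correlation window: how long after a deployment an alert is still
# treated as plausibly related. Tune this per service.
SUSPECT_WINDOW = timedelta(minutes=60)


def parse_commit_time(iso_timestamp: str) -> datetime:
    """Parse a strict ISO 8601 timestamp into an aware datetime."""
    # Python versions before 3.11 do not accept a trailing "Z" for UTC.
    return datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00"))


def deployment_suspect(deployed_at: datetime,
                       alert_started_at: datetime,
                       window: timedelta = SUSPECT_WINDOW) -> bool:
    """True when the alert started within `window` after the deployment."""
    delta = alert_started_at - deployed_at
    return timedelta(0) <= delta <= window


deployed = parse_commit_time("2024-05-01T02:40:00+00:00")
alert_started = datetime(2024, 5, 1, 3, 0, tzinfo=timezone.utc)
rollback_worth_considering = deployment_suspect(deployed, alert_started)
```

A deployment that landed shortly before the alert is only a correlation. The report should present it as a probable cause for the human to confirm, not as a verdict.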

Access requirements

On-call skills need read access to:

Application logs (Cloud Logging, ELK, etc.)
Deployment history (git log, CI/CD platform)
Monitoring data (metrics, dashboards)
Service configuration (environment variables, feature flags)

They should not have write access to production systems. Their job is diagnosis and recommendation, not remediation. The human decides whether to act on the recommendation.
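
As one extra layer, a wrapper could refuse any command that is not on a read-only allowlist. The sketch below assumes the skill shells out to `git` and `kubectl`; the allowlist contents are hypothetical. A string check like this is not a security boundary on its own, so the real guarantee should come from read-only credentials and IAM roles.

```python
import shlex

# Hypothetical allowlist of (program, subcommand) pairs considered read-only.
READ_ONLY_COMMANDS = {
    ("git", "log"),
    ("git", "show"),
    ("kubectl", "get"),
    ("kubectl", "logs"),
    ("kubectl", "describe"),
}

# Characters that could chain or redirect commands if passed to a shell.
SHELL_METACHARACTERS = set(";|&$><`\n")


def is_read_only(command: str) -> bool:
    """Return True only for allowlisted commands with no shell metacharacters."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        parts = shlex.split(command)
    except ValueError:
        # Unbalanced quotes or similar: reject rather than guess.
        return False
    return len(parts) >= 2 and (parts[0], parts[1]) in READ_ONLY_COMMANDS
```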

Alert-triggered execution

For fully automated triage, connect your alerting system to Claude Code:

Your monitoring tool (PagerDuty, OpsGenie, Cloud Monitoring) fires a webhook
A lightweight service receives the webhook and starts a Claude Code session with the on-call skill
Claude Code runs the diagnostic playbook
The triage report is sent to your team's communication channel and the on-call engineer
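
The second step, the lightweight service that starts a session, might look roughly like the sketch below. It assumes the `claude` CLI is installed on the host and run in non-interactive print mode (`-p`) with `--output-format json`; check both flags against your installed version. The prompt wording, the function names, and the five-minute timeout are illustrative assumptions.

```python
import json
import subprocess
from typing import Optional


def build_triage_command(alert: dict) -> list:
    """Build the argument list for a non-interactive Claude Code run."""
    prompt = (
        "Use the on-call triage skill to triage this alert and produce "
        "a triage report.\n"
        f"Alert payload:\n{json.dumps(alert, indent=2, sort_keys=True)}"
    )
    # Passing a list (not a shell string) avoids shell interpolation of the
    # alert payload, which is untrusted input.
    return ["claude", "-p", prompt, "--output-format", "json"]


def run_triage(alert: dict, timeout_seconds: int = 300) -> Optional[str]:
    """Return the session's stdout, or None if triage could not complete."""
    try:
        result = subprocess.run(
            build_triage_command(alert),
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        # Timeout, non-zero exit, or missing binary: the caller should fall
        # back to the normal paging path.
        return None
    return result.stdout
```

Returning `None` on any failure lets the webhook handler tell "no report" apart from an empty one and still forward the raw alert.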

The on-call engineer wakes up to a diagnosis, not a raw alert. Because the first diagnostic steps are already done by the time a human looks at the incident, this can shorten mean time to resolution (MTTR).

Escalation patterns

Severity-based — Low-severity alerts get automated triage only. High-severity alerts trigger both automated triage and immediate human notification.
Confidence-based — If the on-call skill identifies a clear root cause with high confidence, it suggests a specific fix. If it's uncertain, it escalates to a human with all the data it collected.
Time-based — If the issue isn't acknowledged within 15 minutes, escalate to the next on-call tier with the full diagnostic report.
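
The three patterns can be combined into one decision function. In this sketch the 15-minute acknowledgement timeout comes from the text above; the 0.8 confidence threshold, the action names, and the `TriageState` fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta

ACK_TIMEOUT = timedelta(minutes=15)   # from the time-based rule above
CONFIDENCE_THRESHOLD = 0.8            # assumed cut-off for "high confidence"


@dataclass
class TriageState:
    severity: str          # "low" or "high"
    confidence: float      # 0.0 to 1.0, as reported by the triage run
    acknowledged: bool     # has a human acknowledged the alert?
    elapsed: timedelta     # time since the alert fired


def escalation_actions(state: TriageState) -> list:
    """Return the ordered list of actions for the current state."""
    actions = ["automated_triage"]
    # Severity-based: high severity also notifies a human immediately.
    if state.severity == "high":
        actions.append("notify_on_call")
    # Confidence-based: suggest a fix only when the cause looks clear.
    if state.confidence >= CONFIDENCE_THRESHOLD:
        actions.append("suggest_specific_fix")
    else:
        actions.append("escalate_with_collected_data")
    # Time-based: unacknowledged past the timeout goes to the next tier.
    if not state.acknowledged and state.elapsed >= ACK_TIMEOUT:
        actions.append("escalate_to_next_tier")
    return actions
```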

Post-incident analysis

After an incident is resolved, Claude Code can help with the post-mortem:

Compile a timeline from logs, alerts, and team communications
Identify contributing factors and the root cause
Draft action items for preventing recurrence
Update the on-call skill with new diagnostic patterns learned from the incident
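
The timeline step is largely a merge of timestamped events from several sources. A small sketch, assuming each source has already been reduced to `(timestamp, origin, text)` tuples with timezone-aware timestamps; the tuple shape and the sample events are invented for illustration.

```python
from datetime import datetime, timezone


def build_timeline(*sources) -> list:
    """Merge event lists and return formatted lines in chronological order."""
    events = [event for source in sources for event in source]
    events.sort(key=lambda event: event[0])  # stable sort by timestamp
    return [f"{ts.isoformat()} [{origin}] {text}" for ts, origin, text in events]


def utc(hour: int, minute: int) -> datetime:
    """Helper for the sample data below: a UTC time on one example day."""
    return datetime(2024, 5, 1, hour, minute, tzinfo=timezone.utc)


logs = [(utc(3, 2), "logs", "Error spike on checkout endpoint")]
alerts = [(utc(3, 0), "alert", "High error rate fired")]
chat = [(utc(3, 10), "chat", "Rollback started")]

timeline = build_timeline(logs, alerts, chat)
```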

Safety considerations

Read-only by default — On-call skills should diagnose, not fix. Automated remediation (like rollbacks) should require explicit human approval.
Rate limiting — Prevent alert storms from spawning dozens of simultaneous Claude Code sessions.
Cost caps — Set per-session token limits for automated triage. A runaway diagnostic session shouldn't consume your entire API budget.
Fallback — If Claude Code can't complete the triage (API outage, rate limit, timeout), the alert must still reach a human through traditional channels.
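
The rate-limiting and fallback points can be sketched together. The class below deduplicates repeated alerts and caps concurrent sessions, and the handler always hands the alert to the human paging path whether or not a report exists. All names, the limit of three sessions, and the five-minute dedup window are assumptions; `run_triage` and `page_human` stand in for your own functions.

```python
import time


class TriageLimiter:
    """Cap concurrent triage sessions and drop rapid duplicate alerts."""

    def __init__(self, max_concurrent=3, dedup_window_seconds=300.0,
                 clock=time.monotonic):
        self._max_concurrent = max_concurrent
        self._dedup_window = dedup_window_seconds
        self._clock = clock
        self._active = 0
        self._last_seen = {}

    def try_start(self, alert_key: str) -> bool:
        """Reserve a session slot; False means skip automated triage."""
        now = self._clock()
        last = self._last_seen.get(alert_key)
        if last is not None and now - last < self._dedup_window:
            return False  # same alert triaged recently
        if self._active >= self._max_concurrent:
            return False  # alert storm: too many sessions already running
        self._last_seen[alert_key] = now
        self._active += 1
        return True

    def finish(self) -> None:
        """Release a session slot."""
        self._active = max(0, self._active - 1)


def handle_alert(alert_key, limiter, run_triage, page_human):
    """Run triage if allowed, then always pass the alert on to a human."""
    report = None
    if limiter.try_start(alert_key):
        try:
            report = run_triage(alert_key)
        except Exception:
            report = None  # API outage, rate limit, timeout, and so on
        finally:
            limiter.finish()
    # Fallback: the alert reaches the paging path with or without a report.
    page_human(alert_key, report)
    return report
```

This sketch is single-threaded. A service handling webhooks concurrently would need a lock around the counters, or a shared store if it runs on several hosts.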

Next steps

On-call skills work best with centralised logging (the data source for diagnostics), shared communication (delivering triage reports), and guardrails (preventing automated agents from taking production actions).