On-Call Skills for Claude Code

It's 3am. Your monitoring fires an alert. Traditionally, a human wakes up, reads the alert, checks logs, forms a hypothesis, and starts debugging. With on-call skills, Claude Code does the first 80% of that work — triaging the alert, gathering diagnostics, and presenting a clear picture to the human who makes the final call.

What on-call skills do

An on-call skill is a Claude Code skill designed for incident response. When triggered (by an alert, a webhook, or a human), it:

  1. Reads the alert details and identifies the affected service
  2. Checks relevant logs, metrics, and recent deployments
  3. Identifies potential root causes
  4. Suggests remediation steps
  5. Notifies the on-call engineer with a structured summary

Triage automation

Most on-call alerts fall into known categories. Claude Code can classify and respond to common patterns:

  • High error rate — Check recent deployments, review error logs, identify the failing endpoint or function.
  • Elevated latency — Check database query times, external API response times, and resource utilisation.
  • Service down — Check deployment status, container health, and dependency availability.
  • Disk/memory pressure — Identify the consuming process, check for log rotation issues, suggest cleanup.

For each category, the skill runs a diagnostic playbook and produces a structured report.

Building an on-call skill

Structure

# On-Call Triage

When triggered with an alert:

1. Parse the alert: service name, severity, metric, threshold.
2. Check recent deployments: `git log --since="6 hours ago" --oneline`
3. Check application logs for errors in the last 30 minutes.
4. Check resource utilisation (CPU, memory, disk).
5. If a recent deployment exists, compare the deployment time with the alert start time.
6. Produce a triage report with:
   - Alert summary
   - Probable cause (ranked by likelihood)
   - Suggested immediate actions
   - Whether rollback is recommended
7. Send the report to the #incidents channel via webhook.

Access requirements

On-call skills need read access to:

  • Application logs (Cloud Logging, ELK, etc.)
  • Deployment history (git log, CI/CD platform)
  • Monitoring data (metrics, dashboards)
  • Service configuration (environment variables, feature flags)

They should not have write access to production systems. Diagnosis and recommendation, not remediation. The human decides whether to act on the recommendation.

Alert-triggered execution

For fully automated triage, connect your alerting system to Claude Code:

  1. Your monitoring tool (PagerDuty, OpsGenie, Cloud Monitoring) fires a webhook
  2. A lightweight service receives the webhook and starts a Claude Code session with the on-call skill
  3. Claude Code runs the diagnostic playbook
  4. The triage report is sent to your team's communication channel and the on-call engineer

The on-call engineer wakes up to a diagnosis, not a raw alert. This dramatically reduces mean time to resolution (MTTR).

Escalation patterns

  • Severity-based — Low-severity alerts get automated triage only. High-severity alerts trigger both automated triage and immediate human notification.
  • Confidence-based — If the on-call skill identifies a clear root cause with high confidence, it suggests a specific fix. If it's uncertain, it escalates to a human with all the data it collected.
  • Time-based — If the issue isn't acknowledged within 15 minutes, escalate to the next on-call tier with the full diagnostic report.

Post-incident analysis

After an incident is resolved, Claude Code can help with the post-mortem:

  • Compile a timeline from logs, alerts, and team communications
  • Identify contributing factors and the root cause
  • Draft action items for preventing recurrence
  • Update the on-call skill with new diagnostic patterns learned from the incident

Safety considerations

  • Read-only by default — On-call skills should diagnose, not fix. Automated remediation (like rollbacks) should require explicit human approval.
  • Rate limiting — Prevent alert storms from spawning dozens of simultaneous Claude Code sessions.
  • Cost caps — Set per-session token limits for automated triage. A runaway diagnostic session shouldn't consume your entire API budget.
  • Fallback — If Claude Code can't complete the triage (API outage, rate limit, timeout), the alert must still reach a human through traditional channels.

Next steps

On-call skills work best with centralised logging (the data source for diagnostics), shared communication (delivering triage reports), and guardrails (preventing automated agents from taking production actions).