On-Call Management with GitScrum
On-call shouldn't burn out your team. GitScrum helps manage on-call tasks, track incidents, and ensure fair rotation of responsibilities.
On-Call Fundamentals
Why On-Call Exists
ON-CALL PURPOSE AND GOALS:
┌─────────────────────────────────────────────────────────────┐
│ │
│ PURPOSE: │
│ Ensure someone is always available to respond to │
│ production issues that affect users. │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ HEALTHY ON-CALL: │
│ │
│ ✅ Fair rotation among team │
│ ✅ Clear expectations and runbooks │
│ ✅ Appropriate compensation │
│ ✅ Low false-alarm rate │
│ ✅ Sustainable workload │
│ ✅ Learning from incidents │
│ │
│ UNHEALTHY ON-CALL: │
│ │
│ ❌ Same people always on-call │
│ ❌ Constant false alarms (alert fatigue) │
│ ❌ No documentation or runbooks │
│ ❌ Expected to fix everything alone │
│ ❌ No time compensation │
│ ❌ Burnout and attrition │
│ │
│ ON-CALL PRINCIPLE: │
│ If you build it, you run it │
│ Team owns their services end-to-end │
│ Creates accountability and better design │
└─────────────────────────────────────────────────────────────┘
Rotation Setup
Building the Schedule
ON-CALL ROTATION DESIGN:
┌─────────────────────────────────────────────────────────────┐
│ │
│ ROTATION OPTIONS: │
│ │
│ WEEKLY ROTATION: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Week 1: @alex (primary) / @maria (secondary) ││
│ │ Week 2: @maria (primary) / @jordan (secondary) ││
│ │ Week 3: @jordan (primary) / @chen (secondary) ││
│ │ Week 4: @chen (primary) / @alex (secondary) ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ Best for: Larger teams, lower page volume │
│ │
│ DAILY ROTATION: │
│ Each person on-call for 1-2 days │
│ Best for: High page volume; needs a larger pool of people │
│ │
│ FOLLOW-THE-SUN: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 00:00-08:00 UTC: APAC team ││
│ │ 08:00-16:00 UTC: EMEA team ││
│ │ 16:00-00:00 UTC: Americas team ││
│ └─────────────────────────────────────────────────────────┘│
│ Best for: Global teams, no overnight shifts │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ FAIRNESS RULES: │
│ │
│ • Rotate evenly (track in spreadsheet) │
│ • No back-to-back shifts without consent │
│ • Holiday/weekend on-call = extra consideration │
│ • Allow shift swaps │
│ • New team members shadow before solo │
│ • Compensate appropriately │
└─────────────────────────────────────────────────────────────┘
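The weekly rotation above follows a simple pattern: each week's secondary becomes the next week's primary. A minimal sketch of that pattern, assuming a plain list of team handles and an arbitrary start date. The function name `weekly_rotation` is hypothetical and not part of GitScrum.

```python
from datetime import date, timedelta


def weekly_rotation(members, start, weeks):
    """Return one (week_start, primary, secondary) tuple per week.

    The secondary is the next person in the list, so they become
    primary the following week, as in the weekly table above.
    """
    schedule = []
    for week in range(weeks):
        primary = members[week % len(members)]
        secondary = members[(week + 1) % len(members)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


team = ["alex", "maria", "jordan", "chen"]  # handles from the example above
schedule = weekly_rotation(team, date(2025, 1, 6), weeks=4)  # start date is an assumption

for week_start, primary, secondary in schedule:
    print(f"{week_start}: @{primary} (primary) / @{secondary} (secondary)")
```

Because the list wraps around, asking for more weeks than there are team members continues the same cycle, so everyone takes primary equally often.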
Escalation Path
ESCALATION STRUCTURE:
┌─────────────────────────────────────────────────────────────┐
│ │
│ LEVEL 1: Primary On-Call │
│ ↓ (if no response in 15 min) │
│ │
│ LEVEL 2: Secondary On-Call │
│ ↓ (if no response in 15 min) │
│ │
│ LEVEL 3: Team Lead / Manager │
│ ↓ (if SEV 1 or business impact) │
│ │
│ LEVEL 4: VP Engineering / Executive │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ ESCALATION RULES: │
│ │
│ AUTO-ESCALATE WHEN: │
│ • No acknowledgment in 15 minutes │
│ • Incident duration > 30 minutes │
│ • SEV 1 (always notify leadership) │
│ • Customer-reported issue │
│ │
│ PRIMARY SHOULD ESCALATE WHEN: │
│ • Outside their expertise │
│ • Need additional hands │
│ • Impact growing │
│ • Unable to resolve alone │
│ │
│ DON'T: │
│ • Hesitate to escalate │
│ • Try to be a hero │
│ • Wait too long │
└─────────────────────────────────────────────────────────────┘
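The auto-escalation rules above can be expressed as a small decision function. This is only a sketch under stated assumptions: the `Incident` fields and both function names are hypothetical, and a real paging tool such as PagerDuty or Opsgenie would enforce these rules through its own escalation policy settings.

```python
from dataclasses import dataclass

ACK_TIMEOUT_MIN = 15      # no acknowledgment in 15 minutes
DURATION_LIMIT_MIN = 30   # incident duration > 30 minutes

# Levels reached by acknowledgment timeout; Level 4 is reached by severity, not time.
LEVELS = ["Primary On-Call", "Secondary On-Call", "Team Lead / Manager"]


@dataclass
class Incident:
    severity: int              # 1 means SEV 1
    minutes_open: int
    acknowledged: bool
    customer_reported: bool = False


def should_auto_escalate(incident: Incident) -> bool:
    """Apply the four AUTO-ESCALATE WHEN rules listed above."""
    if incident.severity == 1 or incident.customer_reported:
        return True
    if not incident.acknowledged and incident.minutes_open >= ACK_TIMEOUT_MIN:
        return True
    return incident.minutes_open > DURATION_LIMIT_MIN


def paged_level(minutes_without_ack: int) -> str:
    """Return who is being paged after the given time without a response."""
    step = min(minutes_without_ack // ACK_TIMEOUT_MIN, len(LEVELS) - 1)
    return LEVELS[step]
```

For example, `paged_level(20)` returns `"Secondary On-Call"`, and an acknowledged SEV 3 that has been open for 10 minutes does not auto-escalate.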
On-Call Readiness
Runbooks
RUNBOOK ESSENTIALS:
┌─────────────────────────────────────────────────────────────┐
│ │
│ EVERY SERVICE NEEDS RUNBOOKS: │
│ │
│ RUNBOOK TEMPLATE: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Service: Payment API ││
│ │ Owner: Payments Team ││
│ │ On-call: payments-oncall@company.com ││
│ │ ││
│ │ ═══════════════════════════════════════════════════════ ││
│ │ ││
│ │ COMMON ALERTS: ││
│ │ ││
│ │ Alert: payment-api-error-rate-high ││
│ │ ───────────────────────────────────────── ││
│ │ What: Error rate > 5% ││
│ │ Why: Payment requests failing ││
│ │ ││
│ │ Steps: ││
│ │ 1. Check dashboard: [link] ││
│ │ 2. Check recent deploys: [link] ││
│ │ 3. Check downstream: Stripe status [link] ││
│ │ 4. Check database: connection pool [link] ││
│ │ ││
│ │ Common causes: ││
│ │ • Stripe outage → wait or failover ││
│ │ • Database full → scale or clean ││
│ │ • Bad deploy → rollback ││
│ │ ││
│ │ Rollback: kubectl rollout undo deploy/payment-api ││
│ │ ││
│ │ Escalate if: Unable to identify cause in 30 min ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ KEEP RUNBOOKS: │
│ • Up to date (review quarterly) │
│ • Accessible (not behind VPN) │
│ • Actionable (steps, not just info) │
│ • Linked in alerts │
└─────────────────────────────────────────────────────────────┘
On-Call Toolkit
ON-CALL ENGINEER NEEDS:
┌─────────────────────────────────────────────────────────────┐
│ │
│ ACCESS (Verified before rotation starts): │
│ │
│ ☐ Monitoring dashboards (Datadog, Grafana, etc.) │
│ ☐ Alerting system (PagerDuty, Opsgenie, etc.) │
│ ☐ Cloud console (AWS, GCP, Azure) │
│ ☐ Deployment tools (kubectl, CI/CD) │
│ ☐ Database access (read at minimum) │
│ ☐ Log aggregation (Splunk, ELK, etc.) │
│ ☐ Communication (Slack, incident channel) │
│ ☐ Status page (to post updates) │
│ ☐ Runbooks (wiki, Notion, etc.) │
│ │
│ DOCUMENTATION: │
│ │
│ ☐ Runbooks for all services │
│ ☐ Architecture diagrams │
│ ☐ Escalation contacts │
│ ☐ Vendor support contacts │
│ ☐ Previous incident post-mortems │
│ │
│ TOOLS: │
│ │
│ ☐ Laptop with VPN │
│ ☐ Mobile with alerting app │
│ ☐ Charger and backup battery │
│ ☐ Reliable internet (or backup hotspot) │
│ │
│ BEFORE SHIFT: │
│ • Verify access to all tools │
│ • Check pending incidents │
│ • Review recent changes/deploys │
│ • Confirm escalation contacts │
└─────────────────────────────────────────────────────────────┘
Sprint Planning with On-Call
Capacity Adjustment
ON-CALL CAPACITY IN GITSCRUM:
┌─────────────────────────────────────────────────────────────┐
│ │
│ SPRINT PLANNING CONSIDERATION: │
│ │
│ NORMAL CAPACITY: 28 points │
│ │
│ ON-CALL ADJUSTMENTS: │
│ │
│ Full week on-call: │
│ • -25% to -50% capacity depending on page volume │
│ • @jordan: 28 pts → 14-21 pts │
│ │
│ Half week on-call: │
│ • -10% to -25% capacity │
│ • @alex: 28 pts → 21-25 pts │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ SPRINT VIEW: │
│ │
│ @alex [██████████████████████░░░] 24 pts (4 pts on-call)│
│ @maria [████████████████████████░] 28 pts │
│ @jordan [██████████████░░░░░░░░░░░] 18 pts (10 pts on-call)│
│ @chen [████████████████████████░] 28 pts │
│ │
│ Total: 98 pts (vs 112 standard) │
│ │
│ ON-CALL TASKS: │
│ • Separate column or label │
│ • Don't count against velocity │
│ • Track for workload visibility │
│ │
│ SPRINT ON-CALL BUDGET: │
│ Reserve 10% of capacity for unexpected incidents │
│ If not used, pull additional work │
└─────────────────────────────────────────────────────────────┘
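The capacity figures above are straightforward percentage reductions, which the following sketch reproduces. The helper name `adjusted_capacity` is hypothetical. This is standalone arithmetic, not a GitScrum API.

```python
def adjusted_capacity(normal_points, min_reduction, max_reduction):
    """Return the (low, high) point range after an on-call reduction."""
    low = round(normal_points * (1 - max_reduction))
    high = round(normal_points * (1 - min_reduction))
    return low, high


NORMAL = 28

full_week = adjusted_capacity(NORMAL, 0.25, 0.50)  # (14, 21)
half_week = adjusted_capacity(NORMAL, 0.10, 0.25)  # (21, 25)

# Committed points per person, taken from the sprint view above.
committed = {"alex": 24, "maria": 28, "jordan": 18, "chen": 28}
sprint_total = sum(committed.values())   # 98
standard_total = NORMAL * len(committed)  # 112

print(f"Total: {sprint_total} pts (vs {standard_total} standard)")
```

Both @alex (24 pts) and @jordan (18 pts) fall inside the ranges computed for their on-call load, which is a quick sanity check before committing the sprint.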
On-Call Handoff
ROTATION HANDOFF PROCESS:
┌─────────────────────────────────────────────────────────────┐
│ │
│ BEFORE SHIFT ENDS: │
│ │
│ OUTGOING ON-CALL PROVIDES: │
│ │
│ 1. ACTIVE ISSUES │
│ • Open incidents │
│ • Ongoing problems │
│ • Things to watch │
│ │
│ 2. RECENT CHANGES │
│ • Deploys this week │
│ • Config changes │
│ • New features launched │
│ │
│ 3. UPCOMING RISKS │
│ • Planned maintenance │
│ • Large customer events │
│ • Known risky deploys │
│ │
│ 4. LEARNINGS │
│ • New runbook entries │
│ • Gotchas discovered │
│ • Helpful tips │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ HANDOFF MEETING (15 min): │
│ │
│ • Sync call or written handoff │
│ • Document in Slack/wiki │
│ • Confirm incoming has access │
│ • Transfer pager/alert routing │
│ • Incoming confirms ready │
│ │
│ GITSCRUM: │
│ Handoff note linked to on-call rotation record │
└─────────────────────────────────────────────────────────────┘
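For a written handoff, the four sections above can be captured in a small structure so that no section is skipped. This is a sketch only: the `HandoffNote` class and the sample entries are hypothetical, and the rendered text would be pasted into whatever channel the team uses for handoffs.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffNote:
    outgoing: str
    incoming: str
    active_issues: list = field(default_factory=list)
    recent_changes: list = field(default_factory=list)
    upcoming_risks: list = field(default_factory=list)
    learnings: list = field(default_factory=list)

    def render(self) -> str:
        """Render the note as plain text, writing 'None' for empty sections."""
        sections = [
            ("ACTIVE ISSUES", self.active_issues),
            ("RECENT CHANGES", self.recent_changes),
            ("UPCOMING RISKS", self.upcoming_risks),
            ("LEARNINGS", self.learnings),
        ]
        lines = [f"On-call handoff: @{self.outgoing} -> @{self.incoming}"]
        for number, (title, items) in enumerate(sections, start=1):
            lines.append(f"{number}. {title}")
            lines.extend(f"   - {item}" for item in (items or ["None"]))
        return "\n".join(lines)


# Sample entries are invented for illustration.
note = HandoffNote(
    outgoing="alex",
    incoming="maria",
    active_issues=["Elevated error rate on payment-api, still being watched"],
    upcoming_risks=["Planned database maintenance"],
)
print(note.render())
```

Printing "None" for an empty section makes it explicit that the outgoing engineer considered it, rather than forgot it.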
Sustainability
Preventing Burnout
SUSTAINABLE ON-CALL PRACTICES:
┌─────────────────────────────────────────────────────────────┐
│ │
│ METRICS TO MONITOR: │
│ │
│ Pages per week: │
│ Target: < 5 per week │
│ Current: 8 ⚠️ (investigate noisy alerts) │
│ │
│ Off-hours pages: │
│ Target: < 2 per week │
│ Current: 4 ⚠️ (prioritize automation) │
│ │
│ False positive rate: │
│ Target: < 10% │
│ Current: 25% ❌ (fix alerting) │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ REDUCING ON-CALL BURDEN: │
│ │
│ FIX NOISY ALERTS: │
│ • Review all alerts quarterly │
│ • Delete or fix low-value alerts │
│ • Tune thresholds │
│ • Add automation for common fixes │
│ │
│ IMPROVE RELIABILITY: │
│ • Fix root causes, not just symptoms │
│ • Invest in infrastructure │
│ • Better testing and canaries │
│ • Chaos engineering │
│ │
│ SUPPORT ON-CALL: │
│ • Compensate fairly │
│ • Allow recovery time after heavy shifts │
│ • Celebrate quiet weeks │
│ • Leadership does on-call too │
│ │
│ TEAM SIZE: │
│ Minimum 4-5 people for sustainable rotation │
│ Smaller teams = shifts come around too often │
└─────────────────────────────────────────────────────────────┘
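The three metrics above can be computed from a simple log of pages. A minimal sketch, assuming each page is recorded with two flags. The `Page` record and function names are hypothetical, and the sample data is constructed to match the "Current" figures shown above (8 pages, 4 off-hours, 25% false positives).

```python
from dataclasses import dataclass


@dataclass
class Page:
    off_hours: bool
    false_positive: bool


# Targets from the box above; each metric should stay strictly below its limit.
TARGETS = {"pages": 5, "off_hours": 2, "false_positive_rate": 0.10}


def week_metrics(pages):
    """Summarize one week of pages into the three monitored metrics."""
    total = len(pages)
    false_positives = sum(p.false_positive for p in pages)
    return {
        "pages": total,
        "off_hours": sum(p.off_hours for p in pages),
        "false_positive_rate": false_positives / total if total else 0.0,
    }


def over_target(metrics):
    """Return the names of metrics that miss their target."""
    return [name for name, limit in TARGETS.items() if metrics[name] >= limit]


# Sample week: 8 pages, the first 4 off-hours, the first 2 false alarms.
pages = [Page(off_hours=i < 4, false_positive=i < 2) for i in range(8)]
metrics = week_metrics(pages)
print(metrics, over_target(metrics))
```

With the sample week, all three metrics miss their targets, which is the signal to start the alert review described above.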