On-Call Management with GitScrum | Rotations & Runbooks

Run fair on-call rotations without burnout. Set escalation paths, create runbooks, adjust sprint capacity, and manage handoffs with GitScrum.


On-call shouldn't burn out your team. GitScrum helps you manage on-call tasks, track incidents, and keep the rotation of responsibilities fair.

On-Call Fundamentals

Why On-Call Exists

ON-CALL PURPOSE AND GOALS:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ PURPOSE:                                                    │
│ Ensure someone is always available to respond to          │
│ production issues that affect users.                      │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ HEALTHY ON-CALL:                                            │
│                                                             │
│ ✅ Fair rotation among team                               │
│ ✅ Clear expectations and runbooks                        │
│ ✅ Appropriate compensation                               │
│ ✅ Low false-alarm rate                                   │
│ ✅ Sustainable workload                                    │
│ ✅ Learning from incidents                                │
│                                                             │
│ UNHEALTHY ON-CALL:                                          │
│                                                             │
│ ❌ Same people always on-call                             │
│ ❌ Constant false alarms (alert fatigue)                 │
│ ❌ No documentation or runbooks                           │
│ ❌ Expected to fix everything alone                       │
│ ❌ No time compensation                                   │
│ ❌ Burnout and attrition                                  │
│                                                             │
│ ON-CALL PRINCIPLE:                                          │
│ If you build it, you run it                               │
│ Teams own their services end-to-end                       │
│ Creates accountability and better design                  │
└─────────────────────────────────────────────────────────────┘

Rotation Setup

Building the Schedule

ON-CALL ROTATION DESIGN:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ ROTATION OPTIONS:                                           │
│                                                             │
│ WEEKLY ROTATION:                                            │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Week 1: @alex (primary) / @maria (secondary)           ││
│ │ Week 2: @maria (primary) / @jordan (secondary)         ││
│ │ Week 3: @jordan (primary) / @chen (secondary)          ││
│ │ Week 4: @chen (primary) / @alex (secondary)            ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ Best for: Larger teams, lower page volume                 │
│                                                             │
│ DAILY ROTATION:                                             │
│ Each person on-call for 1-2 days                         │
│ Best for: High page volume (needs more people)           │
│                                                             │
│ FOLLOW-THE-SUN:                                             │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 00:00-08:00 UTC: APAC team                             ││
│ │ 08:00-16:00 UTC: EMEA team                             ││
│ │ 16:00-00:00 UTC: Americas team                         ││
│ └─────────────────────────────────────────────────────────┘│
│ Best for: Global teams, no overnight shifts              │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ FAIRNESS RULES:                                             │
│                                                             │
│ • Rotate evenly (track in spreadsheet)                   │
│ • No back-to-back shifts without consent                 │
│ • Holiday/weekend on-call = extra consideration          │
│ • Allow shift swaps                                       │
│ • New team members shadow before solo                    │
│ • Compensate appropriately                                │
└─────────────────────────────────────────────────────────────┘
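
The weekly and follow-the-sun options above can be sketched in a few lines of Python. This is a minimal illustration, not a GitScrum feature: the function names, the start date, and the rule that each week's secondary becomes the next week's primary are assumptions read off the example schedule.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks):
    """Return (week_start, primary, secondary) tuples.

    The secondary is the next person in the list, so they become
    primary the following week, as in the example schedule.
    """
    n = len(engineers)
    schedule = []
    for i in range(weeks):
        primary = engineers[i % n]
        secondary = engineers[(i + 1) % n]
        schedule.append((start + timedelta(weeks=i), primary, secondary))
    return schedule


def follow_the_sun_region(utc_hour):
    """Map a UTC hour (0-23) to the region covering that window."""
    if not 0 <= utc_hour <= 23:
        raise ValueError("utc_hour must be in 0..23")
    if utc_hour < 8:
        return "APAC"
    if utc_hour < 16:
        return "EMEA"
    return "Americas"


team = ["alex", "maria", "jordan", "chen"]
# Hypothetical start date; any Monday works.
schedule = weekly_rotation(team, date(2024, 1, 1), weeks=4)
for week_start, primary, secondary in schedule:
    print(f"{week_start}: @{primary} (primary) / @{secondary} (secondary)")
```

Because the list wraps around, running it for more weeks than there are engineers simply repeats the cycle, which keeps the load even as long as nobody swaps shifts.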

Escalation Path

ESCALATION STRUCTURE:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ LEVEL 1: Primary On-Call                                   │
│ ↓ (if no response in 15 min)                              │
│                                                             │
│ LEVEL 2: Secondary On-Call                                 │
│ ↓ (if no response in 15 min)                              │
│                                                             │
│ LEVEL 3: Team Lead / Manager                              │
│ ↓ (if SEV 1 or business impact)                          │
│                                                             │
│ LEVEL 4: VP Engineering / Executive                       │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ ESCALATION RULES:                                           │
│                                                             │
│ AUTO-ESCALATE WHEN:                                         │
│ • No acknowledgment in 15 minutes                         │
│ • Incident duration > 30 minutes                         │
│ • SEV 1 (always notify leadership)                       │
│ • Customer-reported issue                                 │
│                                                             │
│ PRIMARY SHOULD ESCALATE WHEN:                               │
│ • Outside their expertise                                 │
│ • Need additional hands                                   │
│ • Impact growing                                          │
│ • Unable to resolve alone                                 │
│                                                             │
│ DON'T:                                                      │
│ • Hesitate to escalate                                    │
│ • Try to be a hero                                        │
│ • Wait too long                                           │
└─────────────────────────────────────────────────────────────┘
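
One way to read the escalation rules above is as a small decision function. The sketch below is an interpretation under stated assumptions: the `Incident` fields and the function name are hypothetical, long-running or customer-reported incidents are assumed to bring in at least the secondary, and SEV 1 or business impact is assumed to go straight to level 4. Real paging tools such as PagerDuty or Opsgenie configure this through their own escalation policies.

```python
from dataclasses import dataclass

ACK_TIMEOUT_MIN = 15   # acknowledgment window per level, from the diagram
MAX_DURATION_MIN = 30  # incidents open longer than this auto-escalate

LEVELS = {
    1: "Primary On-Call",
    2: "Secondary On-Call",
    3: "Team Lead / Manager",
    4: "VP Engineering / Executive",
}


@dataclass
class Incident:
    severity: int                # 1 means SEV 1
    minutes_unacknowledged: int
    minutes_open: int
    business_impact: bool = False
    customer_reported: bool = False


def escalation_level(incident):
    """Return the level (1-4) that should be engaged right now."""
    # Each missed 15-minute acknowledgment window moves one level up,
    # capped at level 3 (team lead / manager).
    level = min(1 + incident.minutes_unacknowledged // ACK_TIMEOUT_MIN, 3)
    # Assumption: long or customer-reported incidents need at least level 2.
    if incident.minutes_open > MAX_DURATION_MIN or incident.customer_reported:
        level = max(level, 2)
    # SEV 1 or business impact always reaches leadership.
    if incident.severity == 1 or incident.business_impact:
        level = 4
    return level


incident = Incident(severity=3, minutes_unacknowledged=20, minutes_open=20)
print(LEVELS[escalation_level(incident)])
```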

On-Call Readiness

Runbooks

RUNBOOK ESSENTIALS:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ EVERY SERVICE NEEDS RUNBOOKS:                               │
│                                                             │
│ RUNBOOK TEMPLATE:                                           │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Service: Payment API                                    ││
│ │ Owner: Payments Team                                    ││
│ │ On-call: payments-oncall@company.com                   ││
│ │                                                         ││
│ │ ═══════════════════════════════════════════════════════ ││
│ │                                                         ││
│ │ COMMON ALERTS:                                          ││
│ │                                                         ││
│ │ Alert: payment-api-error-rate-high                     ││
│ │ ─────────────────────────────────────────              ││
│ │ What: Error rate > 5%                                  ││
│ │ Why: Payment requests failing                          ││
│ │                                                         ││
│ │ Steps:                                                  ││
│ │ 1. Check dashboard: [link]                             ││
│ │ 2. Check recent deploys: [link]                        ││
│ │ 3. Check downstream: Stripe status [link]             ││
│ │ 4. Check database: connection pool [link]             ││
│ │                                                         ││
│ │ Common causes:                                          ││
│ │ • Stripe outage → wait or failover                    ││
│ │ • Database full → scale or clean                      ││
│ │ • Bad deploy → rollback                               ││
│ │                                                         ││
│ │ Rollback: kubectl rollout undo deploy/payment-api     ││
│ │                                                         ││
│ │ Escalate if: Unable to identify cause in 30 min      ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ KEEP RUNBOOKS:                                              │
│ • Up to date (review quarterly)                           │
│ • Accessible (not behind VPN)                             │
│ • Actionable (steps, not just info)                      │
│ • Linked in alerts                                        │
└─────────────────────────────────────────────────────────────┘
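
The "keep runbooks" rules above lend themselves to an automated check. The following is a rough sketch only: the `Runbook` fields are hypothetical, and "quarterly" is approximated as 90 days. Accessibility (not behind VPN) is left out because it cannot be judged from metadata alone.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Assumption: "review quarterly" is approximated as every 90 days.
REVIEW_INTERVAL = timedelta(days=90)


@dataclass
class Runbook:
    service: str
    alert: str
    last_reviewed: date
    linked_in_alert: bool
    steps: list = field(default_factory=list)


def runbook_problems(runbook, today):
    """Return a list of rule violations for one runbook entry."""
    problems = []
    if today - runbook.last_reviewed > REVIEW_INTERVAL:
        problems.append("stale: not reviewed in the last quarter")
    if not runbook.steps:
        problems.append("not actionable: no steps listed")
    if not runbook.linked_in_alert:
        problems.append("not linked in its alert")
    return problems


payment_runbook = Runbook(
    service="Payment API",
    alert="payment-api-error-rate-high",
    last_reviewed=date(2024, 1, 10),  # hypothetical review date
    linked_in_alert=True,
    steps=["Check dashboard", "Check recent deploys", "Check downstream"],
)
print(runbook_problems(payment_runbook, today=date(2024, 6, 1)))
```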

On-Call Toolkit

ON-CALL ENGINEER NEEDS:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ ACCESS (Verified before rotation starts):                  │
│                                                             │
│ ☐ Monitoring dashboards (Datadog, Grafana, etc.)         │
│ ☐ Alerting system (PagerDuty, Opsgenie, etc.)            │
│ ☐ Cloud console (AWS, GCP, Azure)                        │
│ ☐ Deployment tools (kubectl, CI/CD)                      │
│ ☐ Database access (read at minimum)                      │
│ ☐ Log aggregation (Splunk, ELK, etc.)                    │
│ ☐ Communication (Slack, incident channel)                │
│ ☐ Status page (to post updates)                          │
│ ☐ Runbooks (wiki, Notion, etc.)                          │
│                                                             │
│ DOCUMENTATION:                                              │
│                                                             │
│ ☐ Runbooks for all services                              │
│ ☐ Architecture diagrams                                   │
│ ☐ Escalation contacts                                     │
│ ☐ Vendor support contacts                                │
│ ☐ Previous incident post-mortems                         │
│                                                             │
│ TOOLS:                                                      │
│                                                             │
│ ☐ Laptop with VPN                                        │
│ ☐ Mobile with alerting app                               │
│ ☐ Charger and backup battery                             │
│ ☐ Reliable internet (or backup hotspot)                 │
│                                                             │
│ BEFORE SHIFT:                                               │
│ • Verify access to all tools                              │
│ • Check pending incidents                                  │
│ • Review recent changes/deploys                           │
│ • Confirm escalation contacts                             │
└─────────────────────────────────────────────────────────────┘

Sprint Planning with On-Call

Capacity Adjustment

ON-CALL CAPACITY IN GITSCRUM:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ SPRINT PLANNING CONSIDERATION:                              │
│                                                             │
│ NORMAL CAPACITY: 28 points                                 │
│                                                             │
│ ON-CALL ADJUSTMENTS:                                        │
│                                                             │
│ Full week on-call:                                        │
│ • -25% to -50% capacity depending on page volume         │
│ • @jordan: 28 pts → 14-21 pts                            │
│                                                             │
│ Half week on-call:                                         │
│ • -10% to -25% capacity                                   │
│ • @alex: 28 pts → 21-25 pts                              │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ SPRINT VIEW:                                                │
│                                                             │
│ @alex    [██████████████████████░░░] 24 pts (4 pts on-call)│
│ @maria   [████████████████████████░] 28 pts                │
│ @jordan  [██████████████░░░░░░░░░░░] 18 pts (10 pts on-call)│
│ @chen    [████████████████████████░] 28 pts                │
│                                                             │
│ Total: 98 pts (vs 112 standard)                           │
│                                                             │
│ ON-CALL TASKS:                                              │
│ • Separate column or label                                │
│ • Don't count against velocity                            │
│ • Track for workload visibility                           │
│                                                             │
│ SPRINT ON-CALL BUDGET:                                      │
│ Reserve 10% of capacity for unexpected incidents         │
│ If not used, pull additional work                         │
└─────────────────────────────────────────────────────────────┘
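
The capacity arithmetic above is easy to reproduce. This sketch recomputes the ranges and the sprint total from the figures in the diagram; the function and variable names are illustrative, and applying the 10% incident reserve to the adjusted sprint total is an assumption, since the diagram does not say which capacity figure it refers to.

```python
NORMAL_CAPACITY = 28  # points per person per sprint, from the example


def adjusted_range(normal, min_cut, max_cut):
    """Return (low, high) capacity after a cut between min_cut and max_cut."""
    low = round(normal * (1 - max_cut))
    high = round(normal * (1 - min_cut))
    return low, high


full_week = adjusted_range(NORMAL_CAPACITY, 0.25, 0.50)
half_week = adjusted_range(NORMAL_CAPACITY, 0.10, 0.25)

# Points set aside for on-call in the sprint view.
on_call_points = {"alex": 4, "maria": 0, "jordan": 10, "chen": 0}
planned = {name: NORMAL_CAPACITY - pts for name, pts in on_call_points.items()}

sprint_total = sum(planned.values())
standard_total = NORMAL_CAPACITY * len(planned)
# Assumption: the 10% reserve is taken from the adjusted sprint total.
incident_budget = round(0.10 * sprint_total, 1)

print(f"Full week on-call: {full_week[0]}-{full_week[1]} pts")
print(f"Half week on-call: {half_week[0]}-{half_week[1]} pts")
print(f"Total: {sprint_total} pts (vs {standard_total} standard)")
print(f"Incident reserve: {incident_budget} pts")
```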

On-Call Handoff

ROTATION HANDOFF PROCESS:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ BEFORE SHIFT ENDS:                                          │
│                                                             │
│ OUTGOING ON-CALL PROVIDES:                                  │
│                                                             │
│ 1. ACTIVE ISSUES                                           │
│    • Open incidents                                        │
│    • Ongoing problems                                      │
│    • Things to watch                                       │
│                                                             │
│ 2. RECENT CHANGES                                           │
│    • Deploys this week                                     │
│    • Config changes                                        │
│    • New features launched                                │
│                                                             │
│ 3. UPCOMING RISKS                                           │
│    • Planned maintenance                                   │
│    • Large customer events                                │
│    • Known risky deploys                                  │
│                                                             │
│ 4. LEARNINGS                                                │
│    • New runbook entries                                  │
│    • Gotchas discovered                                    │
│    • Helpful tips                                         │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ HANDOFF MEETING (15 min):                                   │
│                                                             │
│ • Sync call or written handoff                            │
│ • Document in Slack/wiki                                  │
│ • Confirm incoming has access                             │
│ • Transfer pager/alert routing                            │
│ • Incoming confirms ready                                 │
│                                                             │
│ GITSCRUM:                                                   │
│ Handoff note linked to on-call rotation record            │
└─────────────────────────────────────────────────────────────┘
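
A written handoff is easier to keep consistent when it follows a fixed template. The sketch below renders the four sections listed above into a plain-text note that could be pasted into Slack or a wiki. The function name and note layout are illustrative assumptions, not a GitScrum format.

```python
SECTIONS = ["Active issues", "Recent changes", "Upcoming risks", "Learnings"]


def render_handoff(outgoing, incoming, notes):
    """Build a plain-text handoff note with every section present."""
    lines = [f"On-call handoff: @{outgoing} -> @{incoming}"]
    for section in SECTIONS:
        lines.append("")
        lines.append(f"{section.upper()}:")
        # An empty section is stated explicitly so nothing looks forgotten.
        for item in notes.get(section) or ["(none)"]:
            lines.append(f"  • {item}")
    return "\n".join(lines)


note = render_handoff(
    outgoing="alex",
    incoming="maria",
    notes={
        "Active issues": ["Elevated error rate on payment-api, watching"],
        "Recent changes": ["Config change to connection pool size"],
    },
)
print(note)
```

Listing empty sections as "(none)" makes it clear the outgoing engineer considered each category rather than skipping it.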

Sustainability

Preventing Burnout

SUSTAINABLE ON-CALL PRACTICES:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ METRICS TO MONITOR:                                         │
│                                                             │
│ Pages per week:                                            │
│ Target: < 5 per week                                      │
│ Current: 8 ⚠️ (investigate noisy alerts)                  │
│                                                             │
│ Off-hours pages:                                           │
│ Target: < 2 per week                                      │
│ Current: 4 ⚠️ (prioritize automation)                     │
│                                                             │
│ False positive rate:                                       │
│ Target: < 10%                                             │
│ Current: 25% ❌ (fix alerting)                            │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ REDUCING ON-CALL BURDEN:                                    │
│                                                             │
│ FIX NOISY ALERTS:                                           │
│ • Review all alerts quarterly                             │
│ • Delete or fix low-value alerts                         │
│ • Tune thresholds                                         │
│ • Add automation for common fixes                         │
│                                                             │
│ IMPROVE RELIABILITY:                                        │
│ • Fix root causes, not just symptoms                     │
│ • Invest in infrastructure                                │
│ • Better testing and canaries                             │
│ • Chaos engineering                                        │
│                                                             │
│ SUPPORT ON-CALL:                                            │
│ • Compensate fairly                                       │
│ • Allow recovery time after heavy shifts                 │
│ • Celebrate quiet weeks                                   │
│ • Leadership does on-call too                             │
│                                                             │
│ TEAM SIZE:                                                  │
│ Minimum 4-5 people for sustainable rotation               │
│ Smaller = too frequent shifts                             │
└─────────────────────────────────────────────────────────────┘
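
The metric targets above can be checked mechanically at the end of each week. This is a minimal sketch with illustrative names. The sample call uses the "current" figures from the diagram, where 2 false alarms out of 8 pages is an assumed split that gives the 25% false positive rate shown.

```python
# Strict upper bounds taken from the targets above.
TARGETS = {
    "pages_per_week": 5,
    "off_hours_pages_per_week": 2,
    "false_positive_rate": 0.10,
}


def missed_targets(pages, off_hours_pages, false_positives):
    """Return the names of metrics that are not below their target."""
    rate = false_positives / pages if pages else 0.0
    current = {
        "pages_per_week": pages,
        "off_hours_pages_per_week": off_hours_pages,
        "false_positive_rate": rate,
    }
    # Targets read "< N", so hitting the bound exactly counts as a miss.
    return [name for name, value in current.items() if value >= TARGETS[name]]


print(missed_targets(pages=8, off_hours_pages=4, false_positives=2))
```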