On-Call Management with GitScrum | Rotations & Runbooks
Run fair on-call rotations without burnout. Set escalation paths, create runbooks, adjust sprint capacity, and manage handoffs with GitScrum.
On-call shouldn't burn out your team. GitScrum helps manage on-call tasks, track incidents, and ensure fair rotation of responsibilities.
On-Call Fundamentals
Why On-Call Exists
ON-CALL PURPOSE AND GOALS:
PURPOSE:
  Ensure someone is always available to respond to
  production issues that affect users.

HEALTHY ON-CALL:
  ✅ Fair rotation among team
  ✅ Clear expectations and runbooks
  ✅ Appropriate compensation
  ✅ Low false-alarm rate
  ✅ Sustainable workload
  ✅ Learning from incidents

UNHEALTHY ON-CALL:
  ❌ Same people always on-call
  ❌ Constant false alarms (alert fatigue)
  ❌ No documentation or runbooks
  ❌ Expected to fix everything alone
  ❌ No time compensation
  ❌ Burnout and attrition

ON-CALL PRINCIPLE:
  "If you build it, you run it"
  Team owns their services end-to-end
  Creates accountability and better design
Rotation Setup
Building the Schedule
ON-CALL ROTATION DESIGN:
ROTATION OPTIONS:

WEEKLY ROTATION:
  Week 1: @alex (primary) / @maria (secondary)
  Week 2: @maria (primary) / @jordan (secondary)
  Week 3: @jordan (primary) / @chen (secondary)
  Week 4: @chen (primary) / @alex (secondary)

  Best for: Larger teams, lower page volume

DAILY ROTATION:
  Each person on-call for 1-2 days
  Best for: High page volume, more people needed

FOLLOW-THE-SUN:
  00:00-08:00 UTC: APAC team
  08:00-16:00 UTC: EMEA team
  16:00-00:00 UTC: Americas team

  Best for: Global teams, no overnight shifts

FAIRNESS RULES:
  • Rotate evenly (track in spreadsheet)
  • No back-to-back shifts without consent
  • Holiday/weekend on-call = extra consideration
  • Allow shift swaps
  • New team members shadow before solo
  • Compensate appropriately
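The weekly schedule above follows a simple pattern: each week's secondary becomes the next week's primary. A minimal Python sketch of that pattern is shown here. The function names `weekly_rotation` and `primary_counts` are illustrative assumptions, not part of GitScrum or any paging tool.

```python
from collections import Counter


def weekly_rotation(engineers: list[str], weeks: int) -> list[tuple[int, str, str]]:
    """Return (week_number, primary, secondary) for each week.

    The secondary of week N is the primary of week N+1, which mirrors
    the example schedule in the text.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary/secondary")
    size = len(engineers)
    schedule = []
    for week in range(weeks):
        primary = engineers[week % size]
        secondary = engineers[(week + 1) % size]
        schedule.append((week + 1, primary, secondary))
    return schedule


def primary_counts(schedule: list[tuple[int, str, str]]) -> dict[str, int]:
    """Count primary shifts per person, a rough check that rotation is even."""
    return dict(Counter(primary for _, primary, _ in schedule))


team = ["alex", "maria", "jordan", "chen"]
for week, primary, secondary in weekly_rotation(team, 4):
    print(f"Week {week}: @{primary} (primary) / @{secondary} (secondary)")
```

Running `primary_counts` over a longer horizon (for example a quarter) is one way to do the "rotate evenly" tracking without a spreadsheet. Shift swaps and holiday weighting are left out of this sketch.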
Escalation Path
ESCALATION STRUCTURE:
LEVEL 1: Primary On-Call
   ↓ (if no response in 15 min)
LEVEL 2: Secondary On-Call
   ↓ (if no response in 15 min)
LEVEL 3: Team Lead / Manager
   ↓ (if SEV 1 or business impact)
LEVEL 4: VP Engineering / Executive

ESCALATION RULES:

AUTO-ESCALATE WHEN:
  • No acknowledgment in 15 minutes
  • Incident duration > 30 minutes
  • SEV 1 (always notify leadership)
  • Customer-reported issue

PRIMARY SHOULD ESCALATE WHEN:
  • Outside their expertise
  • Need additional hands
  • Impact growing
  • Unable to resolve alone

DON'T:
  • Hesitate to escalate
  • Try to be a hero
  • Wait too long
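The auto-escalation rules above can be read as four independent checks. The following Python sketch encodes them under that assumption. The `Incident` fields and the function name `auto_escalation_reasons` are hypothetical, and a real paging tool such as PagerDuty or Opsgenie would express the same rules in its own policy configuration.

```python
from dataclasses import dataclass

ACK_TIMEOUT_MIN = 15   # "No acknowledgment in 15 minutes"
MAX_DURATION_MIN = 30  # "Incident duration > 30 minutes"


@dataclass
class Incident:
    severity: int                 # 1 means SEV 1
    minutes_unacknowledged: int   # time since the page with no acknowledgment
    minutes_open: int             # total incident duration so far
    customer_reported: bool = False


def auto_escalation_reasons(incident: Incident) -> list[str]:
    """Return every auto-escalation rule the incident currently triggers."""
    reasons = []
    if incident.minutes_unacknowledged >= ACK_TIMEOUT_MIN:
        reasons.append("no acknowledgment in 15 minutes")
    if incident.minutes_open > MAX_DURATION_MIN:
        reasons.append("incident duration > 30 minutes")
    if incident.severity == 1:
        reasons.append("SEV 1")
    if incident.customer_reported:
        reasons.append("customer-reported issue")
    return reasons


quiet = Incident(severity=3, minutes_unacknowledged=0, minutes_open=10)
print(auto_escalation_reasons(quiet))  # no rule triggered, so an empty list
```

An empty list means no automatic escalation. The primary can still escalate manually for the judgment-based reasons listed above.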
On-Call Readiness
Runbooks
RUNBOOK ESSENTIALS:
EVERY SERVICE NEEDS RUNBOOKS:

RUNBOOK TEMPLATE:

  Service: Payment API
  Owner: Payments Team
  On-call: payments-oncall@company.com

  COMMON ALERTS:

  Alert: payment-api-error-rate-high
    What: Error rate > 5%
    Why: Payment requests failing

    Steps:
      1. Check dashboard: [link]
      2. Check recent deploys: [link]
      3. Check downstream: Stripe status [link]
      4. Check database: connection pool [link]

    Common causes:
      • Stripe outage → wait or failover
      • Database full → scale or clean
      • Bad deploy → rollback

    Rollback: kubectl rollout undo deploy/payment-api

    Escalate if: Unable to identify cause in 30 min

KEEP RUNBOOKS:
  • Up to date (review quarterly)
  • Accessible (not behind VPN)
  • Actionable (steps, not just info)
  • Linked in alerts
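One way to keep runbooks actionable is to store each alert entry as structured data and check that no section of the template is empty. The sketch below assumes a plain dictionary per alert. The field names and the `missing_fields` helper are hypothetical and only mirror the headings of the template above.

```python
# Sections every alert entry should fill in, taken from the template headings.
REQUIRED_FIELDS = ("what", "why", "steps", "common_causes", "rollback", "escalate_if")


def missing_fields(entry: dict) -> list[str]:
    """Return the template sections that are absent or empty in an entry."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]


payment_alert = {
    "alert": "payment-api-error-rate-high",
    "what": "Error rate > 5%",
    "why": "Payment requests failing",
    "steps": [
        "Check dashboard",
        "Check recent deploys",
        "Check downstream: Stripe status",
        "Check database: connection pool",
    ],
    "common_causes": {
        "Stripe outage": "wait or failover",
        "Database full": "scale or clean",
        "Bad deploy": "rollback",
    },
    "rollback": "kubectl rollout undo deploy/payment-api",
    "escalate_if": "Unable to identify cause in 30 min",
}

print(missing_fields(payment_alert))
```

A check like this could run during the quarterly runbook review to flag entries that describe an alert but give no steps or rollback.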
On-Call Toolkit
ON-CALL ENGINEER NEEDS:
ACCESS (Verified before rotation starts):
  ☐ Monitoring dashboards (Datadog, Grafana, etc.)
  ☐ Alerting system (PagerDuty, Opsgenie, etc.)
  ☐ Cloud console (AWS, GCP, Azure)
  ☐ Deployment tools (kubectl, CI/CD)
  ☐ Database access (read at minimum)
  ☐ Log aggregation (Splunk, ELK, etc.)
  ☐ Communication (Slack, incident channel)
  ☐ Status page (to post updates)
  ☐ Runbooks (wiki, Notion, etc.)

DOCUMENTATION:
  ☐ Runbooks for all services
  ☐ Architecture diagrams
  ☐ Escalation contacts
  ☐ Vendor support contacts
  ☐ Previous incident post-mortems

TOOLS:
  ☐ Laptop with VPN
  ☐ Mobile with alerting app
  ☐ Charger and backup battery
  ☐ Reliable internet (or backup hotspot)

BEFORE SHIFT:
  • Verify access to all tools
  • Check pending incidents
  • Review recent changes/deploys
  • Confirm escalation contacts
Sprint Planning with On-Call
Capacity Adjustment
ON-CALL CAPACITY IN GITSCRUM:
SPRINT PLANNING CONSIDERATION:

NORMAL CAPACITY: 28 points

ON-CALL ADJUSTMENTS:

  Full week on-call:
    • -25% to -50% capacity depending on page volume
    • @jordan: 28 pts → 14-21 pts

  Half week on-call:
    • -10% to -25% capacity
    • @alex: 28 pts → 21-25 pts

SPRINT VIEW:

  @alex    24 pts (4 pts on-call)
  @maria   28 pts
  @jordan  18 pts (10 pts on-call)
  @chen    28 pts

  Total: 98 pts (vs 112 standard)

ON-CALL TASKS:
  • Separate column or label
  • Don't count against velocity
  • Track for workload visibility

SPRINT ON-CALL BUDGET:
  Reserve 10% of capacity for unexpected incidents
  If not used, pull additional work
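The capacity figures above are plain percentage arithmetic, which the following Python sketch reproduces. The helper names `adjusted_capacity` and `sprint_total` are illustrative assumptions. GitScrum itself is not being called here.

```python
def adjusted_capacity(normal_points: int, min_cut_pct: int, max_cut_pct: int) -> tuple[float, float]:
    """Return the (lowest, highest) remaining capacity after an on-call cut.

    min_cut_pct and max_cut_pct are the reduction range in percent,
    for example 25 and 50 for a full week on-call.
    """
    lowest = normal_points * (100 - max_cut_pct) / 100
    highest = normal_points * (100 - min_cut_pct) / 100
    return lowest, highest


def sprint_total(team: dict[str, tuple[int, int]]) -> int:
    """Sum plannable points, given (normal_points, on_call_points) per person."""
    return sum(normal - on_call for normal, on_call in team.values())


# Full week on-call: 28 pts with a 25% to 50% cut
print(adjusted_capacity(28, 25, 50))

# Sprint view from the text: alex gives up 4 pts, jordan gives up 10 pts
team = {"alex": (28, 4), "maria": (28, 0), "jordan": (28, 10), "chen": (28, 0)}
print(sprint_total(team))
```

With the numbers from the sprint view, the total comes to 98 points against a standard 112 (4 people at 28 points each).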
On-Call Handoff
ROTATION HANDOFF PROCESS:
BEFORE SHIFT ENDS:

OUTGOING ON-CALL PROVIDES:

  1. ACTIVE ISSUES
     • Open incidents
     • Ongoing problems
     • Things to watch

  2. RECENT CHANGES
     • Deploys this week
     • Config changes
     • New features launched

  3. UPCOMING RISKS
     • Planned maintenance
     • Large customer events
     • Known risky deploys

  4. LEARNINGS
     • New runbook entries
     • Gotchas discovered
     • Helpful tips

HANDOFF MEETING (15 min):
  • Sync call or written handoff
  • Document in Slack/wiki
  • Confirm incoming has access
  • Transfer pager/alert routing
  • Incoming confirms ready

GITSCRUM:
  Handoff note linked to on-call rotation record
Sustainability
Preventing Burnout
SUSTAINABLE ON-CALL PRACTICES:
METRICS TO MONITOR:

  Pages per shift:
    Target: < 5 per week
    Current: 8 ⚠️ (investigate noisy alerts)

  Off-hours pages:
    Target: < 2 per week
    Current: 4 ⚠️ (prioritize automation)

  False positive rate:
    Target: < 10%
    Current: 25% ❌ (fix alerting)

REDUCING ON-CALL BURDEN:

FIX NOISY ALERTS:
  • Review all alerts quarterly
  • Delete or fix low-value alerts
  • Tune thresholds
  • Add automation for common fixes

IMPROVE RELIABILITY:
  • Fix root causes, not just symptoms
  • Invest in infrastructure
  • Better testing and canaries
  • Chaos engineering

SUPPORT ON-CALL:
  • Compensate fairly
  • Allow recovery time after heavy shifts
  • Celebrate quiet weeks
  • Leadership does on-call too

TEAM SIZE:
  Minimum 4-5 people for sustainable rotation
  Smaller = too frequent shifts
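The three metrics above can be compared against their targets with a few lines of Python. This is a sketch only: the metric keys, `false_positive_rate` and `breached_targets` are hypothetical names, and the split of 2 false alarms out of 8 pages is an illustrative assumption chosen to match the 25% figure in the text.

```python
# Targets from the text; a metric is healthy only while strictly below its target.
TARGETS = {
    "pages_per_week": 5,
    "off_hours_pages_per_week": 2,
    "false_positive_rate": 0.10,
}


def false_positive_rate(false_pages: int, total_pages: int) -> float:
    """Share of pages that needed no action; 0.0 when there were no pages."""
    if total_pages == 0:
        return 0.0
    return false_pages / total_pages


def breached_targets(current: dict[str, float]) -> list[str]:
    """Return the metrics that are at or above their target."""
    return [name for name, limit in TARGETS.items() if current[name] >= limit]


current = {
    "pages_per_week": 8,
    "off_hours_pages_per_week": 4,
    "false_positive_rate": false_positive_rate(2, 8),  # assumed split
}
print(breached_targets(current))
```

With the current values from the text, all three metrics come back as breached, which matches the warnings shown above.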