Setting Up Effective On-Call Rotations
On-call rotations spread responsibility for production incidents across team members, so systems are covered around the clock without any single person carrying the whole burden. GitScrum's team management, NoteVault documentation, and task assignment features help teams organize fair rotations, keep runbooks accessible, track incident workload, and improve on-call processes based on real experience.
Rotation Design
Schedule Patterns
ON-CALL SCHEDULE OPTIONS:
┌─────────────────────────────────────────────────────────────┐
│ ROTATION PATTERNS │
├─────────────────────────────────────────────────────────────┤
│ │
│ WEEKLY ROTATION: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Week 1: @maria (primary), @carlos (secondary) ││
│ │ Week 2: @carlos (primary), @ana (secondary) ││
│ │ Week 3: @ana (primary), @pedro (secondary) ││
│ │ Week 4: @pedro (primary), @maria (secondary) ││
│ │ ││
│ │ Handoff: Monday 9am ││
│ │ ││
│ │ Pros: Long enough to get context, fewer handoffs ││
│ │ Cons: Full week can be draining if busy ││
│ │ ││
│ │ Best for: Smaller teams, lower incident volume ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ DAILY ROTATION: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Mon: @maria → Tue: @carlos → Wed: @ana → ││
│ │ Thu: @pedro → Fri: @maria → Weekend: @carlos ││
│ │ ││
│ │ Handoff: 9am each day ││
│ │ ││
│ │ Pros: Shorter burden, more balanced ││
│ │ Cons: Many handoffs, context switching ││
│ │ ││
│ │ Best for: High-incident environments ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ FOLLOW-THE-SUN: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Americas (9am-5pm EST): @maria, @carlos ││
│ │ Europe (9am-5pm CET): @ana, @pedro ││
│ │ Asia (9am-5pm JST): @yuki, @lei ││
│ │ ││
│ │ Handoff: At region shift change ││
│ │ ││
│ │ Pros: No night pages, work-hours only ││
│ │ Cons: Requires distributed team; the 9am-5pm windows ││
│ │ shown leave a gap (5pm EST to 9am JST) unless extended ││
│ │ ││
│ │ Best for: Global teams ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ HYBRID (Business/After-Hours): │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Business hours (9am-6pm): Whole team responds ││
│ │ After hours + weekends: Dedicated on-call person ││
│ │ ││
│ │ Pros: Distributes day load, focused night rotation ││
│ │ Cons: Needs clear escalation rules ││
│ │ ││
│ │ Best for: Teams with predictable work-hours issues ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
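Whether a set of regional shifts really adds up to round-the-clock coverage can be checked with a little arithmetic. The sketch below is illustrative only: it assumes fixed standard-time offsets (EST = UTC-5, CET = UTC+1, JST = UTC+9) and ignores daylight saving, and the names `SHIFTS`, `covered_utc_hours`, and `uncovered_utc_hours` are hypothetical.

```python
# Regional shifts from the follow-the-sun example:
# (region, UTC offset in hours, local start hour, local end hour).
# Fixed standard-time offsets are an assumption; daylight saving shifts them.
SHIFTS = [
    ("Americas", -5, 9, 17),  # 9am-5pm EST
    ("Europe", 1, 9, 17),     # 9am-5pm CET
    ("Asia", 9, 9, 17),       # 9am-5pm JST
]


def covered_utc_hours(shifts):
    """Return the set of UTC hours (0-23) during which some region is on shift."""
    covered = set()
    for _region, utc_offset, start, end in shifts:
        for local_hour in range(start, end):
            covered.add((local_hour - utc_offset) % 24)
    return covered


def uncovered_utc_hours(shifts):
    """Return the sorted UTC hours that no region covers."""
    return sorted(set(range(24)) - covered_utc_hours(shifts))


if __name__ == "__main__":
    print(uncovered_utc_hours(SHIFTS))  # prints [22, 23]
```

Under these assumptions the three 9am-5pm windows leave 22:00-24:00 UTC uncovered, so one region would need a longer shift or a designated backup for that window.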
Team Considerations
FAIR ROTATION DESIGN:
┌─────────────────────────────────────────────────────────────┐
│ BUILDING SUSTAINABLE SCHEDULES │
├─────────────────────────────────────────────────────────────┤
│ │
│ MINIMUM TEAM SIZE: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ For 24/7 coverage without burnout: ││
│ │ ││
│ │ • 4 people minimum: 1 week per month each ││
│ │ • 6 people better: ~5 days per month each ││
│ │ • 8 people ideal: 1 week every 2 months ││
│ │ ││
│ │ Rule: No one should be on-call > 25% of time ││
│ │ ││
│ │ If team too small: ││
│ │ • Share rotation across teams ││
│ │ • Consider on-call as paid overtime ││
│ │ • Invest in reducing incidents ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ EXPERIENCE DISTRIBUTION: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Pairing junior + senior: ││
│ │ ││
│ │ Week 1: @senior-maria (primary), @junior-tom (shadow) ││
│ │ Week 2: @junior-tom (primary), @senior-carlos (backup) ││
│ │ ││
│ │ Progression path: ││
│ │ 1. Shadow (observe, learn) ││
│ │ 2. Primary with senior backup ││
│ │ 3. Full primary ││
│ │ ││
│ │ Never: Junior alone without escalation path ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ ACCOMMODATIONS: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Support different needs: ││
│ │ ││
│ │ • Parents with young kids: Avoid overnight shifts ││
│ │ • Timezone constraints: Match to working hours ││
│ │ • Vacation/holidays: Plan swaps in advance ││
│ │ • Health/mental health: Opt-out without stigma ││
│ │ ││
│ │ Track in GitScrum: ││
│ │ • Note availability constraints in user settings ││
│ │ • Use team calendar for visibility ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
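The team-size guidance above comes down to simple division. As a rough sketch, assuming an evenly split rotation that counts primary duty only and a 30-day month (both assumptions, as are the function names), the 25% rule can be checked like this:

```python
MAX_ON_CALL_FRACTION = 0.25  # the "no one on-call > 25% of time" rule


def on_call_fraction(team_size):
    """Share of time each person is primary in an evenly split rotation."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return 1 / team_size


def is_sustainable(team_size):
    """True when each person's share stays within the 25% rule."""
    return on_call_fraction(team_size) <= MAX_ON_CALL_FRACTION


for size in (3, 4, 6, 8):
    fraction = on_call_fraction(size)
    days = fraction * 30  # approximate days per 30-day month
    print(f"{size} people: {fraction:.1%} of time, ~{days:.1f} days/month, "
          f"sustainable={is_sustainable(size)}")
```

Secondary duty is not counted here; a team that also rotates a secondary spends roughly twice this share in some on-call role.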
Tracking in GitScrum
Schedule Management
ORGANIZING ON-CALL IN GITSCRUM:
┌─────────────────────────────────────────────────────────────┐
│ TRACKING ROTATIONS │
├─────────────────────────────────────────────────────────────┤
│ │
│ CURRENT ON-CALL VISIBILITY: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ NoteVault: "On-Call Schedule" ││
│ │ ││
│ │ # Current On-Call ││
│ │ ││
│ │ **This Week (Dec 16-22):** ││
│ │ - Primary: @maria ││
│ │ - Secondary: @carlos ││
│ │ ││
│ │ **Next Week (Dec 23-29):** ││
│ │ - Primary: @carlos ││
│ │ - Secondary: @ana ││
│ │ ││
│ │ ## Full Rotation ││
│ │ | Week | Primary | Secondary | ││
│ │ |-----------|---------|-----------| ││
│ │ | Dec 16-22 | Maria | Carlos | ││
│ │ | Dec 23-29 | Carlos | Ana | ││
│ │ | Dec 30-Jan 5 | Ana | Pedro | ││
│ │ | Jan 6-12 | Pedro | Maria | ││
│ │ ││
│ │ Pin this note to project for easy access ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ SHIFT HANDOFF CHECKLIST: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Create recurring task: "On-Call Handoff" ││
│ │ Every Monday at 9am ││
│ │ ││
│ │ Checklist: ││
│ │ ☐ Outgoing: Post handoff summary in Discussions ││
│ │ ☐ Outgoing: Note any ongoing issues ││
│ │ ☐ Incoming: Confirm pager/phone is working ││
│ │ ☐ Incoming: Review recent incidents ││
│ │ ☐ Incoming: Check scheduled maintenance windows ││
│ │ ☐ Both: Acknowledge handoff in #on-call channel ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ SWAP REQUESTS: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Task: "On-Call Swap Request" ││
│ │ ││
│ │ Template: ││
│ │ Requester: @carlos ││
│ │ Shift: Dec 23-29 ││
│ │ Reason: Holiday travel ││
│ │ Proposed swap with: @pedro (week of Jan 6) ││
│ │ Status: Pending approval ││
│ │ ││
│ │ Use labels: oncall/swap-request ││
│ │ Assign to: Rotation manager ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
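The rotation table in the "On-Call Schedule" note follows a simple pattern: each week's secondary becomes the next week's primary. A minimal sketch of generating that table, assuming the year is 2024 and that the note accepts a Markdown table (the helper names are hypothetical, and this does not call any GitScrum API):

```python
from datetime import date, timedelta

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def week_label(start):
    """Format a Monday-to-Sunday week like 'Dec 16-22' or 'Dec 30-Jan 5'."""
    end = start + timedelta(days=6)
    if start.month == end.month:
        return f"{MONTHS[start.month - 1]} {start.day}-{end.day}"
    return (f"{MONTHS[start.month - 1]} {start.day}-"
            f"{MONTHS[end.month - 1]} {end.day}")


def build_rotation(people, first_monday, weeks):
    """Yield (label, primary, secondary); this week's secondary is next week's primary."""
    for i in range(weeks):
        start = first_monday + timedelta(weeks=i)
        primary = people[i % len(people)]
        secondary = people[(i + 1) % len(people)]
        yield week_label(start), primary, secondary


def to_markdown(rows):
    """Render rotation rows as a Markdown table."""
    lines = ["| Week | Primary | Secondary |", "|------|---------|-----------|"]
    lines += [f"| {week} | {p} | {s} |" for week, p, s in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    team = ["Maria", "Carlos", "Ana", "Pedro"]
    print(to_markdown(build_rotation(team, date(2024, 12, 16), weeks=4)))
```

Run with the four-person team above, this reproduces the Dec 16 through Jan 12 rows shown in the note; approved swaps would still be edited in by hand.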
Incident Load Tracking
MEASURING ON-CALL BURDEN:
┌─────────────────────────────────────────────────────────────┐
│ WORKLOAD VISIBILITY │
├─────────────────────────────────────────────────────────────┤
│ │
│ INCIDENT LOGGING: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Every incident = task with labels: ││
│ │ ││
│ │ Task: "Incident: Database connection timeout" ││
│ │ Labels: ││
│ │ • type/incident ││
│ │ • severity/P2 ││
│ │ • oncall/after-hours (or oncall/business-hours) ││
│ │ ││
│ │ Assigned to: On-call person who responded ││
│ │ ││
│ │ Track in description: ││
│ │ • Time paged: 2:34 AM ││
│ │ • Time acknowledged: 2:37 AM ││
│ │ • Time resolved: 3:15 AM ││
│ │ • Sleep disruption: Yes ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ WEEKLY SUMMARY: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Post in Discussions after each rotation: ││
│ │ ││
│ │ "On-Call Summary: Week of Dec 16" ││
│ │ On-call: @maria ││
│ │ ││
│ │ Stats: ││
│ │ • Total incidents: 5 ││
│ │ • After-hours pages: 2 ││
│ │ • Sleep disruptions: 1 (3am, Tuesday) ││
│ │ • Time spent: ~4 hours ││
│ │ ││
│ │ Notable issues: ││
│ │ • Payment service timeouts (link to incident) ││
│ │ ││
│ │ Improvements needed: ││
│ │ • Update runbook for payment issues ││
│ │ • Add alert for connection pool exhaustion ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ MONTHLY METRICS: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Track over time: ││
│ │ ││
│ │ Person │ Shifts │ Pages │ After-hrs │ Avg Time ││
│ │ ───────────┼────────┼───────┼───────────┼────────── ││
│ │ Maria │ 4 │ 12 │ 3 │ 1.5h ││
│ │ Carlos │ 4 │ 18 │ 7 │ 2.1h ││
│ │ Ana │ 4 │ 8 │ 2 │ 1.0h ││
│ │ Pedro │ 4 │ 15 │ 5 │ 1.8h ││
│ │ ││
│ │ Investigate: Why is Carlos getting more pages? ││
│ │ (Maybe shift timing aligns with problematic window) ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
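The timestamps logged on each incident and the monthly page counts can be turned into numbers with a few lines of code. The sketch below is one possible approach, not a GitScrum feature: it assumes same-day clock times in the "2:34 AM" style, and the 1.25x-of-average threshold for flagging someone is an arbitrary illustrative choice.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start, end, fmt="%I:%M %p"):
    """Minutes between two same-day clock times such as '2:34 AM' and '3:15 AM'."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


def above_average(pages_by_person, ratio=1.25):
    """People whose page count exceeds `ratio` times the team mean (assumed threshold)."""
    threshold = mean(pages_by_person.values()) * ratio
    return [name for name, pages in pages_by_person.items() if pages > threshold]


if __name__ == "__main__":
    print(minutes_between("2:34 AM", "2:37 AM"))  # time to acknowledge
    print(minutes_between("2:34 AM", "3:15 AM"))  # time to resolve
    print(above_average({"Maria": 12, "Carlos": 18, "Ana": 8, "Pedro": 15}))
```

With the figures from the tables above, the incident took 3 minutes to acknowledge and 41 minutes to resolve, and only Carlos is flagged for follow-up.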
Runbook Management
Documentation Structure
RUNBOOK ORGANIZATION:
┌─────────────────────────────────────────────────────────────┐
│ ON-CALL RUNBOOKS IN NOTEVAULT │
├─────────────────────────────────────────────────────────────┤
│ │
│ FOLDER STRUCTURE: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Runbooks/ ││
│ │ ├── Getting Started.md ││
│ │ ├── Alert Reference/ ││
│ │ │ ├── Database Alerts.md ││
│ │ │ ├── API Alerts.md ││
│ │ │ ├── Payment Alerts.md ││
│ │ │ └── Infrastructure Alerts.md ││
│ │ ├── Common Procedures/ ││
│ │ │ ├── Restart Services.md ││
│ │ │ ├── Database Failover.md ││
│ │ │ ├── Rollback Deployment.md ││
│ │ │ └── Escalation Guide.md ││
│ │ └── Post-Incident/ ││
│ │ ├── Template.md ││
│ │ └── [incident reports...] ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ RUNBOOK TEMPLATE: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ # [Alert Name] Runbook ││
│ │ ││
│ │ ## What is this alert? ││
│ │ Brief explanation of what triggered it. ││
│ │ ││
│ │ ## Who gets paged? ││
│ │ Primary on-call, escalate to [team] if unresolved. ││
│ │ ││
│ │ ## Severity ││
│ │ P2 - Service degraded but functional ││
│ │ ││
│ │ ## Quick check ││
│ │ 1. Is this a false positive? Check [dashboard] ││
│ │ 2. Is deployment in progress? Check [deploy status] ││
│ │ ││
│ │ ## Resolution steps ││
│ │ 1. Step one with command examples ││
│ │ 2. Step two with what to check ││
│ │ 3. If X, do Y. If Z, escalate. ││
│ │ ││
│ │ ## Escalation ││
│ │ If not resolved in 15 min, page secondary. ││
│ │ If critical, page @manager. ││
│ │ ││
│ │ ## Recent incidents ││
│ │ - Dec 15: [link to incident report] ││
│ │ - Nov 28: [link to incident report] ││
│ │ ││
│ │ *Last updated: Dec 16, 2024 by @maria* ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
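A template only helps if runbooks actually follow it. As a hedged sketch, assuming runbooks are exported as Markdown text with `## ` section headings exactly as in the template above (the function name is hypothetical), a completeness check might look like this:

```python
import re

# Section headings taken from the runbook template.
REQUIRED_SECTIONS = [
    "What is this alert?",
    "Who gets paged?",
    "Severity",
    "Quick check",
    "Resolution steps",
    "Escalation",
    "Recent incidents",
]


def missing_sections(runbook_text):
    """Return required section headings that do not appear as '## ' lines."""
    headings = re.findall(r"^##\s+(.+)$", runbook_text, flags=re.MULTILINE)
    found = {heading.strip() for heading in headings}
    return [section for section in REQUIRED_SECTIONS if section not in found]


if __name__ == "__main__":
    draft = "# Payment Timeout Runbook\n\n## Severity\nP2\n\n## Escalation\nPage secondary.\n"
    print(missing_sections(draft))
```

An empty list means every template section is present; it says nothing about whether the content of each section is still accurate.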
Keeping Runbooks Current
RUNBOOK MAINTENANCE:
┌─────────────────────────────────────────────────────────────┐
│ KEEPING DOCUMENTATION USEFUL │
├─────────────────────────────────────────────────────────────┤
│ │
│ UPDATE TRIGGERS: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Runbook must be updated when: ││
│ │ ││
│ │ ☐ Incident required procedure not in runbook ││
│ │ ☐ Runbook steps didn't work ││
│ │ ☐ New service deployed ││
│ │ ☐ Infrastructure changed ││
│ │ ☐ Alert thresholds modified ││
│ │ ││
│ │ Action: Create task "Update runbook: [name]" ││
│ │ Assign to: Person who discovered gap ││
│ │ Due: End of on-call rotation ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ POST-INCIDENT REQUIREMENT: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Every incident review must answer: ││
│ │ ││
│ │ 1. Was there a runbook for this? ││
│ │ - No → Create one ││
│ │ - Yes → Did it help? ││
│ │ ││
│ │ 2. Would an updated runbook have helped? ││
│ │ - Yes → Add improvement to action items ││
│ │ ││
│ │ 3. Link runbook to incident report ││
│ │ - Shows pattern of issues ││
│ │ - Helps next person ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ QUARTERLY REVIEW: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Scheduled task: "Runbook Review Q4" ││
│ │ ││
│ │ For each runbook: ││
│ │ ☐ Still accurate? ││
│ │ ☐ Commands still work? ││
│ │ ☐ Links still valid? ││
│ │ ☐ Escalation contacts current? ││
│ │ ☐ Referenced in any incidents recently? ││
│ │ ││
│ │ Archive obsolete runbooks (don't delete - history) ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
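The quarterly review is easier to scope if stale runbooks are flagged first. The sketch below assumes each runbook ends with a footer in the template's "Last updated: Dec 16, 2024 by @maria" form and treats 90 days as the staleness limit; both the limit and the function names are assumptions for illustration.

```python
import re
from datetime import date, datetime, timedelta

# Matches dates such as "Dec 16, 2024" after the "Last updated:" label.
LAST_UPDATED = re.compile(r"Last updated:\s*([A-Z][a-z]{2} \d{1,2}, \d{4})")


def last_updated(runbook_text):
    """Parse the 'Last updated' footer date; return None if it is absent."""
    match = LAST_UPDATED.search(runbook_text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%b %d, %Y").date()


def is_stale(runbook_text, today, max_age_days=90):
    """True when the footer is missing or older than max_age_days."""
    updated = last_updated(runbook_text)
    return updated is None or (today - updated) > timedelta(days=max_age_days)


if __name__ == "__main__":
    footer = "*Last updated: Dec 16, 2024 by @maria*"
    print(last_updated(footer), is_stale(footer, date(2025, 4, 1)))
```

A runbook without a footer is treated as stale on purpose, so undated documents surface in the review instead of being skipped.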
Escalation Paths
Clear Escalation Rules
ESCALATION STRUCTURE:
┌─────────────────────────────────────────────────────────────┐
│ WHEN AND HOW TO ESCALATE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ESCALATION TRIGGERS: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Escalate to SECONDARY when: ││
│ │ • Primary unreachable for 10 minutes ││
│ │ • Primary needs help (explicit request) ││
│ │ • Issue outside primary's expertise ││
│ │ ││
│ │ Escalate to MANAGER when: ││
│ │ • P1/Critical incident ││
│ │ • Customer-facing outage > 30 minutes ││
│ │ • Primary and secondary both stuck ││
│ │ • Resource decisions needed (spend money, wake people) ││
│ │ ││
│ │ Escalate to EXECUTIVE when: ││
│ │ • Major outage affecting revenue ││
│ │ • Security incident with data exposure ││
│ │ • External communication needed (press, customers) ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ ESCALATION CONTACT LIST: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ In NoteVault: "Escalation Contacts" ││
│ │ ││
│ │ PRIMARY ON-CALL: Check schedule ││
│ │ SECONDARY: Check schedule ││
│ │ ││
│ │ TEAM LEADS (rotate monthly): ││
│ │ • Dec: @manager-sarah - phone, Slack @sarah ││
│ │ • Jan: @manager-tom - phone, Slack @tom ││
│ │ ││
│ │ EXECUTIVE (P1 only): ││
│ │ • @cto-james - phone (emergencies only), Slack @james ││
│ │ ││
│ │ SPECIALISTS: ││
│ │ • Database: @dba-lisa ││
│ │ • Security: @security-mike ││
│ │ • Infrastructure: @infra-pat ││
│ │ ││
│ │ Update this list monthly! ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ ESCALATION ETIQUETTE: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ When escalating: ││
│ │ ││
│ │ ✅ Provide context first: ││
│ │ "P2 incident, payment timeouts for 15 min, ││
│ │ tried X and Y, need help with database" ││
│ │ ││
│ │ ✅ Be specific about what you need: ││
│ │ "Need someone with database access" not ││
│ │ "Something's broken" ││
│ │ ││
│ │ ✅ Choose appropriate channel: ││
│ │ • Slack first (can respond when ready) ││
│ │ • Phone for P1 or no Slack response in 5 min ││
│ │ ││
│ │ ❌ Don't: Page everyone at once ││
│ │ ❌ Don't: Escalate without trying first ││
│ │ ❌ Don't: Skip levels unless P1 ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
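The escalation triggers above are concrete enough to write down as a decision function, which can be handy for a chat bot or a checklist tool. This is a sketch under stated assumptions: the `IncidentState` fields are hypothetical names for the listed triggers, and the function returns only the highest level that applies.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    severity: str                           # "P1", "P2", ...
    primary_unreachable_min: int = 0
    primary_requested_help: bool = False
    outside_primary_expertise: bool = False
    customer_outage_min: int = 0
    responders_stuck: bool = False          # primary and secondary both stuck
    needs_resource_decision: bool = False   # spend money, wake people
    revenue_impacting_outage: bool = False
    data_exposure: bool = False
    needs_external_comms: bool = False


def escalation_target(state):
    """Return the highest escalation level the triggers call for, or None."""
    if (state.revenue_impacting_outage or state.data_exposure
            or state.needs_external_comms):
        return "executive"
    if (state.severity == "P1" or state.customer_outage_min > 30
            or state.responders_stuck or state.needs_resource_decision):
        return "manager"
    if (state.primary_unreachable_min >= 10 or state.primary_requested_help
            or state.outside_primary_expertise):
        return "secondary"
    return None


if __name__ == "__main__":
    print(escalation_target(IncidentState("P2", primary_unreachable_min=12)))
    print(escalation_target(IncidentState("P2", customer_outage_min=45)))
```

Checking the executive triggers first keeps the result unambiguous when several conditions hold at once; the person escalating still supplies the context described under escalation etiquette.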
Sustainability
Preventing Burnout
HEALTHY ON-CALL CULTURE:
┌─────────────────────────────────────────────────────────────┐
│ KEEPING ON-CALL SUSTAINABLE │
├─────────────────────────────────────────────────────────────┤
│ │
│ COMPENSATION: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ On-call burden should be recognized: ││
│ │ ││
│ │ Options: ││
│ │ • Paid standby time (flat rate per shift) ││
│ │ • Paid per incident (especially after-hours) ││
│ │ • Comp time (time off after heavy rotation) ││
│ │ • On-call bonus in salary ││
│ │ ││
│ │ Minimum: Time off after disruptive night incidents ││
│ │ Example: 3am page = start work late or leave early ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ INCIDENT REDUCTION: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ The best on-call is boring on-call: ││
│ │ ││
│ │ Track and fix: ││
│ │ • Noisy alerts (tune thresholds) ││
│ │ • Recurring incidents (fix root cause) ││
│ │ • Flaky systems (improve reliability) ││
│ │ • Missing automation (reduce manual steps) ││
│ │ ││
│ │ After each rotation: ││
│ │ "What incidents could we prevent?" ││
│ │ ││
│ │ Create improvement tasks, prioritize them ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ PROTECTED TIME: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ During on-call week: ││
│ │ ││
│ │ ✅ Reduced sprint commitments (50-70% normal capacity) ││
│ │ ✅ No meetings during night-call recovery ││
│ │ ✅ Permission to work from home ││
│ │ ✅ Time allocated for runbook improvements ││
│ │ ││
│ │ Don't expect on-call person to deliver full sprint ││
│ │ AND handle incidents AND be well-rested ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ FEEDBACK LOOP: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Monthly on-call retro: ││
│ │ ││
│ │ • How was on-call this month? ││
│ │ • Any particularly bad shifts? ││
│ │ • Runbook gaps discovered? ││
│ │ • Alert tuning needed? ││
│ │ • Schedule adjustments? ││
│ │ ││
│ │ Create action items in GitScrum ││
│ │ Assign owners, track completion ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
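The 50-70% capacity guideline translates directly into a planning range. A minimal sketch, assuming the team plans in story points (the unit and the function name are illustrative assumptions):

```python
def on_call_commitment(normal_points, low=0.5, high=0.7):
    """Return the (min, max) sprint commitment for someone on call this sprint."""
    if normal_points < 0:
        raise ValueError("normal_points must not be negative")
    return round(normal_points * low, 1), round(normal_points * high, 1)


if __name__ == "__main__":
    # Someone who normally takes 20 points should plan for 10 to 14.
    print(on_call_commitment(20))  # prints (10.0, 14.0)
```

Planning toward the lower end makes sense for rotations that have recently seen heavy after-hours paging.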