On-Call Rotations | Sustainable Schedules & Runbooks
Create fair on-call schedules with rotation management and runbook documentation in GitScrum. Track incidents, escalation paths, and prevent team burnout.
On-call rotations distribute the responsibility of responding to production incidents across team members, ensuring systems are monitored around the clock while preventing any single person from bearing the entire burden. GitScrum's team management, NoteVault documentation, and task assignment features help teams organize fair rotations, maintain accessible runbooks, track incident workload, and continuously improve on-call processes based on real experience.
Rotation Design
Schedule Patterns
ON-CALL SCHEDULE OPTIONS:

ROTATION PATTERNS

WEEKLY ROTATION:
  Week 1: @maria (primary), @carlos (secondary)
  Week 2: @carlos (primary), @ana (secondary)
  Week 3: @ana (primary), @pedro (secondary)
  Week 4: @pedro (primary), @maria (secondary)

  Handoff: Monday 9am

  Pros: Long enough to get context, fewer handoffs
  Cons: Full week can be draining if busy

  Best for: Smaller teams, lower incident volume

DAILY ROTATION:
  Mon: @maria │ Tue: @carlos │ Wed: @ana
  Thu: @pedro │ Fri: @maria │ Weekend: @carlos

  Handoff: 9am each day

  Pros: Shorter burden, more balanced
  Cons: Many handoffs, context switching

  Best for: High-incident environments

FOLLOW-THE-SUN:
  Americas (9am-5pm EST): @maria, @carlos
  Europe (9am-5pm CET): @ana, @pedro
  Asia (9am-5pm JST): @yuki, @lei

  Handoff: At region shift change

  Pros: No night pages, work-hours only
  Cons: Requires distributed team

  Best for: Global teams

HYBRID (Business/After-Hours):
  Business hours (9am-6pm): Whole team responds
  After hours + weekends: Dedicated on-call person

  Pros: Distributes day load, focused night rotation
  Cons: Needs clear escalation rules

  Best for: Teams with predictable work-hours issues
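The weekly pattern above, where each week's secondary becomes the next week's primary, can be generated rather than maintained by hand. The following is a minimal sketch, not a GitScrum feature: the function name `weekly_rotation` and the start date are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def weekly_rotation(people, start, weeks):
    """Return a list of (week_start, primary, secondary) tuples.

    The secondary for a given week is the person who will be primary
    the following week, so every handoff goes to someone who already
    has context.
    """
    schedule = []
    for i in range(weeks):
        primary = people[i % len(people)]
        secondary = people[(i + 1) % len(people)]
        schedule.append((start + timedelta(weeks=i), primary, secondary))
    return schedule


# Hypothetical team and start date (a Monday), for illustration only.
team = ["maria", "carlos", "ana", "pedro"]
schedule = weekly_rotation(team, date(2024, 12, 16), weeks=4)

for week_start, primary, secondary in schedule:
    print(f"{week_start:%b %d}: @{primary} (primary), @{secondary} (secondary)")
```

Because the indices wrap around with the modulo operator, asking for more weeks than there are people simply repeats the cycle.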
Team Considerations
FAIR ROTATION DESIGN:

BUILDING SUSTAINABLE SCHEDULES

MINIMUM TEAM SIZE:
  For 24/7 coverage without burnout:

  • 4 people minimum: 1 week per month each
  • 6 people better: ~5 days per month each
  • 8 people ideal: 1 week every 2 months

  Rule: No one should be on-call > 25% of time

  If team too small:
  • Share rotation across teams
  • Consider on-call as paid overtime
  • Invest in reducing incidents

EXPERIENCE DISTRIBUTION:
  Pairing junior + senior:

  Week 1: @senior-maria (primary), @junior-tom (shadow)
  Week 2: @junior-tom (primary), @senior-carlos (backup)

  Progression path:
  1. Shadow (observe, learn)
  2. Primary with senior backup
  3. Full primary

  Never: Junior alone without escalation path

ACCOMMODATIONS:
  Support different needs:

  • Parents with young kids: Avoid overnight shifts
  • Timezone constraints: Match to working hours
  • Vacation/holidays: Plan swaps in advance
  • Health/mental health: Opt-out without stigma

  Track in GitScrum:
  • Note availability constraints in user settings
  • Use team calendar for visibility
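The 25% rule is simple arithmetic: in an evenly shared rotation, each person is primary for 1/N of the time. A small sketch of that check follows. The function names are illustrative assumptions, and the calculation counts primary duty only, so a person who is also secondary carries more than this figure suggests.

```python
def on_call_share(team_size: int) -> float:
    """Fraction of time each person is primary in an even rotation."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return 1 / team_size


def within_limit(team_size: int, limit: float = 0.25) -> bool:
    """True if each person's primary share does not exceed the limit."""
    return on_call_share(team_size) <= limit


for size in (3, 4, 6, 8):
    share = on_call_share(size)
    print(f"{size} people: {share:.1%} primary, within 25% rule: {within_limit(size)}")
```

A team of three fails the check, which is the case where the guidance above suggests sharing the rotation across teams.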
Tracking in GitScrum
Schedule Management
ORGANIZING ON-CALL IN GITSCRUM:

TRACKING ROTATIONS

CURRENT ON-CALL VISIBILITY:
  NoteVault: "On-Call Schedule"

  # Current On-Call

  **This Week (Dec 16-22):**
  - Primary: @maria
  - Secondary: @carlos

  **Next Week (Dec 23-29):**
  - Primary: @carlos
  - Secondary: @ana

  ## Full Rotation
  | Week         | Primary | Secondary |
  |--------------|---------|-----------|
  | Dec 16-22    | Maria   | Carlos    |
  | Dec 23-29    | Carlos  | Ana       |
  | Dec 30-Jan 5 | Ana     | Pedro     |
  | Jan 6-12     | Pedro   | Maria     |

  Pin this note to project for easy access

SHIFT HANDOFF CHECKLIST:
  Create recurring task: "On-Call Handoff"
  Every Monday at 9am

  Checklist:
  ☐ Outgoing: Post handoff summary in Discussions
  ☐ Outgoing: Note any ongoing issues
  ☐ Incoming: Confirm pager/phone is working
  ☐ Incoming: Review recent incidents
  ☐ Incoming: Check scheduled maintenance windows
  ☐ Both: Acknowledge handoff in #on-call channel

SWAP REQUESTS:
  Task: "On-Call Swap Request"

  Template:
  Requester: @maria
  Shift: Dec 23-29
  Reason: Holiday travel
  Proposed swap with: @pedro (week of Jan 6)
  Status: Pending approval

  Use labels: oncall/swap-request
  Assign to: Rotation manager
Incident Load Tracking
MEASURING ON-CALL BURDEN:

WORKLOAD VISIBILITY

INCIDENT LOGGING:
  Every incident = task with labels:

  Task: "Incident: Database connection timeout"
  Labels:
  • type/incident
  • severity/P2
  • oncall/after-hours (or oncall/business-hours)

  Assigned to: On-call person who responded

  Track in description:
  • Time paged: 2:34 AM
  • Time acknowledged: 2:37 AM
  • Time resolved: 3:15 AM
  • Sleep disruption: Yes

WEEKLY SUMMARY:
  Post in Discussions after each rotation:

  "On-Call Summary: Week of Dec 16"
  On-call: @maria

  Stats:
  • Total incidents: 5
  • After-hours pages: 2
  • Sleep disruptions: 1 (3am, Tuesday)
  • Time spent: ~4 hours

  Notable issues:
  • Payment service timeouts (link to incident)

  Improvements needed:
  • Update runbook for payment issues
  • Add alert for connection pool exhaustion

MONTHLY METRICS:
  Track over time:

  Person  │ Shifts │ Pages │ After-hrs │ Avg Time
  ────────┼────────┼───────┼───────────┼─────────
  Maria   │ 4      │ 12    │ 3         │ 1.5h
  Carlos  │ 4      │ 18    │ 7         │ 2.1h
  Ana     │ 4      │ 8     │ 2         │ 1.0h
  Pedro   │ 4      │ 15    │ 5         │ 1.8h

  Investigate: Why is Carlos getting more pages?
  (Maybe shift timing aligns with problematic window)
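The timestamps logged on each incident task are enough to derive the monthly figures. The sketch below assumes the incident data has already been copied out of the task descriptions into Python objects. The `Incident` class and `summarize` function are hypothetical names for illustration, not part of any GitScrum API, and the calendar date is invented around the times shown in the example above.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    responder: str
    paged: datetime
    acknowledged: datetime
    resolved: datetime
    after_hours: bool

    @property
    def minutes_to_ack(self) -> float:
        return (self.acknowledged - self.paged).total_seconds() / 60

    @property
    def minutes_to_resolve(self) -> float:
        return (self.resolved - self.paged).total_seconds() / 60


def summarize(incidents):
    """Group incidents by responder and compute per-person totals."""
    grouped = defaultdict(list)
    for inc in incidents:
        grouped[inc.responder].append(inc)
    return {
        person: {
            "pages": len(items),
            "after_hours": sum(i.after_hours for i in items),
            "avg_minutes": sum(i.minutes_to_resolve for i in items) / len(items),
        }
        for person, items in grouped.items()
    }


# The 2:34 AM page from the logging example; the date itself is made up.
example = Incident(
    responder="maria",
    paged=datetime(2024, 12, 17, 2, 34),
    acknowledged=datetime(2024, 12, 17, 2, 37),
    resolved=datetime(2024, 12, 17, 3, 15),
    after_hours=True,
)
summary = summarize([example])
print(example.minutes_to_ack, example.minutes_to_resolve)  # 3.0 41.0
```

Comparing `pages` and `after_hours` across people is what surfaces imbalances like the one flagged for Carlos.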
Runbook Management
Documentation Structure
RUNBOOK ORGANIZATION:

ON-CALL RUNBOOKS IN NOTEVAULT

FOLDER STRUCTURE:
  Runbooks/
  ├── Getting Started.md
  ├── Alert Reference/
  │   ├── Database Alerts.md
  │   ├── API Alerts.md
  │   ├── Payment Alerts.md
  │   └── Infrastructure Alerts.md
  ├── Common Procedures/
  │   ├── Restart Services.md
  │   ├── Database Failover.md
  │   ├── Rollback Deployment.md
  │   └── Escalation Guide.md
  └── Post-Incident/
      ├── Template.md
      └── [incident reports...]

RUNBOOK TEMPLATE:
  # [Alert Name] Runbook

  ## What is this alert?
  Brief explanation of what triggered it.

  ## Who gets paged?
  Primary on-call, escalate to [team] if unresolved.

  ## Severity
  P2 - Service degraded but functional

  ## Quick check
  1. Is this a false positive? Check [dashboard]
  2. Is deployment in progress? Check [deploy status]

  ## Resolution steps
  1. Step one with command examples
  2. Step two with what to check
  3. If X, do Y. If Z, escalate.

  ## Escalation
  If not resolved in 15 min, page secondary.
  If critical, page @manager.

  ## Recent incidents
  - Dec 15: [link to incident report]
  - Nov 28: [link to incident report]

  *Last updated: Dec 16, 2024 by @maria*
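A template is only useful if runbooks actually follow it. One way to check is to look for the template's section headings in each runbook's Markdown text. This is a sketch under the assumption that runbooks can be exported or copied as plain Markdown. The `missing_sections` helper and the sample draft are hypothetical.

```python
# Section headings taken from the runbook template above.
REQUIRED_SECTIONS = [
    "What is this alert?",
    "Who gets paged?",
    "Severity",
    "Quick check",
    "Resolution steps",
    "Escalation",
    "Recent incidents",
]


def missing_sections(markdown: str) -> list[str]:
    """Return the template sections that have no '## ' heading in the text."""
    headings = {
        line[3:].strip()
        for line in markdown.splitlines()
        if line.startswith("## ")
    }
    return [section for section in REQUIRED_SECTIONS if section not in headings]


# An incomplete draft, invented for illustration.
draft = """# Payment Timeout Runbook

## What is this alert?
Payment service requests are timing out.

## Severity
P2 - Service degraded but functional

## Resolution steps
1. Check the payment dashboard.
"""

print(missing_sections(draft))
# ['Who gets paged?', 'Quick check', 'Escalation', 'Recent incidents']
```

The check is deliberately strict about heading text, so a renamed section shows up as missing and prompts a review.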
Keeping Runbooks Current
RUNBOOK MAINTENANCE:

KEEPING DOCUMENTATION USEFUL

UPDATE TRIGGERS:
  Runbook must be updated when:

  • Incident required procedure not in runbook
  • Runbook steps didn't work
  • New service deployed
  • Infrastructure changed
  • Alert thresholds modified

  Action: Create task "Update runbook: [name]"
  Assign to: Person who discovered gap
  Due: End of on-call rotation

POST-INCIDENT REQUIREMENT:
  Every incident review must answer:

  1. Was there a runbook for this?
     - No → Create one
     - Yes → Did it help?

  2. Would updated runbook have helped?
     - Yes → Add improvement to action items

  3. Link runbook to incident report
     - Shows pattern of issues
     - Helps next person

QUARTERLY REVIEW:
  Scheduled task: "Runbook Review Q4"

  For each runbook:
  ☐ Still accurate?
  ☐ Commands still work?
  ☐ Links still valid?
  ☐ Escalation contacts current?
  ☐ Referenced in any incidents recently?

  Archive obsolete runbooks (don't delete - history)
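The quarterly review goes faster when it starts from a list of runbooks that have not been touched in a while. The sketch below assumes the "Last updated" date of each runbook has been collected into a dictionary. The 90-day threshold is an assumed stand-in for "one quarter", and the runbook dates are invented.

```python
from datetime import date, timedelta


def stale_runbooks(last_updated, today, max_age_days=90):
    """Return names of runbooks last updated more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, updated in last_updated.items() if updated < cutoff)


# Hypothetical "Last updated" dates, for illustration only.
runbooks = {
    "Database Alerts.md": date(2024, 12, 16),
    "Restart Services.md": date(2024, 6, 3),
}

print(stale_runbooks(runbooks, today=date(2025, 1, 6)))
# ['Restart Services.md']
```

A stale date does not prove a runbook is wrong. It only marks it as a candidate for the checklist above.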
Escalation Paths
Clear Escalation Rules
ESCALATION STRUCTURE:

WHEN AND HOW TO ESCALATE

ESCALATION TRIGGERS:
  Escalate to SECONDARY when:
  • Primary unreachable for 10 minutes
  • Primary needs help (explicit request)
  • Issue outside primary's expertise

  Escalate to MANAGER when:
  • P1/Critical incident
  • Customer-facing outage > 30 minutes
  • Primary and secondary both stuck
  • Resource decisions needed (spend money, wake people)

  Escalate to EXECUTIVE when:
  • Major outage affecting revenue
  • Security incident with data exposure
  • External communication needed (press, customers)

ESCALATION CONTACT LIST:
  In NoteVault: "Escalation Contacts"

  PRIMARY ON-CALL: Check schedule
  SECONDARY: Check schedule

  TEAM LEADS (rotate monthly):
  • Dec: @manager-sarah - phone, Slack @sarah
  • Jan: @manager-tom - phone, Slack @tom

  EXECUTIVE (P1 only):
  • @cto-james - phone (emergencies only), Slack @james

  SPECIALISTS:
  • Database: @dba-lisa
  • Security: @security-mike
  • Infrastructure: @infra-pat

  Update this list monthly!

ESCALATION ETIQUETTE:
  When escalating:

  ✅ Provide context first:
     "P2 incident, payment timeouts for 15 min,
     tried X and Y, need help with database"

  ✅ Be specific about what you need:
     "Need someone with database access" not
     "Something's broken"

  ✅ Choose appropriate channel:
     • Slack first (can respond when ready)
     • Phone for P1 or no Slack response in 5 min

  ❌ Don't: Page everyone at once
  ❌ Don't: Escalate without trying first
  ❌ Don't: Skip levels unless P1
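Escalation triggers are easier to apply at 3am when they are written down as explicit conditions. The sketch below encodes a subset of the triggers listed above as a pure function. The `IncidentState` fields and the `escalation_targets` name are assumptions made for illustration, and judgment calls such as "issue outside primary's expertise" are folded into the explicit help-request flag.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    severity: str                          # e.g. "P1", "P2"
    minutes_primary_unreachable: int = 0
    primary_requested_help: bool = False
    customer_outage_minutes: int = 0
    responders_stuck: bool = False         # primary and secondary both stuck
    revenue_impact: bool = False
    data_exposure: bool = False


def escalation_targets(state: IncidentState) -> list[str]:
    """Return the escalation levels whose triggers are currently met."""
    targets = []
    if state.minutes_primary_unreachable >= 10 or state.primary_requested_help:
        targets.append("secondary")
    if (
        state.severity == "P1"
        or state.customer_outage_minutes > 30
        or state.responders_stuck
    ):
        targets.append("manager")
    if state.revenue_impact or state.data_exposure:
        targets.append("executive")
    return targets


print(escalation_targets(IncidentState("P2", minutes_primary_unreachable=10)))
# ['secondary']
print(escalation_targets(IncidentState("P1", data_exposure=True)))
# ['manager', 'executive']
```

Returning a list rather than a single level keeps the levels in order, which matches the advice not to skip levels unless the incident is a P1.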
Sustainability
Preventing Burnout
HEALTHY ON-CALL CULTURE:

KEEPING ON-CALL SUSTAINABLE

COMPENSATION:
  On-call burden should be recognized:

  Options:
  • Paid standby time (flat rate per shift)
  • Paid per incident (especially after-hours)
  • Comp time (time off after heavy rotation)
  • On-call bonus in salary

  Minimum: Time off after disruptive night incidents
  Example: 3am page = start work late or leave early

INCIDENT REDUCTION:
  Best on-call is boring on-call:

  Track and fix:
  • Noisy alerts (tune thresholds)
  • Recurring incidents (fix root cause)
  • Flaky systems (improve reliability)
  • Missing automation (reduce manual steps)

  After each rotation:
  "What incidents could we prevent?"

  Create improvement tasks, prioritize them

PROTECTED TIME:
  During on-call week:

  ✅ Reduced sprint commitments (50-70% normal capacity)
  ✅ No meetings during night-call recovery
  ✅ Permission to work from home
  ✅ Time allocated for runbook improvements

  Don't expect on-call person to deliver full sprint
  AND handle incidents AND be well-rested

FEEDBACK LOOP:
  Monthly on-call retro:

  • How was on-call this month?
  • Any particularly bad shifts?
  • Runbook gaps discovered?
  • Alert tuning needed?
  • Schedule adjustments?

  Create action items in GitScrum
  Assign owners, track completion