Incident Response for Dev Teams | Roles, Process, On-Call
Minimize downtime with structured incident response: severity levels, defined roles, clear communication, and blameless postmortems. GitScrum tracks incidents.
When things break, a defined process keeps the response fast and coordinated. GitScrum lets teams track incidents alongside development work and link each fix to the incident that triggered it.
Incident Basics
What Is an Incident
INCIDENT DEFINITION:
AN INCIDENT IS:
  Unplanned interruption to service
  Significant degradation of service quality
  Breach of SLA or SLO

SEVERITY LEVELS:

  SEV-1 (CRITICAL):
    • Complete service outage
    • Major feature broken for all users
    • Data loss or security breach
    Response: All hands, 24/7

  SEV-2 (HIGH):
    • Partial outage or degradation
    • Major feature broken for some users
    • Workaround exists but painful
    Response: Immediate during work hours

  SEV-3 (MEDIUM):
    • Minor feature broken
    • Performance degraded but usable
    • Small subset of users affected
    Response: Next business day

  SEV-4 (LOW):
    • Cosmetic issues
    • Minor inconvenience
    Response: Scheduled work

GOAL: Restore service ASAP, learn after
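The severity levels above can be encoded so that triage gives the same answer no matter who is on call. The following is a minimal sketch, not a definitive rule set: the `Impact` fields and the `classify` function are hypothetical names chosen for illustration, and the decision order is one reasonable reading of the levels listed above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # critical
    SEV2 = 2  # high
    SEV3 = 3  # medium
    SEV4 = 4  # low


# Response expectations taken from the severity levels in this article.
RESPONSE = {
    Severity.SEV1: "All hands, 24/7",
    Severity.SEV2: "Immediate during work hours",
    Severity.SEV3: "Next business day",
    Severity.SEV4: "Scheduled work",
}


@dataclass
class Impact:
    """Hypothetical triage inputs; the field names are illustrative."""
    full_outage: bool = False
    data_loss_or_breach: bool = False
    major_feature_broken: bool = False
    all_users_affected: bool = False
    partial_outage: bool = False
    minor_feature_broken: bool = False
    performance_degraded: bool = False


def classify(impact: Impact) -> Severity:
    """Return the most severe level whose criteria match."""
    if impact.full_outage or impact.data_loss_or_breach:
        return Severity.SEV1
    if impact.major_feature_broken:
        # Broken for everyone is SEV-1; broken for some users is SEV-2.
        return Severity.SEV1 if impact.all_users_affected else Severity.SEV2
    if impact.partial_outage:
        return Severity.SEV2
    if impact.minor_feature_broken or impact.performance_degraded:
        return Severity.SEV3
    # Cosmetic issues and minor inconveniences fall through to SEV-4.
    return Severity.SEV4
```

For example, `RESPONSE[classify(Impact(partial_outage=True))]` yields the SEV-2 response expectation. Checking the most severe criteria first means an incident that matches several levels is never under-classified.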
Incident Phases
Response Workflow
INCIDENT LIFECYCLE:
PHASE 1: DETECT
  SOURCES:
    • Automated monitoring alerts
    • Customer reports
    • Internal team notices
  ACTION:
    Create incident record immediately
    Don't wait to confirm

PHASE 2: TRIAGE
  QUESTIONS:
    • What's the impact?
    • Who's affected?
    • What's the severity?
  ACTION:
    Assign severity
    Page appropriate responders
    Open incident channel

PHASE 3: RESPOND
  INVESTIGATE:
    • Check logs, metrics, recent changes
    • Form hypotheses
    • Test and validate
  FIX:
    • Apply remediation (restart, rollback, patch)
    • Verify service restored
    • Monitor for recurrence

PHASE 4: RECOVER
    • Confirm service stable
    • Close incident
    • Schedule postmortem
    • Communicate resolution

PHASE 5: LEARN
    • Conduct postmortem
    • Document findings
    • Create follow-up tasks
    • Share learnings
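The five phases above behave like a small state machine, which an incident record can enforce so that no phase is skipped. This is a sketch under stated assumptions: the `Incident` class and the `NEXT` table are hypothetical, and the fallback from RECOVER to RESPOND (for a problem that recurs while the team is monitoring) is an assumption added here, not something the lifecycle above spells out.

```python
from enum import Enum


class Phase(Enum):
    DETECT = "detect"
    TRIAGE = "triage"
    RESPOND = "respond"
    RECOVER = "recover"
    LEARN = "learn"


# Allowed transitions. RECOVER -> RESPOND is an assumed fallback for
# recurrence; every other transition follows the lifecycle order.
NEXT = {
    Phase.DETECT: {Phase.TRIAGE},
    Phase.TRIAGE: {Phase.RESPOND},
    Phase.RESPOND: {Phase.RECOVER},
    Phase.RECOVER: {Phase.LEARN, Phase.RESPOND},
    Phase.LEARN: set(),
}


class Incident:
    """Hypothetical incident record that tracks its lifecycle phase."""

    def __init__(self, title: str):
        self.title = title
        # An incident record is created as soon as something is detected.
        self.phase = Phase.DETECT
        self.history = [Phase.DETECT]

    def advance(self, to: Phase) -> None:
        """Move to the given phase, rejecting skipped or backward steps."""
        if to not in NEXT[self.phase]:
            raise ValueError(
                f"cannot move from {self.phase.name} to {to.name}"
            )
        self.phase = to
        self.history.append(to)
```

Calling `advance(Phase.RECOVER)` on a freshly created incident raises `ValueError`, which is the point: triage and response have to be recorded before an incident can be closed.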
Roles During Incidents
Incident Team
INCIDENT ROLES:
INCIDENT COMMANDER (IC):
  RESPONSIBILITIES:
    • Coordinates response
    • Makes decisions when unclear
    • Ensures communication happens
    • Delegates tasks
  DOES NOT:
    • Debug the issue (usually)
    • Write code
  SAYS:
    "Alex, investigate database connections"
    "Jordan, update status page"
    "What's our current theory?"

TECHNICAL RESPONDERS:
  RESPONSIBILITIES:
    • Investigate root cause
    • Propose and implement fixes
    • Report findings to IC
  SAYS:
    "I see connection pool exhausted in logs"
    "Restarting service now"
    "Fix deployed, monitoring"

COMMUNICATIONS LEAD:
  RESPONSIBILITIES:
    • Update status page
    • Communicate with customers
    • Keep stakeholders informed
  For smaller teams: IC handles this

SCRIBE:
  RESPONSIBILITIES:
    • Document timeline
    • Record actions taken
    • Note key findings
  Essential for postmortem accuracy
Communication
During Incidents
INCIDENT COMMUNICATION:
INTERNAL COMMUNICATION:
  #incident-2025-01-21-api-outage

  14:32 [IC] Incident declared - API returning 503s
        Severity: SEV-1
        Impact: All users affected
  14:35 [Alex] Investigating. DB connections look high.
  14:38 [IC] Alex continue DB. Jordan check app logs.
  14:42 [Alex] Connection pool exhausted. Recent deploy added query without closing connections.
  14:45 [IC] Decision: Rollback to previous version. Alex proceed with rollback.
  14:48 [Alex] Rollback complete. Monitoring.
  14:55 [IC] Service restored. Error rate back to normal. Keeping incident open for 15 min.
  15:10 [IC] Incident resolved. Postmortem scheduled for tomorrow 10 AM.

EXTERNAL COMMUNICATION:
  STATUS PAGE UPDATES:

  14:35 - Investigating
    "We are investigating reports of API errors. Some users may experience issues accessing the service. We will provide updates as we learn more."

  14:50 - Identified
    "We have identified the cause and are deploying a fix. Service should be restored within 15 minutes."

  15:10 - Resolved
    "The issue has been resolved. API is fully operational. We apologize for any inconvenience."

UPDATE FREQUENCY:
  Every 15-30 min during active incident
  Even if just "Still investigating, no new info"
  Silence is worse than a "no news" update
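The update-frequency guidance above is easy to forget mid-incident, so a small reminder check can help. The sketch below is illustrative only: `update_overdue`, `next_update_due`, and `holding_message` are hypothetical helpers, and the 30-minute limit is the upper end of the 15-30 minute range given above.

```python
from datetime import datetime, timedelta

# Upper end of the 15-30 minute guidance; tighten it for severe incidents.
MAX_GAP = timedelta(minutes=30)


def update_overdue(last_update: datetime, now: datetime,
                   max_gap: timedelta = MAX_GAP) -> bool:
    """True when more than max_gap has passed since the last update."""
    return now - last_update > max_gap


def next_update_due(last_update: datetime,
                    max_gap: timedelta = MAX_GAP) -> datetime:
    """Latest time by which the next update should be posted."""
    return last_update + max_gap


def holding_message(now: datetime) -> str:
    """A 'no news' update, which is still better than silence."""
    return f"{now:%H:%M} - Still investigating, no new info"
```

With the last status page update at 14:35, `next_update_due` returns 15:05. If nothing new is known by then, posting `holding_message` keeps customers and stakeholders informed.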
Blameless Postmortems
Learning from Incidents
BLAMELESS POSTMORTEM:
CORE PRINCIPLE:
  Focus on WHAT happened, not WHO caused it
  People did the best they could with information they had
  Blame creates fear, fear hides problems

POSTMORTEM STRUCTURE:

  INCIDENT POSTMORTEM
  Date: January 21, 2025
  Duration: 25 minutes (14:30-14:55)
  Severity: SEV-1
  Impact: 100% of users, ~$15K revenue loss

  SUMMARY:
    API returned 503 errors for 25 minutes due to database connection pool exhaustion caused by a query that didn't close connections properly.

  TIMELINE:
    14:30 - Deploy with new query goes out
    14:32 - Alerts fire for 503 errors
    14:32 - Incident declared
    14:42 - Root cause identified
    14:48 - Rollback completed
    14:55 - Service restored
    15:10 - Incident closed

  ROOT CAUSE:
    New database query in user service didn't release connections after use. Under load, pool exhausted.

  CONTRIBUTING FACTORS:
    • No load testing for new query
    • Connection pool monitoring didn't alert
    • Code review didn't catch missing connection close

  WHAT WENT WELL:
    • Quick detection (2 min to alert)
    • Fast rollback capability
    • Clear incident communication

  WHAT COULD BE IMPROVED:
    • Catch this class of bug before production
    • Better connection pool visibility

  ACTION ITEMS:
    ☐ Add linter rule for connection handling (Alex)
    ☐ Add connection pool alert threshold (Jordan)
    ☐ Load test for queries > 10 rows (Team)

ACTION ITEMS → BACKLOG → SCHEDULED
Don't let learnings get lost
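Postmortem durations are easy to get wrong when worked out by hand, so it can help to derive them from the timeline itself. The following sketch uses the timeline entries from the example postmortem above; the metric names and the helper functions `minutes_between`, `time_of`, and `metrics` are assumptions made for illustration, and the events are assumed to fall on a single day.

```python
from datetime import datetime

# Timeline entries copied from the example postmortem.
TIMELINE = [
    ("14:30", "Deploy with new query goes out"),
    ("14:32", "Alerts fire for 503 errors"),
    ("14:32", "Incident declared"),
    ("14:42", "Root cause identified"),
    ("14:48", "Rollback completed"),
    ("14:55", "Service restored"),
    ("15:10", "Incident closed"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def time_of(event: str) -> str:
    """Look up the timestamp of a timeline entry by its exact label."""
    for timestamp, label in TIMELINE:
        if label == event:
            return timestamp
    raise KeyError(event)


def metrics() -> dict:
    """Derive common response metrics, in minutes, from the timeline."""
    start = time_of("Deploy with new query goes out")
    declared = time_of("Incident declared")
    return {
        "time_to_detect": minutes_between(start, time_of("Alerts fire for 503 errors")),
        "time_to_diagnose": minutes_between(declared, time_of("Root cause identified")),
        "time_to_restore": minutes_between(start, time_of("Service restored")),
        "time_to_close": minutes_between(start, time_of("Incident closed")),
    }
```

Running `metrics()` on this timeline gives 2 minutes to detect, 10 minutes from declaration to root cause, 25 minutes to restore service, and 40 minutes from the triggering deploy to incident close.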
On-Call Practices
Sustainable Response
ON-CALL BEST PRACTICES:
ON-CALL ROTATION:
  SCHEDULE:
    Week 1: Alex (primary), Jordan (secondary)
    Week 2: Jordan (primary), Sam (secondary)
    Week 3: Sam (primary), Taylor (secondary)
    Week 4: Taylor (primary), Alex (secondary)
  RULES:
    • Week-long rotations
    • Secondary is backup, not full on-call
    • Handoff meeting at rotation change
    • On-call gets comp time after heavy week

SUSTAINABLE ON-CALL:
  HEALTHY TARGETS:
    • <2 pages per week average
    • <1 night page per month
    • >4 people in rotation
  IF EXCEEDING:
    • Fix noisy alerts
    • Improve system reliability
    • Add more people to rotation
  BURNOUT WARNING SIGNS:
    • Same person always gets paged
    • People dreading on-call
    • High turnover on on-call team

RUNBOOKS:
  Every alert should have a runbook link:

    ALERT: High error rate on API
    RUNBOOK: docs/runbooks/api-high-errors.md

  Runbook contains:
    • What to check first
    • Common causes and fixes
    • When to escalate
    • Who to contact
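The rotation schedule and healthy targets above are mechanical enough to generate and check in code. This is a rough sketch, assuming a simple round-robin in which each week's secondary becomes the next week's primary, as in the example schedule; `build_rotation` and `health_issues` are hypothetical names, and the thresholds are the targets listed above.

```python
def build_rotation(people: list, weeks: int) -> list:
    """Return (week, primary, secondary) tuples in round-robin order.

    The secondary for one week is the primary for the following week,
    which gives a natural handoff at each rotation change.
    """
    schedule = []
    for week in range(weeks):
        primary = people[week % len(people)]
        secondary = people[(week + 1) % len(people)]
        schedule.append((week + 1, primary, secondary))
    return schedule


def health_issues(pages_per_week: float, night_pages_per_month: float,
                  rotation_size: int) -> list:
    """List which healthy on-call targets are currently being missed."""
    issues = []
    if pages_per_week >= 2:  # target: <2 pages per week on average
        issues.append("too many pages per week")
    if night_pages_per_month >= 1:  # target: <1 night page per month
        issues.append("too many night pages")
    if rotation_size <= 4:  # target: >4 people in rotation
        issues.append("rotation too small")
    return issues
```

`build_rotation(["Alex", "Jordan", "Sam", "Taylor"], 4)` reproduces the four-week schedule shown above. An empty list from `health_issues` means all three targets are met; anything else points at the remedies listed under IF EXCEEDING.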