Incident Response Workflow | Severity, Roles, Post-Mortem
Respond to production incidents with structured phases: detect, triage, respond, communicate, resolve, and learn. GitScrum tracks incidents and action items.
7 min read
Incidents happen. What matters is how you respond. Good incident response minimizes customer impact, reduces stress, and creates learning opportunities. Poor response extends outages and burns out teams. This guide covers practical incident response workflows.
Incident Phases
| Phase | Focus | Duration |
|---|---|---|
| Detection | Alert triggered | Minutes |
| Triage | Assess severity | Minutes |
| Response | Fix/mitigate | Variable |
| Communication | Update stakeholders | Ongoing |
| Resolution | Service restored | - |
| Post-mortem | Learn and improve | Days |
Severity Levels
Classification
INCIDENT SEVERITY
─────────────────
P1 - CRITICAL:
─────────────────────────────────────
Impact:
├── Full service outage
├── Major feature completely down
├── Security breach
├── Data loss/corruption
├── All customers affected
└── Business critical
Response:
├── All hands on deck
├── Immediate escalation
├── C-level informed
├── External communication
├── Drop everything
└── Until resolved
P2 - HIGH:
─────────────────────────────────────
Impact:
├── Significant feature impaired
├── Workaround may exist
├── Many customers affected
├── Service degraded
└── Major inconvenience
Response:
├── Dedicated responders
├── Manager informed
├── Customer support aware
├── High priority fix
└── Resolved within hours
P3 - MEDIUM:
─────────────────────────────────────
Impact:
├── Minor feature affected
├── Limited customer impact
├── Workaround available
├── Degraded experience
└── Inconvenient, not critical
Response:
├── Normal priority
├── Resolved within days
├── No escalation needed
├── Standard process
└── Scheduled fix
P4 - LOW:
─────────────────────────────────────
Impact:
├── Cosmetic issues
├── Minimal impact
├── Few customers notice
└── Minor annoyance
Response:
├── Backlog priority
├── Fix when convenient
├── Regular process
└── No urgency
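The classification above can be sketched as a simple rule lookup. This is an illustrative model only; the `Impact` fields and thresholds are assumptions, and real triage involves judgment:

```python
from dataclasses import dataclass
from enum import IntEnum

class Severity(IntEnum):
    P1 = 1  # Critical: full outage, data loss, security breach
    P2 = 2  # High: major feature impaired, many customers affected
    P3 = 3  # Medium: minor feature, workaround available
    P4 = 4  # Low: cosmetic, minimal impact

@dataclass
class Impact:
    full_outage: bool
    data_loss: bool
    customers_affected_pct: float  # 0-100
    workaround_exists: bool

def classify(impact: Impact) -> Severity:
    """Map observed impact to a severity level (rules are illustrative)."""
    if impact.full_outage or impact.data_loss:
        return Severity.P1
    if impact.customers_affected_pct >= 25 and not impact.workaround_exists:
        return Severity.P2
    if impact.customers_affected_pct >= 1:
        return Severity.P3
    return Severity.P4
```

Encoding the rules makes triage decisions consistent across responders, even at 3 a.m.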
Response Process
Structured Response
INCIDENT RESPONSE WORKFLOW
──────────────────────────
DETECTION:
─────────────────────────────────────
How incidents are detected:
├── Automated monitoring alerts
├── Customer reports
├── Internal reports
├── Synthetic monitoring
├── Error rate spikes
└── Multiple signals
TRIAGE:
─────────────────────────────────────
First 5 minutes:
├── What's broken?
├── Who's affected?
├── What's the severity?
├── Who needs to know?
├── Quick assessment
└── Don't debug; triage
ASSEMBLE TEAM:
─────────────────────────────────────
Based on severity:
├── Incident commander (leads response)
├── Technical responders
├── Communications lead
├── Subject matter experts
├── Clear roles
└── Not everyone; the right people
INVESTIGATE:
─────────────────────────────────────
Find the cause:
├── Check recent changes
├── Review logs and metrics
├── Compare to healthy baseline
├── Form hypotheses
├── Test hypotheses
└── Find root cause
MITIGATE:
─────────────────────────────────────
Priority: Restore service
├── Rollback if deploy-related
├── Feature flag disable
├── Scale resources
├── Redirect traffic
├── Temporary fix is OK
├── Permanent fix later
└── Customer first
RESOLVE:
─────────────────────────────────────
Service restored:
├── Verify all systems healthy
├── Monitor for recurrence
├── Communicate resolution
├── Document what happened
├── Schedule post-mortem
└── Incident closed
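The mitigation step prioritizes restoring service over finding the root cause, and the cheapest, most reversible action comes first. A minimal sketch of that decision order, with hypothetical function and input names:

```python
def choose_mitigation(recent_deploy: bool, behind_feature_flag: bool,
                      capacity_exhausted: bool) -> str:
    """Pick the fastest path back to a healthy service.

    Inputs come from triage; order matters, since the most reversible
    action is tried first. The permanent fix happens after recovery.
    """
    if recent_deploy:
        return "rollback"          # undo the last change
    if behind_feature_flag:
        return "disable-flag"      # turn the broken feature off
    if capacity_exhausted:
        return "scale-out"         # add resources or redirect traffic
    return "investigate-further"   # no obvious quick mitigation
```

A decision order like this belongs in the runbook, so the on-call responder doesn't have to invent it under pressure.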
Communication
Stakeholder Updates
INCIDENT COMMUNICATION
──────────────────────
INTERNAL COMMUNICATION:
─────────────────────────────────────
Slack incident channel:
├── Create channel: #inc-2024-01-15-api-outage
├── Post regular updates
├── Tag relevant people
├── Timeline of events
├── Actions being taken
├── Single source of truth
└── Don't scatter info
Update cadence:
├── Every 15 min for P1
├── Every 30 min for P2
├── As needed for P3/P4
├── More frequent is better
└── Stakeholders stay informed
EXTERNAL COMMUNICATION:
─────────────────────────────────────
Status page:
├── Acknowledge incident
├── Description (not technical)
├── Estimated resolution
├── Updates as progress is made
├── Resolution announcement
└── Transparent with customers
Template:
"We are currently experiencing issues
with [service]. We are investigating
and will provide updates.
Impact: [what users experience]
Started: [time]
Status: Investigating
Last update: [time]
Next update in: [duration]"
STAKEHOLDER UPDATES:
─────────────────────────────────────
Leadership updates:
├── Impact summary
├── Customer impact
├── Business impact
├── ETA if known
├── What we're doing
├── Executive summary
└── Don't oversimplify
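The status-page template and update cadence can be filled in programmatically, so updates stay consistent under pressure. A sketch, assuming hypothetical field and function names:

```python
from datetime import datetime

STATUS_TEMPLATE = (
    "We are currently experiencing issues with {service}. "
    "We are investigating and will provide updates.\n"
    "Impact: {impact}\n"
    "Started: {started:%H:%M} UTC\n"
    "Status: {status}\n"
    "Last update: {last_update:%H:%M} UTC\n"
    "Next update in: {next_in} minutes"
)

# Update cadence from the internal-communication guidance above.
CADENCE_MINUTES = {"P1": 15, "P2": 30}  # P3/P4: as needed (default below)

def render_status(service: str, impact: str, severity: str,
                  started: datetime, now: datetime,
                  status: str = "Investigating") -> str:
    """Fill the external status template for one update."""
    next_in = CADENCE_MINUTES.get(severity, 60)
    return STATUS_TEMPLATE.format(service=service, impact=impact,
                                  started=started, status=status,
                                  last_update=now, next_in=next_in)
```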
On-Call
Response Readiness
ON-CALL STRUCTURE
─────────────────
ROTATION:
─────────────────────────────────────
Typical setup:
├── Primary on-call
├── Secondary backup
├── Weekly rotation
├── Fair distribution
├── No single point of failure
└── Documented schedule
EXPECTATIONS:
─────────────────────────────────────
On-call person:
├── Respond to alerts within 15 min
├── Laptop and internet access
├── Reachable during the on-call shift
├── Escalate if needed
├── Don't play hero; ask for help
└── Clear expectations
COMPENSATION:
─────────────────────────────────────
Fair on-call:
├── Extra pay or time off
├── Respect off-hours
├── Don't burn out
├── Rotate fairly
├── Acknowledge the burden
└── Sustainable system
RUNBOOKS:
─────────────────────────────────────
Documented procedures:
├── Common incidents
├── Step-by-step fixes
├── Escalation paths
├── Contact information
├── On-call can handle alone
└── Reduce knowledge dependency
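The weekly primary/secondary rotation described above can be generated mechanically, which removes "whose turn is it?" disputes. A sketch (names and pairing rule are assumptions):

```python
from datetime import date, timedelta

def weekly_rotation(engineers: list[str], start: date, weeks: int):
    """Yield (week_start, primary, secondary) tuples for a fair rotation.

    The secondary is the next person in line, so everyone serves both
    roles equally over a full cycle and there is always a backup.
    """
    n = len(engineers)
    for week in range(weeks):
        primary = engineers[week % n]
        secondary = engineers[(week + 1) % n]
        yield start + timedelta(weeks=week), primary, secondary

# Example: four engineers, four weeks starting Monday Jan 15
schedule = list(weekly_rotation(["Ana", "Ben", "Caro", "Dev"],
                                date(2024, 1, 15), 4))
```

Publishing the generated schedule (calendar, wiki, paging tool) satisfies the "documented schedule" requirement above.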
Post-Mortem
Learning from Incidents
BLAMELESS POST-MORTEM
─────────────────────
PRINCIPLES:
─────────────────────────────────────
Blameless culture:
├── Focus on systems, not people
├── People did their best
├── Ask "how", not "who"
├── Learn, don't punish
├── Psychological safety
├── Honest discussion
└── Improve the system
POST-MORTEM TEMPLATE:
─────────────────────────────────────
1. Incident Summary
   - What happened (1-2 sentences)
   - Impact (users, duration, revenue)
2. Timeline
   - Detection time
   - Key events
   - Resolution time
3. Root Cause
   - What caused the incident
   - 5 Whys analysis
4. What Went Well
   - What helped resolution
   - What we should keep doing
5. What Could Improve
   - What slowed us down
   - What we should change
6. Action Items
   - Specific improvements
   - Owner and due date
TIMELINE EXAMPLE:
─────────────────────────────────────
14:32 - Alert: API error rate >5%
14:35 - On-call acknowledges
14:38 - Incident channel created
14:42 - Root cause identified: bad deploy
14:45 - Rollback initiated
14:48 - Service restored
14:52 - Monitoring confirms recovery
15:00 - Incident closed
Total duration: 28 minutes
Customer impact: 16 minutes
ACTION ITEMS:
─────────────────────────────────────
├── Add canary deployment (owner: Sarah, due: Jan 30)
├── Improve error alerting (owner: Mike, due: Feb 5)
├── Update runbook (owner: Alex, due: Jan 25)
└── Tracked as tasks
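Durations in a timeline like the example above can be computed rather than hand-counted, which keeps the post-mortem numbers honest. A small sketch using same-day `HH:MM` timestamps:

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# From the timeline example: alert at 14:32, restored 14:48, closed 15:00
customer_impact = minutes_between("14:32", "14:48")  # 16 minutes
total_duration = minutes_between("14:32", "15:00")   # 28 minutes
```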
GitScrum Integration
Incident Tracking
GITSCRUM FOR INCIDENTS
──────────────────────
INCIDENT TASKS:
─────────────────────────────────────
Create incident task:
├── Title: [P1] API Outage 2024-01-15
├── Label: incident, severity/p1
├── Assigned: Incident commander
├── Description: Summary and links
├── Action items as subtasks
└── Tracked in backlog
ACTION ITEM TRACKING:
─────────────────────────────────────
Post-mortem actions:
├── Each action is a task
├── Owner assigned
├── Due date set
├── Linked to incident
├── Tracked to completion
└── Accountable follow-through
REPORTING:
─────────────────────────────────────
├── Incident count by severity
├── Time to resolve
├── Action item completion
├── Trends over time
├── Dashboard visibility
└── Improvement metrics
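The reporting metrics above (count by severity, time to resolve) reduce to a small aggregation over your incident history. A sketch using a hypothetical `(severity, minutes_to_resolve)` record shape, not GitScrum's actual data model:

```python
from collections import Counter
from statistics import mean

# Hypothetical incident records: (severity, minutes to resolve)
incidents = [("P1", 28), ("P2", 95), ("P2", 140), ("P3", 400)]

def count_by_severity(incidents):
    """Incident count per severity level."""
    return Counter(sev for sev, _ in incidents)

def mean_time_to_resolve(incidents, severity=None):
    """Average resolution time in minutes, optionally for one severity."""
    times = [m for sev, m in incidents if severity in (None, sev)]
    return mean(times) if times else 0.0
```

Trending these numbers quarter over quarter shows whether the post-mortem action items are actually paying off.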
Best Practices
For Incident Response
Anti-Patterns
INCIDENT RESPONSE MISTAKES:
❌ No severity classification
❌ Hero culture (one person handles everything)
❌ Blame in post-mortems
❌ No communication during incidents
❌ No post-mortems
❌ Action items never completed
❌ Same incidents repeat
❌ On-call burnout