Incident Response Workflow
Incidents happen. What matters is how you respond. Good incident response minimizes customer impact, reduces stress, and creates learning opportunities. Poor response extends outages and burns out teams. This guide covers practical incident response workflows.
Incident Phases
| Phase | Focus | Duration |
|---|---|---|
| Detection | Alert triggered | Minutes |
| Triage | Assess severity | Minutes |
| Response | Fix/mitigate | Variable |
| Communication | Update stakeholders | Ongoing |
| Resolution | Service restored | - |
| Post-mortem | Learn and improve | Days |
Severity Levels
Classification
INCIDENT SEVERITY
═════════════════
P1 - CRITICAL:
─────────────────────────────────────
Impact:
├── Full service outage
├── Major feature completely down
├── Security breach
├── Data loss/corruption
├── All customers affected
└── Business critical
Response:
├── All hands on deck
├── Immediate escalation
├── C-level informed
├── External communication
├── Drop everything
└── Until resolved
P2 - HIGH:
─────────────────────────────────────
Impact:
├── Significant feature impaired
├── Workaround may exist
├── Many customers affected
├── Service degraded
└── Major inconvenience
Response:
├── Dedicated responders
├── Manager informed
├── Customer support aware
├── High priority fix
└── Resolved within hours
P3 - MEDIUM:
─────────────────────────────────────
Impact:
├── Minor feature affected
├── Limited customer impact
├── Workaround available
├── Degraded experience
└── Inconvenient, not critical
Response:
├── Normal priority
├── Resolved within days
├── No escalation needed
├── Standard process
└── Scheduled fix
P4 - LOW:
─────────────────────────────────────
Impact:
├── Cosmetic issues
├── Minimal impact
├── Few customers notice
└── Minor annoyance
Response:
├── Backlog priority
├── Fix when convenient
├── Regular process
└── No urgency
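These levels are easiest to apply consistently when they are encoded somewhere responders can look them up mid-incident. Below is a minimal sketch in Python; the update intervals and resolution targets are illustrative assumptions, not prescriptions.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P1 = "critical"
    P2 = "high"
    P3 = "medium"
    P4 = "low"


@dataclass(frozen=True)
class ResponsePolicy:
    escalate_immediately: bool
    update_interval_minutes: int | None  # None = update as needed
    target_resolution: str


# Example targets only -- tune these to your own service commitments.
POLICIES = {
    Severity.P1: ResponsePolicy(True, 15, "work until resolved, all hands"),
    Severity.P2: ResponsePolicy(True, 30, "within hours"),
    Severity.P3: ResponsePolicy(False, None, "within days"),
    Severity.P4: ResponsePolicy(False, None, "backlog, when convenient"),
}


def policy_for(severity: Severity) -> ResponsePolicy:
    """Look up the response expectations for a given severity."""
    return POLICIES[severity]
```

Keeping the policy table in code or config next to your alerting rules makes it harder for the documented process and the real one to drift apart.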
Response Process
Structured Response
INCIDENT RESPONSE WORKFLOW
══════════════════════════
DETECTION:
─────────────────────────────────────
How incidents are detected:
├── Automated monitoring alerts
├── Customer reports
├── Internal reports
├── Synthetic monitoring
├── Error rate spikes
└── Multiple signals
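Most automated detection reduces to comparing an error rate or latency figure against a threshold or baseline. A minimal sketch, assuming you already have request and error counts for a recent window; the 5% threshold mirrors the timeline example later in this guide and is illustrative only.

```python
def error_rate_alert(errors: int, requests: int, threshold: float = 0.05) -> bool:
    """Return True if the error rate over the window exceeds the threshold."""
    if requests == 0:
        return False  # no traffic in the window, nothing to alert on
    return errors / requests > threshold


# e.g. 620 errors out of 10,000 requests in the last 5 minutes -> 6.2% -> alert
if error_rate_alert(errors=620, requests=10_000):
    print("ALERT: API error rate above 5%")
```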
TRIAGE:
─────────────────────────────────────
First 5 minutes:
├── What's broken?
├── Who's affected?
├── What's the severity?
├── Who needs to know?
├── Quick assessment
└── Don't debug—triage
ASSEMBLE TEAM:
─────────────────────────────────────
Based on severity:
├── Incident commander (leads response)
├── Technical responders
├── Communications lead
├── Subject matter experts
├── Clear roles
└── Not everyone—right people
INVESTIGATE:
─────────────────────────────────────
Find the cause:
├── Check recent changes
├── Review logs and metrics
├── Compare to healthy baseline
├── Form hypotheses
├── Test hypotheses
└── Find root cause
MITIGATE:
─────────────────────────────────────
Priority: Restore service
├── Rollback if deploy-related
├── Feature flag disable
├── Scale resources
├── Redirect traffic
├── Temporary fix is OK
├── Permanent fix later
└── Customer first
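Mitigation is about the fastest safe action, not the cleanest one. The sketch below shows a feature-flag "kill switch" against a hypothetical in-memory flag store; substitute your real feature-flag client, whose API will differ.

```python
# Hypothetical in-memory flag store -- replace with your real feature-flag client.
class FlagStore:
    def __init__(self) -> None:
        self._flags = {"new-checkout-flow": True}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def get(self, name: str) -> bool:
        return self._flags.get(name, False)


def kill_switch(flags: FlagStore, flag_name: str, reason: str) -> None:
    """Disable a feature flag and record why, so the mitigation is traceable."""
    flags.set(flag_name, False)
    print(f"Disabled {flag_name}: {reason}")


flags = FlagStore()
kill_switch(flags, "new-checkout-flow", "P1 2024-01-15: checkout errors spiking")
```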
RESOLVE:
─────────────────────────────────────
Service restored:
├── Verify all systems healthy
├── Monitor for recurrence
├── Communicate resolution
├── Document what happened
├── Schedule post-mortem
└── Incident closed
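Verification is worth automating, even crudely: poll the affected endpoint for a while after the fix rather than declaring victory on the first green check. A minimal sketch using the `requests` library; the URL, check count, and interval are placeholder assumptions.

```python
import time

import requests


def verify_recovery(url: str, checks: int = 10, interval_s: int = 30) -> bool:
    """Poll a health endpoint; return True only if every check passes."""
    for i in range(checks):
        try:
            ok = requests.get(url, timeout=5).status_code == 200
        except requests.RequestException:
            ok = False
        print(f"check {i + 1}/{checks}: {'healthy' if ok else 'UNHEALTHY'}")
        if not ok:
            return False
        time.sleep(interval_s)
    return True


if verify_recovery("https://api.example.com/health"):
    print("Recovery confirmed, safe to close the incident")
```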
Communication
Stakeholder Updates
INCIDENT COMMUNICATION
══════════════════════
INTERNAL COMMUNICATION:
─────────────────────────────────────
Slack incident channel:
├── Create channel: #inc-2024-01-15-api-outage
├── Post regular updates
├── Tag relevant people
├── Timeline of events
├── Actions being taken
├── Single source of truth
└── Don't scatter info
Update cadence:
├── Every 15 min for P1
├── Every 30 min for P2
├── As needed for P3/P4
├── More frequent is better
└── Keep stakeholders informed
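If your incident channel lives in Slack, channel creation and the first update can be scripted so nobody has to improvise naming under pressure. A sketch assuming the official `slack_sdk` client and a bot token with channel-creation and posting scopes:

```python
from datetime import date

from slack_sdk import WebClient

client = WebClient(token="xoxb-...")  # bot token for your Slack app


def open_incident_channel(slug: str, summary: str) -> str:
    """Create #inc-YYYY-MM-DD-<slug> and post the initial summary."""
    name = f"inc-{date.today().isoformat()}-{slug}"
    channel_id = client.conversations_create(name=name)["channel"]["id"]
    client.chat_postMessage(channel=channel_id, text=f":rotating_light: {summary}")
    return channel_id


open_incident_channel("api-outage", "P1: API error rate above 5%, investigating")
```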
EXTERNAL COMMUNICATION:
─────────────────────────────────────
Status page:
├── Acknowledge incident
├── Description (not technical)
├── Estimated resolution
├── Updates as progress is made
├── Resolution announcement
└── Be transparent with customers
Template:
"We are currently experiencing issues
with [service]. We are investigating
and will provide updates.
Impact: [what users experience]
Started: [time]
Status: Investigating
Last update: [time]
Next update in: [duration]"
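The template above is easy to render programmatically, which keeps wording consistent no matter who posts the update. A small sketch that fills it from a dictionary of fields (the values shown are examples):

```python
STATUS_TEMPLATE = (
    "We are currently experiencing issues with {service}. "
    "We are investigating and will provide updates.\n"
    "Impact: {impact}\n"
    "Started: {started}\n"
    "Status: {status}\n"
    "Last update: {last_update}\n"
    "Next update in: {next_update}"
)

update = STATUS_TEMPLATE.format(
    service="the public API",
    impact="API requests may fail or time out",
    started="14:32 UTC",
    status="Investigating",
    last_update="14:45 UTC",
    next_update="30 minutes",
)
print(update)
```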
STAKEHOLDER UPDATES:
─────────────────────────────────────
Leadership updates:
├── Impact summary
├── Customer impact
├── Business impact
├── ETA if known
├── What we're doing
├── Keep it an executive summary
└── But don't oversimplify
On-Call
Response Readiness
ON-CALL STRUCTURE
═════════════════
ROTATION:
─────────────────────────────────────
Typical setup:
├── Primary on-call
├── Secondary backup
├── Weekly rotation
├── Fair distribution
├── No single point of failure
└── Documented schedule
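A weekly rotation is simple enough to compute rather than maintain by hand, which keeps the documented schedule from drifting out of date. A minimal sketch; the roster and anchor date are placeholder assumptions:

```python
from datetime import date

ENGINEERS = ["sarah", "mike", "alex", "priya"]  # placeholder roster
ROTATION_START = date(2024, 1, 1)               # a Monday; anchors week 0


def on_call_pair(day: date) -> tuple[str, str]:
    """Return (primary, secondary) for the week containing `day`."""
    week = (day - ROTATION_START).days // 7
    primary = ENGINEERS[week % len(ENGINEERS)]
    secondary = ENGINEERS[(week + 1) % len(ENGINEERS)]
    return primary, secondary


print(on_call_pair(date(2024, 1, 15)))  # third rotation week -> ('alex', 'priya')
```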
EXPECTATIONS:
─────────────────────────────────────
On-call person:
├── Respond to alerts in 15 min
├── Laptop and internet access
├── Available during on-call hours
├── Escalate if needed
├── Don't hero—ask for help
└── Clear expectations
COMPENSATION:
─────────────────────────────────────
Fair on-call:
├── Extra pay or time off
├── Respect off-hours
├── Don't burn out
├── Rotate fairly
├── Acknowledge burden
└── Sustainable system
RUNBOOKS:
─────────────────────────────────────
Documented procedures:
├── Common incidents
├── Step-by-step fixes
├── Escalation paths
├── Contact information
├── Written so on-call can handle it alone
└── Reduce knowledge dependency
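Runbooks only help if the on-call person can find the right one from the alert in front of them. One lightweight approach, sketched here with hypothetical alert names, links, and contacts, is to key runbook locations and escalation paths by alert name:

```python
# Hypothetical alert names, links, and contacts -- replace with your own.
RUNBOOKS = {
    "api-error-rate-high": {
        "doc": "https://wiki.example.com/runbooks/api-error-rate",
        "first_steps": ["check the last deploy", "check upstream dependencies"],
        "escalate_to": "#team-api / api-oncall@example.com",
    },
    "db-replication-lag": {
        "doc": "https://wiki.example.com/runbooks/db-replication-lag",
        "first_steps": ["check replica load", "check long-running queries"],
        "escalate_to": "#team-data / dba-oncall@example.com",
    },
}


def runbook_for(alert_name: str) -> dict | None:
    """Map an alert to its runbook entry, if one exists."""
    return RUNBOOKS.get(alert_name)
```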
Post-Mortem
Learning from Incidents
BLAMELESS POST-MORTEM
═════════════════════
PRINCIPLES:
─────────────────────────────────────
Blameless culture:
├── Focus on systems, not people
├── Assume people did their best
├── Ask "how" not "who"
├── Learn, don't punish
├── Psychological safety
├── Honest discussion
└── Improve the system
POST-MORTEM TEMPLATE:
─────────────────────────────────────
1. Incident Summary
- What happened (1-2 sentences)
- Impact (users, duration, revenue)
2. Timeline
- Detection time
- Key events
- Resolution time
3. Root Cause
- What caused the incident
- 5 Whys analysis
4. What Went Well
- What helped resolution
- What we should keep doing
5. What Could Improve
- What slowed us down
- What we should change
6. Action Items
- Specific improvements
- Owner and due date
TIMELINE EXAMPLE:
─────────────────────────────────────
14:32 - Alert: API error rate >5%
14:35 - On-call acknowledges
14:38 - Incident channel created
14:42 - Root cause identified: bad deploy
14:45 - Rollback initiated
14:48 - Service restored
14:52 - Monitoring confirms recovery
15:00 - Incident closed
Total duration: 28 minutes
Customer impact: 16 minutes
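The headline numbers in a post-mortem are just differences between timeline entries, and computing them avoids arithmetic slips under time pressure. A sketch using the example timeline above:

```python
from datetime import datetime

fmt = "%H:%M"
alert = datetime.strptime("14:32", fmt)
restored = datetime.strptime("14:48", fmt)
closed = datetime.strptime("15:00", fmt)

total_minutes = (closed - alert).total_seconds() / 60              # 28 minutes
customer_impact_minutes = (restored - alert).total_seconds() / 60  # 16 minutes

print(f"Total duration: {total_minutes:.0f} minutes")
print(f"Customer impact: {customer_impact_minutes:.0f} minutes")
```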
ACTION ITEMS:
─────────────────────────────────────
├── Add canary deployment (owner: Sarah, due: Jan 30)
├── Improve error alerting (owner: Mike, due: Feb 5)
├── Update runbook (owner: Alex, due: Jan 25)
└── Tracked as tasks
GitScrum Integration
Incident Tracking
GITSCRUM FOR INCIDENTS
══════════════════════
INCIDENT TASKS:
─────────────────────────────────────
Create incident task:
├── Title: [P1] API Outage 2024-01-15
├── Label: incident, severity/p1
├── Assigned: Incident commander
├── Description: Summary and links
├── Action items as subtasks
└── Tracked in backlog
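Task creation can be scripted so the incident task exists before anyone has to remember the naming convention. The sketch below posts to a hypothetical task-tracker REST endpoint; it is not GitScrum's actual API, so check the real API documentation for the correct endpoint and field names.

```python
import requests

# Hypothetical endpoint and payload shape -- consult your tracker's API docs;
# the URL and field names here are illustrative only, not GitScrum's API.
API_URL = "https://tracker.example.com/api/tasks"
API_TOKEN = "..."  # your API token

payload = {
    "title": "[P1] API Outage 2024-01-15",
    "labels": ["incident", "severity/p1"],
    "assignee": "incident-commander",
    "description": "Summary, incident channel link, and post-mortem link",
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=10,
)
response.raise_for_status()
```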
ACTION ITEM TRACKING:
─────────────────────────────────────
Post-mortem actions:
├── Each action is a task
├── Owner assigned
├── Due date set
├── Linked to incident
├── Tracked to completion
└── Accountable follow-through
REPORTING:
─────────────────────────────────────
├── Incident count by severity
├── Time to resolve
├── Action item completion
├── Trends over time
├── Dashboard visibility
└── Improvement metrics
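Most of these reporting numbers fall out of a plain list of closed incidents. A sketch, assuming each record carries a severity plus detection and resolution timestamps (the two records here are illustrative):

```python
from collections import Counter
from datetime import datetime

incidents = [  # illustrative records only
    {"severity": "P1", "detected": datetime(2024, 1, 15, 14, 32),
     "resolved": datetime(2024, 1, 15, 14, 48)},
    {"severity": "P3", "detected": datetime(2024, 1, 20, 9, 0),
     "resolved": datetime(2024, 1, 21, 11, 30)},
]

count_by_severity = Counter(i["severity"] for i in incidents)
durations = [(i["resolved"] - i["detected"]).total_seconds() / 60 for i in incidents]
mean_time_to_resolve = sum(durations) / len(durations)

print(dict(count_by_severity))                    # {'P1': 1, 'P3': 1}
print(f"Mean time to resolve: {mean_time_to_resolve:.0f} minutes")
```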
Best Practices
For Incident Response
- Clear severity levels — Appropriate response
- Defined roles — Who does what
- Communication cadence — Regular updates
- Blameless culture — Learn, not blame
- Action item tracking — Prevent recurrence
Anti-Patterns
INCIDENT RESPONSE MISTAKES:
✗ No severity classification
✗ Hero culture (one person handles all)
✗ Blame in post-mortems
✗ No communication during incidents
✗ No post-mortems
✗ Action items never completed
✗ Same incidents repeat
✗ On-call burnout