Incident Response and Postmortems
Incidents happen. GitScrum helps teams document incidents, track remediation, and capture learnings for continuous improvement.
Incident Response
Response Process
INCIDENT RESPONSE FLOW:
┌─────────────────────────────────────────────────────────────┐
│ │
│ 1. DETECT │
│ ────────── │
│ • Monitoring alert triggers │
│ • Customer reports issue │
│ • Team member notices problem │
│ │
│ 2. TRIAGE │
│ ───────── │
│ • Assess severity (SEV1, SEV2, SEV3) │
│ • Identify impacted services │
│ • Page appropriate on-call │
│ │
│ 3. RESPOND │
│ ────────── │
│ • Assemble incident team │
│ • Open incident channel │
│ • Begin investigation │
│ │
│ 4. MITIGATE │
│ ─────────── │
│ • Focus on restoring service │
│ • Roll back if needed │
│ • Apply temporary fixes │
│ │
│ 5. RESOLVE │
│ ────────── │
│ • Confirm service restored │
│ • Monitor for stability │
│ • Communicate resolution │
│ │
│ 6. LEARN │
│ ──────── │
│ • Schedule postmortem │
│ • Document timeline │
│ • Create action items │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ SEVERITY LEVELS: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SEV1 (Critical): ││
│ │ Complete outage, all users affected ││
│ │ Response: Immediate, all hands ││
│ │ ││
│ │ SEV2 (Major): ││
│ │ Degraded service, many users affected ││
│ │ Response: Within 30 min, primary on-call ││
│ │ ││
│ │ SEV3 (Minor): ││
│ │ Limited impact, workaround exists ││
│ │ Response: Next business day ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
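The severity table above can be turned into a small triage helper. This is a minimal sketch, not a GitScrum feature: the `Impact` fields, the `many_users` cutoff of 25%, and the rule that a limited-impact problem with no workaround is rounded up to SEV2 are all assumptions, since the guide gives no numeric thresholds.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "Critical"
    SEV2 = "Major"
    SEV3 = "Minor"


# Response expectations copied from the severity table above.
RESPONSE = {
    Severity.SEV1: "Immediate, all hands",
    Severity.SEV2: "Within 30 min, primary on-call",
    Severity.SEV3: "Next business day",
}


@dataclass
class Impact:
    """Hypothetical triage input; field names are illustrative."""
    complete_outage: bool      # is the service fully down?
    affected_fraction: float   # share of users affected, 0.0 to 1.0
    workaround_exists: bool


def triage(impact: Impact, many_users: float = 0.25) -> Severity:
    """Map an impact assessment to a severity level.

    `many_users` is an assumed cutoff for "many users affected".
    """
    if impact.complete_outage:
        return Severity.SEV1
    # Assumption: round up to SEV2 when there is no workaround.
    if impact.affected_fraction >= many_users or not impact.workaround_exists:
        return Severity.SEV2
    return Severity.SEV3


if __name__ == "__main__":
    sev = triage(Impact(complete_outage=False, affected_fraction=0.4,
                        workaround_exists=True))
    print(sev.name, "->", RESPONSE[sev])
```

Encoding the rules keeps triage consistent between responders; the cutoff should be tuned to whatever "many users" means for your service.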
Incident Roles
INCIDENT TEAM ROLES:
┌─────────────────────────────────────────────────────────────┐
│ │
│ INCIDENT COMMANDER (IC): │
│ ──────────────────────── │
│ • Coordinates response │
│ • Makes decisions │
│ • Assigns tasks │
│ • Communicates with stakeholders │
│ • Calls for additional help │
│ │
│ TECHNICAL LEAD: │
│ ─────────────── │
│ • Leads investigation │
│ • Directs debugging │
│ • Proposes mitigations │
│ • Implements fixes │
│ │
│ COMMUNICATIONS: │
│ ─────────────── │
│ • Posts status updates │
│ • Updates status page │
│ • Communicates with customers │
│ • Keeps stakeholders informed │
│ │
│ SCRIBE: │
│ ─────── │
│ • Documents timeline │
│ • Records decisions │
│ • Captures what was tried │
│ • Prepares for postmortem │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ ROLE ASSIGNMENT: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ INCIDENT: Payment Processing Down ││
│ │ SEVERITY: SEV1 ││
│ │ TIME: 2025-01-15 14:30 UTC ││
│ │ ││
│ │ IC: @jordan ││
│ │ Tech Lead: @alex ││
│ │ Comms: @sam ││
│ │ Scribe: @pat ││
│ │ ││
│ │ Channel: #incident-2025-01-15-payments ││
│ │ Status Page: https://status.acme.co ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ Clear roles prevent confusion during chaos │
└─────────────────────────────────────────────────────────────┘
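A role assignment like the one above could be captured as a small record that flags unfilled roles before the response starts. This is an illustrative sketch only: the `Incident` class, its field names, and the channel naming pattern (inferred from the single example `#incident-2025-01-15-payments`) are assumptions, not part of any GitScrum API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# The four roles described above.
REQUIRED_ROLES = ("ic", "tech_lead", "comms", "scribe")


@dataclass
class Incident:
    title: str
    severity: str
    started_at: datetime
    slug: str              # short service name used in the channel name
    roles: dict[str, str]  # role key -> handle, e.g. {"ic": "@jordan"}

    def missing_roles(self) -> list[str]:
        """Return required roles that nobody has been assigned to."""
        return [r for r in REQUIRED_ROLES if not self.roles.get(r)]

    def channel(self) -> str:
        """Build a channel name in the style of the example above."""
        return f"#incident-{self.started_at:%Y-%m-%d}-{self.slug}"


if __name__ == "__main__":
    incident = Incident(
        title="Payment Processing Down",
        severity="SEV1",
        started_at=datetime(2025, 1, 15, 14, 30, tzinfo=timezone.utc),
        slug="payments",
        roles={"ic": "@jordan", "tech_lead": "@alex", "comms": "@sam"},
    )
    print(incident.channel())
    print("Unfilled roles:", incident.missing_roles())
```

Checking for missing roles up front is one way to make sure no role, such as the scribe, is silently skipped during a SEV1.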
Postmortem Process
Blameless Postmortems
BLAMELESS POSTMORTEM PRINCIPLES:
┌─────────────────────────────────────────────────────────────┐
│ │
│ CORE PRINCIPLE: │
│ ─────────────── │
│ People don't cause incidents—systems do │
│ Focus on: What allowed this to happen? │
│ │
│ BLAMELESS LANGUAGE: │
│ ─────────────────── │
│ │
│ ❌ "John pushed bad code" │
│ ✅ "The deployment pipeline didn't catch the bug" │
│ │
│ ❌ "Sarah didn't notice the alert" │
│ ✅ "The alert wasn't configured to page" │
│ │
│ ❌ "The team was careless" │
│ ✅ "The process didn't have sufficient safeguards" │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ WHY BLAMELESS: │
│ ────────────── │
│ │
│ BLAME: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ → People hide mistakes ││
│ │ → Less information shared ││
│ │ → Root causes stay hidden ││
│ │ → Incidents recur ││
│ │ → Culture of fear ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ BLAMELESS: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ → People speak openly ││
│ │ → Full context available ││
│ │ → True root causes found ││
│ │ → Systemic fixes implemented ││
│ │ → Culture of learning ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ "If one person's mistake can cause an incident, the system is fragile" │
└─────────────────────────────────────────────────────────────┘
Postmortem Template
POSTMORTEM DOCUMENT:
┌─────────────────────────────────────────────────────────────┐
│ │
│ INCIDENT: Payment Processing Outage │
│ DATE: January 15, 2025 │
│ DURATION: 47 minutes │
│ SEVERITY: SEV1 │
│ AUTHOR: @jordan │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ SUMMARY: │
│ ──────── │
│ Payment processing was unavailable for 47 minutes │
│ due to database connection pool exhaustion following │
│ a traffic spike. │
│ │
│ IMPACT: │
│ ─────── │
│ • ~2,500 failed payment attempts │
│ • Customer support tickets: 143 │
│ • Estimated revenue impact: $45,000 │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ TIMELINE: │
│ ───────── │
│ 14:23 - Traffic spike begins (marketing campaign) │
│ 14:28 - Connection pool warnings in logs │
│ 14:32 - First customer reports failed payment │
│ 14:35 - PagerDuty alert triggers │
│ 14:38 - Incident channel created, IC assigned │
│ 14:45 - Root cause identified (connection exhaustion) │
│ 14:52 - Decision to increase pool size │
│ 15:05 - Configuration deployed │
│ 15:12 - Payments recovering │
│ 15:19 - All systems normal, monitoring │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ ROOT CAUSE: │
│ ─────────── │
│ Database connection pool was sized for normal traffic. │
│ Marketing campaign drove 4x normal traffic without │
│ advance notice to engineering. │
│ │
│ CONTRIBUTING FACTORS: │
│ ───────────────────── │
│ • No auto-scaling for connection pools │
│ • Alert threshold too high (triggered late) │
│ • Marketing/engineering communication gap │
│ • No load testing for high-traffic scenarios │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ WHAT WENT WELL: │
│ ──────────────── │
│ • Incident response was quick once triggered │
│ • Root cause identified within 10 minutes of the alert │
│ • Fix was straightforward │
│ • Customer communication was timely │
│ │
│ WHAT COULD BE IMPROVED: │
│ ──────────────────────── │
│ • Earlier detection (alerts didn't fire soon enough) │
│ • Cross-team communication about traffic-driving events │
│ • Auto-scaling configuration │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ ACTION ITEMS: │
│ ───────────── │
│ 1. Implement auto-scaling for connection pools @alex │
│ Due: Jan 22 │
│ │
│ 2. Lower alert threshold from 80% to 60% @pat │
│ Due: Jan 17 │
│ │
│ 3. Share marketing campaign calendar with engineering @jordan │
│ Due: Jan 19 │
│ │
│ 4. Load test at 5x normal traffic @sam │
│ Due: Jan 31 │
└─────────────────────────────────────────────────────────────┘
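The intervals in a postmortem can be derived from the timeline instead of estimated by hand. The sketch below uses timestamps from the example timeline above; the event key names and the `minutes_between` helper are illustrative. The template does not say which events bound the 47-minute duration, so reading it as "first customer report to all systems normal" is an assumption that happens to match the numbers.

```python
from datetime import datetime

# Selected events from the example timeline above (same day, 24h clock).
TIMELINE = {
    "traffic_spike": "14:23",
    "first_customer_report": "14:32",
    "alert": "14:35",
    "root_cause_identified": "14:45",
    "all_normal": "15:19",
}


def minutes_between(start_event: str, end_event: str) -> int:
    """Whole minutes between two timeline events on the same day."""
    fmt = "%H:%M"
    start = datetime.strptime(TIMELINE[start_event], fmt)
    end = datetime.strptime(TIMELINE[end_event], fmt)
    return int((end - start).total_seconds() // 60)


if __name__ == "__main__":
    print("Customer-visible duration:",
          minutes_between("first_customer_report", "all_normal"), "min")
    print("Spike to alert:",
          minutes_between("traffic_spike", "alert"), "min")
    print("Alert to root cause:",
          minutes_between("alert", "root_cause_identified"), "min")
```

Computing these from the scribe's timeline keeps the summary, the impact section, and later metrics consistent with each other.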
Action Item Tracking
Following Through
POSTMORTEM ACTION TRACKING:
┌─────────────────────────────────────────────────────────────┐
│ │
│ COMMON FAILURE: │
│ ─────────────── │
│ Postmortems happen, but action items get lost │
│ The same incident recurs 3 months later │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ TRACKING IN GITSCRUM: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ PROJECT: Reliability ││
│ │ ││
│ │ POSTMORTEM ACTION ITEMS ││
│ │ ───────────────────────── ││
│ │ ││
│ │ ┌───────────────────────────────────────────────────────┐││
│ │ │ [INC-456] Auto-scaling for connection pools │││
│ │ │ Status: In Progress │││
│ │ │ Source: Postmortem 2025-01-15 │││
│ │ │ Priority: P1 │││
│ │ │ Due: Jan 22 │││
│ │ │ Owner: @alex │││
│ │ └───────────────────────────────────────────────────────┘││
│ │ ││
│ │ ┌───────────────────────────────────────────────────────┐││
│ │ │ [INC-457] Lower connection pool alert threshold │││
│ │ │ Status: Done ✓ │││
│ │ │ Source: Postmortem 2025-01-15 │││
│ │ │ Completed: Jan 16 │││
│ │ └───────────────────────────────────────────────────────┘││
│ │ ││
│ │ ┌───────────────────────────────────────────────────────┐││
│ │ │ [INC-458] Marketing calendar integration │││
│ │ │ Status: Not Started │││
│ │ │ Source: Postmortem 2025-01-15 │││
│ │ │ Due: Jan 19 │││
│ │ │ Owner: @jordan │││
│ │ └───────────────────────────────────────────────────────┘││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ WEEKLY REVIEW: │
│ ────────────── │
│ • Track open postmortem actions │
│ • Report on completion rate │
│ • Escalate overdue items │
│ • Celebrate completed improvements │
│ │
│ METRICS: │
│ • Postmortem action completion rate │
│ • Time to complete actions │
│ • Recurring incident rate │
└─────────────────────────────────────────────────────────────┘
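The weekly review and the completion-rate metric above can be computed from the tracked items. This is a rough sketch that models the three example tasks as plain Python objects; it does not call GitScrum, and the `ActionItem` fields and helper names are hypothetical. The owner and due date for INC-457 are taken from the postmortem template's action item list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    key: str
    title: str
    owner: str
    due: date
    completed: Optional[date] = None  # None while the item is still open


ITEMS = [
    ActionItem("INC-456", "Auto-scaling for connection pools", "@alex",
               date(2025, 1, 22)),
    ActionItem("INC-457", "Lower connection pool alert threshold", "@pat",
               date(2025, 1, 17), completed=date(2025, 1, 16)),
    ActionItem("INC-458", "Marketing calendar integration", "@jordan",
               date(2025, 1, 19)),
]


def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of action items that have been completed."""
    if not items:
        return 0.0
    return sum(1 for i in items if i.completed is not None) / len(items)


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose due date has passed; candidates for escalation."""
    return [i for i in items if i.completed is None and i.due < today]


if __name__ == "__main__":
    review_day = date(2025, 1, 20)  # assumed date of the weekly review
    print(f"Completion rate: {completion_rate(ITEMS):.0%}")
    for item in overdue(ITEMS, review_day):
        print(f"Escalate {item.key} ({item.owner}), due {item.due}")
```

Running a check like this before each weekly review gives the team a concrete list to escalate instead of relying on memory.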
Building Resilience
Learning Culture
INCIDENT LEARNING CULTURE:
┌─────────────────────────────────────────────────────────────┐
│ │
│ SHARE LEARNINGS: │
│ ──────────────── │
│ │
│ INCIDENT REVIEW MEETINGS: │
│ Monthly session to share postmortems across teams │
│ Learn from others' incidents │
│ │
│ INCIDENT NEWSLETTER: │
│ Weekly summary of recent incidents │
│ Key learnings and action items │
│ │
│ INCIDENT WIKI: │
│ Searchable archive of postmortems │
│ Patterns and common issues │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ GAME DAYS: │
│ ────────── │
│ Practice incident response before real incidents │
│ │
│ • Simulate outage scenarios │
│ • Practice incident roles │
│ • Test runbooks │
│ • Find gaps in monitoring │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ CELEBRATE NEAR-MISSES: │
│ ────────────────────── │
│ "We caught this before customers noticed" │
│ Near-miss reviews are just as valuable as postmortems │
│ Document what prevented the outage │
│ │
│ METRICS TO TRACK: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ • Mean Time to Detect (MTTD) ││
│ │ • Mean Time to Resolve (MTTR) ││
│ │ • Incident frequency ││
│ │ • Recurring incident rate ││
│ │ • Postmortem action completion rate ││
│ │ • Near-miss to incident ratio ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ GOAL: Get better at detecting, responding, and learning │
└─────────────────────────────────────────────────────────────┘
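MTTD and MTTR from the metrics list can be computed from three timestamps per incident. A minimal sketch follows, with two stated assumptions: both metrics are measured from the start of impact (some teams measure MTTR from detection instead), and the second record is made-up sample data added only so there is something to average. The first record uses the traffic spike, alert, and all-normal times from the example postmortem.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class IncidentRecord:
    started: datetime   # when impact began
    detected: datetime  # when an alert fired or a report arrived
    resolved: datetime  # when service was confirmed restored


RECORDS = [
    # Example postmortem: spike 14:23, alert 14:35, all normal 15:19.
    IncidentRecord(datetime(2025, 1, 15, 14, 23),
                   datetime(2025, 1, 15, 14, 35),
                   datetime(2025, 1, 15, 15, 19)),
    # Hypothetical second incident, for illustration only.
    IncidentRecord(datetime(2025, 2, 3, 9, 0),
                   datetime(2025, 2, 3, 9, 4),
                   datetime(2025, 2, 3, 9, 24)),
]


def mttd(records: list[IncidentRecord]) -> float:
    """Mean Time to Detect, in minutes (start of impact to detection)."""
    return mean((r.detected - r.started).total_seconds() / 60
                for r in records)


def mttr(records: list[IncidentRecord]) -> float:
    """Mean Time to Resolve, in minutes (start of impact to resolution)."""
    return mean((r.resolved - r.started).total_seconds() / 60
                for r in records)


if __name__ == "__main__":
    print(f"MTTD: {mttd(RECORDS):.1f} min")
    print(f"MTTR: {mttr(RECORDS):.1f} min")
```

Whichever start point you choose, apply it to every incident so the trend over time stays comparable.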