Incident Response and Postmortems

Incidents happen. GitScrum helps teams document incidents, track remediation, and capture learnings for continuous improvement.

Incident Response

Response Process

INCIDENT RESPONSE FLOW:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ 1. DETECT                                                   │
│ ──────────                                                  │
│ • Monitoring alert triggers                               │
│ • Customer reports issue                                  │
│ • Team member notices problem                             │
│                                                             │
│ 2. TRIAGE                                                   │
│ ─────────                                                   │
│ • Assess severity (SEV1, SEV2, SEV3)                      │
│ • Identify impacted services                              │
│ • Page appropriate on-call                                │
│                                                             │
│ 3. RESPOND                                                  │
│ ──────────                                                  │
│ • Assemble incident team                                  │
│ • Open incident channel                                   │
│ • Begin investigation                                     │
│                                                             │
│ 4. MITIGATE                                                 │
│ ───────────                                                 │
│ • Focus on restoring service                              │
│ • Roll back if needed                                     │
│ • Apply temporary fixes                                   │
│                                                             │
│ 5. RESOLVE                                                  │
│ ──────────                                                  │
│ • Confirm service restored                                │
│ • Monitor for stability                                   │
│ • Communicate resolution                                  │
│                                                             │
│ 6. LEARN                                                    │
│ ────────                                                    │
│ • Schedule postmortem                                     │
│ • Document timeline                                       │
│ • Create action items                                     │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ SEVERITY LEVELS:                                            │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SEV1 (Critical):                                        ││
│ │ Complete outage, all users affected                    ││
│ │ Response: Immediate, all hands                         ││
│ │                                                         ││
│ │ SEV2 (Major):                                           ││
│ │ Degraded service, many users affected                  ││
│ │ Response: Within 30 min, primary on-call               ││
│ │                                                         ││
│ │ SEV3 (Minor):                                           ││
│ │ Limited impact, workaround exists                      ││
│ │ Response: Next business day                            ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
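
The severity table above can be encoded so that triage is applied the same way every time. The following is a minimal Python sketch, not a GitScrum feature: the `Severity` enum, the `Impact` fields, and the 10% cutoff for "many users" are all illustrative assumptions that a team would tune to its own definitions.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "Critical"
    SEV2 = "Major"
    SEV3 = "Minor"


# Response expectations copied from the severity table above.
RESPONSE_POLICY = {
    Severity.SEV1: "Immediate, all hands",
    Severity.SEV2: "Within 30 min, primary on-call",
    Severity.SEV3: "Next business day",
}


@dataclass
class Impact:
    """Hypothetical triage input; field names are assumptions."""
    full_outage: bool
    affected_user_fraction: float  # 0.0 to 1.0


def classify(impact: Impact, many_users_threshold: float = 0.10) -> Severity:
    """Map observed impact to a severity level.

    The default threshold for "many users" is an assumed value.
    """
    if impact.full_outage:
        return Severity.SEV1
    if impact.affected_user_fraction >= many_users_threshold:
        return Severity.SEV2
    return Severity.SEV3


if __name__ == "__main__":
    sev = classify(Impact(full_outage=False, affected_user_fraction=0.4))
    print(sev.name, "->", RESPONSE_POLICY[sev])
```

Keeping the policy in one table means the paging rules and the documentation cannot drift apart without someone editing the same file.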

Incident Roles

INCIDENT TEAM ROLES:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ INCIDENT COMMANDER (IC):                                    │
│ ────────────────────────                                    │
│ • Coordinates response                                    │
│ • Makes decisions                                          │
│ • Assigns tasks                                            │
│ • Communicates with stakeholders                          │
│ • Calls for additional help                               │
│                                                             │
│ TECHNICAL LEAD:                                             │
│ ───────────────                                             │
│ • Leads investigation                                     │
│ • Directs debugging                                       │
│ • Proposes mitigations                                    │
│ • Implements fixes                                        │
│                                                             │
│ COMMUNICATIONS:                                             │
│ ───────────────                                             │
│ • Posts status updates                                    │
│ • Updates status page                                     │
│ • Communicates with customers                             │
│ • Keeps stakeholders informed                             │
│                                                             │
│ SCRIBE:                                                     │
│ ───────                                                     │
│ • Documents timeline                                      │
│ • Records decisions                                       │
│ • Captures what was tried                                 │
│ • Prepares for postmortem                                 │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ ROLE ASSIGNMENT:                                            │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ INCIDENT: Payment Processing Down                      ││
│ │ SEVERITY: SEV1                                          ││
│ │ TIME: 2025-01-15 14:30 UTC                             ││
│ │                                                         ││
│ │ IC:           @jordan                                   ││
│ │ Tech Lead:    @alex                                     ││
│ │ Comms:        @sam                                      ││
│ │ Scribe:       @pat                                      ││
│ │                                                         ││
│ │ Channel: #incident-2025-01-15-payments                 ││
│ │ Status Page: https://status.acme.co                    ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ Clear roles prevent confusion during chaos               │
└─────────────────────────────────────────────────────────────┘
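
The role assignment card above maps naturally onto a small record that can be checked before the response starts. This is a sketch under stated assumptions: the `Incident` class, the `slug` field, and the role keys are hypothetical names, and the channel naming pattern is inferred from the single example `#incident-2025-01-15-payments`.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The four roles named in the diagram above (key names are assumptions).
REQUIRED_ROLES = ("ic", "tech_lead", "comms", "scribe")


@dataclass
class Incident:
    title: str
    severity: str
    started_at: datetime
    slug: str  # short label used in the channel name (assumed field)
    roles: dict = field(default_factory=dict)

    def channel_name(self) -> str:
        # Pattern inferred from the example: #incident-<date>-<slug>
        return f"#incident-{self.started_at:%Y-%m-%d}-{self.slug}"

    def unfilled_roles(self) -> list:
        # Roles that still need an owner before the response is staffed.
        return [r for r in REQUIRED_ROLES if not self.roles.get(r)]


incident = Incident(
    title="Payment Processing Down",
    severity="SEV1",
    started_at=datetime(2025, 1, 15, 14, 30, tzinfo=timezone.utc),
    slug="payments",
    roles={"ic": "@jordan", "tech_lead": "@alex", "comms": "@sam", "scribe": "@pat"},
)

if __name__ == "__main__":
    print(incident.channel_name())
    print("Unfilled roles:", incident.unfilled_roles())
```

A check like `unfilled_roles()` makes a missing scribe or communications owner visible in the first minutes rather than at the postmortem.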

Postmortem Process

Blameless Postmortems

BLAMELESS POSTMORTEM PRINCIPLES:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ CORE PRINCIPLE:                                             │
│ ───────────────                                             │
│ People don't cause incidents—systems do                   │
│ Focus on: What allowed this to happen?                    │
│                                                             │
│ BLAMELESS LANGUAGE:                                         │
│ ───────────────────                                         │
│                                                             │
│ ❌ "John pushed bad code"                                  │
│ ✅ "The deployment pipeline didn't catch the bug"         │
│                                                             │
│ ❌ "Sarah didn't notice the alert"                         │
│ ✅ "The alert wasn't configured to page"                  │
│                                                             │
│ ❌ "The team was careless"                                 │
│ ✅ "The process didn't have sufficient safeguards"        │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ WHY BLAMELESS:                                              │
│ ──────────────                                              │
│                                                             │
│ BLAME:                                                      │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ → People hide mistakes                                 ││
│ │ → Less information shared                              ││
│ │ → Root causes stay hidden                              ││
│ │ → Incidents recur                                       ││
│ │ → Culture of fear                                       ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ BLAMELESS:                                                  │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ → People speak openly                                  ││
│ │ → Full context available                               ││
│ │ → True root causes found                               ││
│ │ → Systemic fixes implemented                           ││
│ │ → Culture of learning                                   ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ "If someone can make a mistake, the system is fragile"    │
└─────────────────────────────────────────────────────────────┘

Postmortem Template

POSTMORTEM DOCUMENT:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ INCIDENT: Payment Processing Outage                        │
│ DATE: January 15, 2025                                     │
│ DURATION: 47 minutes                                       │
│ SEVERITY: SEV1                                              │
│ AUTHOR: @jordan                                             │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ SUMMARY:                                                    │
│ ────────                                                    │
│ Payment processing was unavailable for 47 minutes         │
│ due to database connection pool exhaustion following     │
│ a traffic spike.                                           │
│                                                             │
│ IMPACT:                                                     │
│ ───────                                                     │
│ • ~2,500 failed payment attempts                          │
│ • Customer support tickets: 143                           │
│ • Estimated revenue impact: $45,000                       │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ TIMELINE:                                                   │
│ ─────────                                                   │
│ 14:23 - Traffic spike begins (marketing campaign)         │
│ 14:28 - Connection pool warnings in logs                  │
│ 14:32 - First customer reports failed payment             │
│ 14:35 - PagerDuty alert triggers                          │
│ 14:38 - Incident channel created, IC assigned             │
│ 14:45 - Root cause identified (connection exhaustion)     │
│ 14:52 - Decision to increase pool size                    │
│ 15:05 - Configuration deployed                            │
│ 15:12 - Payments recovering                                │
│ 15:19 - All systems normal, monitoring                    │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ ROOT CAUSE:                                                 │
│ ───────────                                                 │
│ Database connection pool was sized for normal traffic.    │
│ Marketing campaign drove 4x normal traffic without       │
│ advance notice to engineering.                            │
│                                                             │
│ CONTRIBUTING FACTORS:                                       │
│ ─────────────────────                                       │
│ • No auto-scaling for connection pools                    │
│ • Alert threshold too high (triggered late)               │
│ • Marketing/engineering communication gap                 │
│ • No load testing for high-traffic scenarios              │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ WHAT WENT WELL:                                             │
│ ────────────────                                            │
│ • Incident response was quick once triggered              │
│ • Root cause identified 7 minutes after IC assigned       │
│ • Fix was straightforward                                  │
│ • Customer communication was timely                       │
│                                                             │
│ WHAT COULD BE IMPROVED:                                     │
│ ────────────────────────                                    │
│ • Earlier detection (alerts didn't fire soon enough)      │
│ • Cross-team communication about traffic-driving events  │
│ • Auto-scaling configuration                               │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ ACTION ITEMS:                                               │
│ ─────────────                                               │
│ 1. Implement auto-scaling for connection pools   @alex    │
│    Due: Jan 22                                             │
│                                                             │
│ 2. Lower alert threshold from 80% to 60%         @pat     │
│    Due: Jan 17                                             │
│                                                             │
│ 3. Add marketing calendar to engineering         @jordan  │
│    Due: Jan 19                                             │
│                                                             │
│ 4. Load test at 5x normal traffic                @sam     │
│    Due: Jan 31                                             │
└─────────────────────────────────────────────────────────────┘
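
The intervals in the timeline above are worth computing rather than estimating, since they feed the detection and resolution metrics. A minimal Python sketch follows; the `minutes_between` helper is a hypothetical name, and it assumes all timestamps fall on the same day. Reading the 47-minute duration as "first customer report to all systems normal" is also an assumption, though it is the interval that matches the template.

```python
from datetime import datetime

# Selected entries from the example timeline above (times in HH:MM).
TIMELINE = {
    "traffic_spike": "14:23",
    "first_customer_report": "14:32",
    "alert": "14:35",
    "ic_assigned": "14:38",
    "root_cause_identified": "14:45",
    "fix_deployed": "15:05",
    "all_normal": "15:19",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end; assumes both are on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(TIMELINE["traffic_spike"], TIMELINE["alert"])
time_to_diagnose = minutes_between(TIMELINE["ic_assigned"], TIMELINE["root_cause_identified"])
customer_impact = minutes_between(TIMELINE["first_customer_report"], TIMELINE["all_normal"])

if __name__ == "__main__":
    print(f"Time to detect:   {time_to_detect} min")
    print(f"Time to diagnose: {time_to_diagnose} min")
    print(f"Customer impact:  {customer_impact} min")
```

For this timeline the sketch gives 12 minutes to detect, 7 minutes from IC assignment to root cause, and 47 minutes of customer impact.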

Action Item Tracking

Following Through

POSTMORTEM ACTION TRACKING:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ COMMON FAILURE:                                             │
│ ───────────────                                             │
│ Postmortems happen, action items get lost                 │
│ Same incident recurs 3 months later                       │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ TRACKING IN GITSCRUM:                                       │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ PROJECT: Reliability                                    ││
│ │                                                         ││
│ │ POSTMORTEM ACTION ITEMS                                  ││
│ │ ─────────────────────────                                ││
│ │                                                         ││
│ │ ┌───────────────────────────────────────────────────────┐││
│ │ │ [INC-456] Auto-scaling for connection pools          │││
│ │ │ Status: In Progress                                   │││
│ │ │ Source: Postmortem 2025-01-15                        │││
│ │ │ Priority: P1                                          │││
│ │ │ Due: Jan 22                                           │││
│ │ │ Owner: @alex                                          │││
│ │ └───────────────────────────────────────────────────────┘││
│ │                                                         ││
│ │ ┌───────────────────────────────────────────────────────┐││
│ │ │ [INC-457] Lower connection pool alert threshold      │││
│ │ │ Status: Done ✓                                        │││
│ │ │ Source: Postmortem 2025-01-15                        │││
│ │ │ Completed: Jan 16                                     │││
│ │ └───────────────────────────────────────────────────────┘││
│ │                                                         ││
│ │ ┌───────────────────────────────────────────────────────┐││
│ │ │ [INC-458] Marketing calendar integration             │││
│ │ │ Status: Not Started                                   │││
│ │ │ Source: Postmortem 2025-01-15                        │││
│ │ │ Due: Jan 19                                           │││
│ │ │ Owner: @jordan                                        │││
│ │ └───────────────────────────────────────────────────────┘││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ WEEKLY REVIEW:                                              │
│ ──────────────                                              │
│ • Track open postmortem actions                           │
│ • Report on completion rate                               │
│ • Escalate overdue items                                  │
│ • Celebrate completed improvements                        │
│                                                             │
│ METRICS:                                                    │
│ • Postmortem action completion rate                       │
│ • Time to complete actions                                │
│ • Recurring incident rate                                 │
└─────────────────────────────────────────────────────────────┘
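
The weekly review above relies on two numbers: how many actions are done and which ones are overdue. The sketch below computes both from an in-memory list. It does not call any GitScrum API; the `ActionItem` class and its fields are hypothetical, and the sample data mirrors the three cards shown above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    key: str
    title: str
    status: str  # "Not Started", "In Progress", or "Done"
    due: Optional[date] = None
    owner: Optional[str] = None


ITEMS = [
    ActionItem("INC-456", "Auto-scaling for connection pools",
               "In Progress", date(2025, 1, 22), "@alex"),
    ActionItem("INC-457", "Lower connection pool alert threshold", "Done"),
    ActionItem("INC-458", "Marketing calendar integration",
               "Not Started", date(2025, 1, 19), "@jordan"),
]


def completion_rate(items: list) -> float:
    """Fraction of action items marked Done (0.0 for an empty list)."""
    if not items:
        return 0.0
    return sum(item.status == "Done" for item in items) / len(items)


def overdue(items: list, today: date) -> list:
    """Open items whose due date has already passed."""
    return [
        item for item in items
        if item.status != "Done" and item.due is not None and item.due < today
    ]


if __name__ == "__main__":
    review_day = date(2025, 1, 20)  # assumed date of the weekly review
    print(f"Completion rate: {completion_rate(ITEMS):.0%}")
    for item in overdue(ITEMS, review_day):
        print(f"Escalate {item.key} ({item.owner}), due {item.due}")
```

With a review on Jan 20, only INC-458 would be escalated: INC-456 is still within its due date and INC-457 is already done.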

Building Resilience

Learning Culture

INCIDENT LEARNING CULTURE:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ SHARE LEARNINGS:                                            │
│ ────────────────                                            │
│                                                             │
│ INCIDENT REVIEW MEETINGS:                                   │
│ Monthly session to share postmortems across teams         │
│ Learn from others' incidents                              │
│                                                             │
│ INCIDENT NEWSLETTER:                                        │
│ Weekly summary of recent incidents                        │
│ Key learnings and action items                            │
│                                                             │
│ INCIDENT WIKI:                                              │
│ Searchable archive of postmortems                         │
│ Patterns and common issues                                │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ GAME DAYS:                                                  │
│ ──────────                                                  │
│ Practice incident response before real incidents          │
│                                                             │
│ • Simulate outage scenarios                               │
│ • Practice incident roles                                 │
│ • Test runbooks                                            │
│ • Find gaps in monitoring                                 │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ CELEBRATE NEAR-MISSES:                                      │
│ ──────────────────────                                      │
│ "We caught this before customers noticed"                 │
│ Near-miss reviews are as valuable as postmortems          │
│ Document what prevented the outage                        │
│                                                             │
│ METRICS TO TRACK:                                           │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ • Mean Time to Detect (MTTD)                           ││
│ │ • Mean Time to Resolve (MTTR)                          ││
│ │ • Incident frequency                                    ││
│ │ • Recurring incident rate                               ││
│ │ • Postmortem action completion rate                    ││
│ │ • Near-miss to incident ratio                          ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ GOAL: Get better at detecting, responding, and learning   │
└─────────────────────────────────────────────────────────────┘
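
Several of the metrics listed above reduce to simple averages over incident records. The following Python sketch shows one way to compute MTTD, MTTR, and the recurring incident rate. Definitions vary between teams; here both means are measured from the start of the incident, which is an assumption. The `IncidentRecord` class is hypothetical, the first record reuses the timeline from the postmortem example, and the second record is made-up sample data.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class IncidentRecord:
    started: datetime
    detected: datetime
    resolved: datetime
    recurring: bool = False  # True if a prior postmortem covered the same cause


def mttd_minutes(records: list) -> float:
    """Mean Time to Detect: average of (detected - started), in minutes."""
    return mean((r.detected - r.started).total_seconds() for r in records) / 60


def mttr_minutes(records: list) -> float:
    """Mean Time to Resolve: average of (resolved - started), in minutes."""
    return mean((r.resolved - r.started).total_seconds() for r in records) / 60


def recurring_rate(records: list) -> float:
    """Fraction of incidents flagged as recurring."""
    return sum(r.recurring for r in records) / len(records)


RECORDS = [
    # From the postmortem example: spike 14:23, alert 14:35, normal 15:19.
    IncidentRecord(datetime(2025, 1, 15, 14, 23),
                   datetime(2025, 1, 15, 14, 35),
                   datetime(2025, 1, 15, 15, 19)),
    # Made-up second incident, for illustration only.
    IncidentRecord(datetime(2025, 2, 3, 9, 0),
                   datetime(2025, 2, 3, 9, 8),
                   datetime(2025, 2, 3, 9, 24),
                   recurring=True),
]

if __name__ == "__main__":
    print(f"MTTD: {mttd_minutes(RECORDS):.1f} min")
    print(f"MTTR: {mttr_minutes(RECORDS):.1f} min")
    print(f"Recurring incident rate: {recurring_rate(RECORDS):.0%}")
```

Whichever definitions a team picks, applying them consistently matters more than the exact choice, since the trend over time is the useful signal.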
