Incident Response and Postmortems

Incidents happen. GitScrum helps teams document incidents, track remediation, and capture learnings for continuous improvement.

Incident Response

Response Process

INCIDENT RESPONSE FLOW:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ 1. DETECT                                                   │
│ ──────────                                                  │
│ • Monitoring alert triggers                               │
│ • Customer reports issue                                  │
│ • Team member notices problem                             │
│                                                             │
│ 2. TRIAGE                                                   │
│ ─────────                                                   │
│ • Assess severity (SEV1, SEV2, SEV3)                      │
│ • Identify impacted services                              │
│ • Page appropriate on-call                                │
│                                                             │
│ 3. RESPOND                                                  │
│ ──────────                                                  │
│ • Assemble incident team                                  │
│ • Open incident channel                                   │
│ • Begin investigation                                     │
│                                                             │
│ 4. MITIGATE                                                 │
│ ───────────                                                 │
│ • Focus on restoring service                              │
│ • Roll back if needed                                     │
│ • Apply temporary fixes                                   │
│                                                             │
│ 5. RESOLVE                                                  │
│ ──────────                                                  │
│ • Confirm service restored                                │
│ • Monitor for stability                                   │
│ • Communicate resolution                                  │
│                                                             │
│ 6. LEARN                                                    │
│ ────────                                                    │
│ • Schedule postmortem                                     │
│ • Document timeline                                       │
│ • Create action items                                     │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ SEVERITY LEVELS:                                            │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SEV1 (Critical):                                        ││
│ │ Complete outage, all users affected                    ││
│ │ Response: Immediate, all hands                         ││
│ │                                                         ││
│ │ SEV2 (Major):                                           ││
│ │ Degraded service, many users affected                  ││
│ │ Response: Within 30 min, primary on-call               ││
│ │                                                         ││
│ │ SEV3 (Minor):                                           ││
│ │ Limited impact, workaround exists                      ││
│ │ Response: Next business day                            ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
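The severity table above can be expressed as a small triage helper. This is a minimal sketch, not a GitScrum feature: the `Impact` fields and the 25% cutoff for "many users" are assumptions made for illustration, and only the response strings come from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "Critical"
    SEV2 = "Major"
    SEV3 = "Minor"


@dataclass(frozen=True)
class Impact:
    """Hypothetical triage inputs; the field names are illustrative."""
    complete_outage: bool
    users_affected_pct: float  # 0-100


# Response expectations copied from the severity table.
RESPONSE = {
    Severity.SEV1: "Immediate, all hands",
    Severity.SEV2: "Within 30 min, primary on-call",
    Severity.SEV3: "Next business day",
}


def triage(impact: Impact, many_users_pct: float = 25.0) -> Severity:
    """Map observed impact to a severity level.

    The 25% default for "many users" is an assumption; tune it per service.
    """
    if impact.complete_outage:
        return Severity.SEV1
    if impact.users_affected_pct >= many_users_pct:
        return Severity.SEV2
    return Severity.SEV3


if __name__ == "__main__":
    sev = triage(Impact(complete_outage=False, users_affected_pct=40.0))
    print(sev.name, "->", RESPONSE[sev])
```

Writing the cutoff down, even a rough one, tends to shorten triage because responders argue about the threshold once rather than during every incident.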

Incident Roles

INCIDENT TEAM ROLES:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ INCIDENT COMMANDER (IC):                                    │
│ ────────────────────────                                    │
│ • Coordinates response                                    │
│ • Makes decisions                                          │
│ • Assigns tasks                                            │
│ • Communicates with stakeholders                          │
│ • Calls for additional help                               │
│                                                             │
│ TECHNICAL LEAD:                                             │
│ ───────────────                                             │
│ • Leads investigation                                     │
│ • Directs debugging                                       │
│ • Proposes mitigations                                    │
│ • Implements fixes                                        │
│                                                             │
│ COMMUNICATIONS:                                             │
│ ───────────────                                             │
│ • Posts status updates                                    │
│ • Updates status page                                     │
│ • Communicates with customers                             │
│ • Keeps stakeholders informed                             │
│                                                             │
│ SCRIBE:                                                     │
│ ───────                                                     │
│ • Documents timeline                                      │
│ • Records decisions                                       │
│ • Captures what was tried                                 │
│ • Prepares for postmortem                                 │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ ROLE ASSIGNMENT:                                            │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ INCIDENT: Payment Processing Down                      ││
│ │ SEVERITY: SEV1                                          ││
│ │ TIME: 2025-01-15 14:30 UTC                             ││
│ │                                                         ││
│ │ IC:           @jordan                                   ││
│ │ Tech Lead:    @alex                                     ││
│ │ Comms:        @sam                                      ││
│ │ Scribe:       @pat                                      ││
│ │                                                         ││
│ │ Channel: #incident-2025-01-15-payments                 ││
│ │ Status Page: https://status.acme.co                    ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ Clear roles prevent confusion during chaos               │
└─────────────────────────────────────────────────────────────┘
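The role assignment card above could be modeled as a record that derives the channel name and flags unfilled roles. This is a sketch under assumptions: the `Incident` class, its `slug` field, and the role keys are hypothetical names, and the channel format is inferred from the single example `#incident-2025-01-15-payments`.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The four roles described above; the key names are illustrative.
REQUIRED_ROLES = ("ic", "tech_lead", "comms", "scribe")


@dataclass
class Incident:
    title: str
    severity: str
    started_at: datetime
    slug: str  # short service name used in the channel, e.g. "payments"
    roles: dict = field(default_factory=dict)

    def channel(self) -> str:
        """Channel name in the pattern shown in the example card."""
        return f"#incident-{self.started_at:%Y-%m-%d}-{self.slug}"

    def unfilled_roles(self) -> list:
        """Roles nobody has taken yet, in the order listed above."""
        return [r for r in REQUIRED_ROLES if not self.roles.get(r)]


if __name__ == "__main__":
    incident = Incident(
        title="Payment Processing Down",
        severity="SEV1",
        started_at=datetime(2025, 1, 15, 14, 30, tzinfo=timezone.utc),
        slug="payments",
        roles={"ic": "@jordan", "tech_lead": "@alex", "comms": "@sam"},
    )
    print(incident.channel())
    print("Still needed:", incident.unfilled_roles())
```

Checking for unfilled roles when the incident opens makes a missing scribe visible before the timeline has gaps.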

Postmortem Process

Blameless Postmortems

BLAMELESS POSTMORTEM PRINCIPLES:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ CORE PRINCIPLE:                                             │
│ ───────────────                                             │
│ People don't cause incidents—systems do                   │
│ Focus on: What allowed this to happen?                    │
│                                                             │
│ BLAMELESS LANGUAGE:                                         │
│ ───────────────────                                         │
│                                                             │
│ ❌ "John pushed bad code"                                  │
│ ✅ "The deployment pipeline didn't catch the bug"         │
│                                                             │
│ ❌ "Sarah didn't notice the alert"                         │
│ ✅ "The alert wasn't configured to page"                  │
│                                                             │
│ ❌ "The team was careless"                                 │
│ ✅ "The process didn't have sufficient safeguards"        │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ WHY BLAMELESS:                                              │
│ ──────────────                                              │
│                                                             │
│ BLAME:                                                      │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ → People hide mistakes                                 ││
│ │ → Less information shared                              ││
│ │ → Root causes stay hidden                              ││
│ │ → Incidents recur                                       ││
│ │ → Culture of fear                                       ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ BLAMELESS:                                                  │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ → People speak openly                                  ││
│ │ → Full context available                               ││
│ │ → True root causes found                               ││
│ │ → Systemic fixes implemented                           ││
│ │ → Culture of learning                                   ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ "If someone can make a mistake, the system is fragile"    │
└─────────────────────────────────────────────────────────────┘

Postmortem Template

POSTMORTEM DOCUMENT:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ INCIDENT: Payment Processing Outage                        │
│ DATE: January 15, 2025                                     │
│ DURATION: 47 minutes                                       │
│ SEVERITY: SEV1                                              │
│ AUTHOR: @jordan                                             │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ SUMMARY:                                                    │
│ ────────                                                    │
│ Payment processing was unavailable for 47 minutes         │
│ due to database connection pool exhaustion following     │
│ a traffic spike.                                           │
│                                                             │
│ IMPACT:                                                     │
│ ───────                                                     │
│ • ~2,500 failed payment attempts                          │
│ • Customer support tickets: 143                           │
│ • Estimated revenue impact: $45,000                       │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ TIMELINE:                                                   │
│ ─────────                                                   │
│ 14:23 - Traffic spike begins (marketing campaign)         │
│ 14:28 - Connection pool warnings in logs                  │
│ 14:32 - First customer reports failed payment             │
│ 14:35 - PagerDuty alert triggers                          │
│ 14:38 - Incident channel created, IC assigned             │
│ 14:45 - Root cause identified (connection exhaustion)     │
│ 14:52 - Decision to increase pool size                    │
│ 15:05 - Configuration deployed                            │
│ 15:12 - Payments recovering                                │
│ 15:19 - All systems normal, monitoring                    │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ ROOT CAUSE:                                                 │
│ ───────────                                                 │
│ Database connection pool was sized for normal traffic.    │
│ Marketing campaign drove 4x normal traffic without       │
│ advance notice to engineering.                            │
│                                                             │
│ CONTRIBUTING FACTORS:                                       │
│ ─────────────────────                                       │
│ • No auto-scaling for connection pools                    │
│ • Alert threshold too high (triggered late)               │
│ • Marketing/engineering communication gap                 │
│ • No load testing for high-traffic scenarios              │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ WHAT WENT WELL:                                             │
│ ────────────────                                            │
│ • Incident response was quick once triggered              │
│ • Root cause identified 7 minutes after IC assigned       │
│ • Fix was straightforward                                  │
│ • Customer communication was timely                       │
│                                                             │
│ WHAT COULD BE IMPROVED:                                     │
│ ────────────────────────                                    │
│ • Earlier detection (alerts didn't fire soon enough)      │
│ • Cross-team communication about traffic-driving events  │
│ • Auto-scaling configuration                               │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ ACTION ITEMS:                                               │
│ ─────────────                                               │
│ 1. Implement auto-scaling for connection pools   @alex    │
│    Due: Jan 22                                             │
│                                                             │
│ 2. Lower alert threshold from 80% to 60%         @pat     │
│    Due: Jan 17                                             │
│                                                             │
│ 3. Add marketing calendar to engineering         @jordan  │
│    Due: Jan 19                                             │
│                                                             │
│ 4. Load test at 5x normal traffic                @sam     │
│    Due: Jan 31                                             │
└─────────────────────────────────────────────────────────────┘
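The intervals in a postmortem are easier to trust when they are computed from the timeline rather than estimated. The sketch below recomputes a few intervals from the example timeline above. The dictionary keys are illustrative names, and reading the 47-minute duration as "first customer report to all systems normal" is an interpretation of the example, not something the template states.

```python
from datetime import datetime

# Selected entries from the example timeline (same day, UTC).
TIMELINE = {
    "spike_begins": "14:23",
    "first_customer_report": "14:32",
    "alert_triggers": "14:35",
    "root_cause_identified": "14:45",
    "all_systems_normal": "15:19",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(
    TIMELINE["spike_begins"], TIMELINE["alert_triggers"]
)
customer_facing_duration = minutes_between(
    TIMELINE["first_customer_report"], TIMELINE["all_systems_normal"]
)
alert_to_root_cause = minutes_between(
    TIMELINE["alert_triggers"], TIMELINE["root_cause_identified"]
)

if __name__ == "__main__":
    print("Spike to alert:", time_to_detect, "min")                       # 12
    print("Customer-facing duration:", customer_facing_duration, "min")   # 47
    print("Alert to root cause:", alert_to_root_cause, "min")             # 10
```

The 12-minute gap between the traffic spike and the alert is the number that supports the "alert threshold too high" contributing factor.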

Action Item Tracking

Following Through

POSTMORTEM ACTION TRACKING:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ COMMON FAILURE:                                             │
│ ───────────────                                             │
│ Postmortems happen, but action items get lost             │
│ The same incident then recurs 3 months later              │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ TRACKING IN GITSCRUM:                                       │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ PROJECT: Reliability                                    ││
│ │                                                         ││
│ │ POSTMORTEM ACTION ITEMS                                  ││
│ │ ─────────────────────────                                ││
│ │                                                         ││
│ │ ┌───────────────────────────────────────────────────────┐││
│ │ │ [INC-456] Auto-scaling for connection pools          │││
│ │ │ Status: In Progress                                   │││
│ │ │ Source: Postmortem 2025-01-15                        │││
│ │ │ Priority: P1                                          │││
│ │ │ Due: Jan 22                                           │││
│ │ │ Owner: @alex                                          │││
│ │ └───────────────────────────────────────────────────────┘││
│ │                                                         ││
│ │ ┌───────────────────────────────────────────────────────┐││
│ │ │ [INC-457] Lower connection pool alert threshold      │││
│ │ │ Status: Done ✓                                        │││
│ │ │ Source: Postmortem 2025-01-15                        │││
│ │ │ Completed: Jan 16                                     │││
│ │ └───────────────────────────────────────────────────────┘││
│ │                                                         ││
│ │ ┌───────────────────────────────────────────────────────┐││
│ │ │ [INC-458] Marketing calendar integration             │││
│ │ │ Status: Not Started                                   │││
│ │ │ Source: Postmortem 2025-01-15                        │││
│ │ │ Due: Jan 19                                           │││
│ │ │ Owner: @jordan                                        │││
│ │ └───────────────────────────────────────────────────────┘││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ WEEKLY REVIEW:                                              │
│ ──────────────                                              │
│ • Track open postmortem actions                           │
│ • Report on completion rate                               │
│ • Escalate overdue items                                  │
│ • Celebrate completed improvements                        │
│                                                             │
│ METRICS:                                                    │
│ • Postmortem action completion rate                       │
│ • Time to complete actions                                │
│ • Recurring incident rate                                 │
└─────────────────────────────────────────────────────────────┘
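The weekly review reduces to two calculations: the completion rate and the list of overdue items. The sketch below runs both over the three example cards. The `ActionItem` class is hypothetical and does not represent GitScrum's data model or API, and the review date of Jan 20 is assumed for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    key: str
    title: str
    due: Optional[date]
    done: bool = False


def completion_rate(items: list) -> float:
    """Fraction of action items marked done (0.0 for an empty list)."""
    return sum(item.done for item in items) / len(items) if items else 0.0


def overdue(items: list, today: date) -> list:
    """Open items whose due date has already passed."""
    return [
        item for item in items
        if not item.done and item.due is not None and item.due < today
    ]


# The three cards from the example board, with due dates from the postmortem.
ITEMS = [
    ActionItem("INC-456", "Auto-scaling for connection pools", date(2025, 1, 22)),
    ActionItem("INC-457", "Lower connection pool alert threshold",
               date(2025, 1, 17), done=True),
    ActionItem("INC-458", "Marketing calendar integration", date(2025, 1, 19)),
]

if __name__ == "__main__":
    review_day = date(2025, 1, 20)  # assumed review date
    print(f"Completion rate: {completion_rate(ITEMS):.0%}")
    for item in overdue(ITEMS, review_day):
        print("Escalate:", item.key, item.title)
```

On the assumed review date, only INC-458 is overdue, so it is the one item the review would escalate.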

Building Resilience

Learning Culture

INCIDENT LEARNING CULTURE:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ SHARE LEARNINGS:                                            │
│ ────────────────                                            │
│                                                             │
│ INCIDENT REVIEW MEETINGS:                                   │
│ Monthly session to share postmortems across teams         │
│ Learn from others' incidents                              │
│                                                             │
│ INCIDENT NEWSLETTER:                                        │
│ Weekly summary of recent incidents                        │
│ Key learnings and action items                            │
│                                                             │
│ INCIDENT WIKI:                                              │
│ Searchable archive of postmortems                         │
│ Patterns and common issues                                │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ GAME DAYS:                                                  │
│ ──────────                                                  │
│ Practice incident response before real incidents          │
│                                                             │
│ • Simulate outage scenarios                               │
│ • Practice incident roles                                 │
│ • Test runbooks                                            │
│ • Find gaps in monitoring                                 │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ CELEBRATE NEAR-MISSES:                                      │
│ ──────────────────────                                      │
│ "We caught this before customers noticed"                 │
│ Near-miss reviews are as valuable as postmortems          │
│ Document what prevented the outage                        │
│                                                             │
│ METRICS TO TRACK:                                           │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ • Mean Time to Detect (MTTD)                           ││
│ │ • Mean Time to Resolve (MTTR)                          ││
│ │ • Incident frequency                                    ││
│ │ • Recurring incident rate                               ││
│ │ • Postmortem action completion rate                    ││
│ │ • Near-miss to incident ratio                          ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ GOAL: Get better at detecting, responding, and learning   │
└─────────────────────────────────────────────────────────────┘
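Several of the metrics above are simple averages over incident records. The sketch below is illustrative only: the `IncidentRecord` fields are hypothetical, and it assumes MTTD runs from fault start to detection and MTTR from detection to resolution. Teams define MTTR differently, so state the definition wherever the number is reported. The first record uses times from the example postmortem; the second is made up for the demonstration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class IncidentRecord:
    started: datetime    # when the fault began
    detected: datetime   # when an alert or report first arrived
    resolved: datetime   # when service was confirmed restored
    recurrence: bool = False  # same root cause as an earlier incident


def _minutes(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def mttd_minutes(records: list) -> float:
    """Mean Time to Detect: fault start to detection."""
    return mean(_minutes(r.started, r.detected) for r in records)


def mttr_minutes(records: list) -> float:
    """Mean Time to Resolve, measured here from detection (an assumption)."""
    return mean(_minutes(r.detected, r.resolved) for r in records)


def recurring_rate(records: list) -> float:
    """Share of incidents that repeat an earlier root cause."""
    return sum(r.recurrence for r in records) / len(records)


RECORDS = [
    # Times taken from the example postmortem (spike, alert, all normal).
    IncidentRecord(datetime(2025, 1, 15, 14, 23),
                   datetime(2025, 1, 15, 14, 35),
                   datetime(2025, 1, 15, 15, 19)),
    # Invented second incident, for illustration only.
    IncidentRecord(datetime(2025, 2, 3, 9, 0),
                   datetime(2025, 2, 3, 9, 4),
                   datetime(2025, 2, 3, 9, 34),
                   recurrence=True),
]

if __name__ == "__main__":
    print("MTTD (min):", mttd_minutes(RECORDS))
    print("MTTR (min):", mttr_minutes(RECORDS))
    print("Recurring incident rate:", recurring_rate(RECORDS))
```

Averages over a handful of incidents swing widely, so the trend over several months is more informative than any single value.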