Incident Response & Postmortems | Blameless Learning
Handle production incidents with structured response and blameless postmortems. GitScrum tracks remediation actions and captures learnings for continuous improvement.
Incidents happen. This guide covers a structured response process, blameless postmortems, and tracking remediation in GitScrum so that learnings lead to lasting improvements.
Incident Response
Response Process
INCIDENT RESPONSE FLOW:

  1. DETECT
     • Monitoring alert triggers
     • Customer reports issue
     • Team member notices problem

  2. TRIAGE
     • Assess severity (SEV1, SEV2, SEV3)
     • Identify impacted services
     • Page appropriate on-call

  3. RESPOND
     • Assemble incident team
     • Open incident channel
     • Begin investigation

  4. MITIGATE
     • Focus on restoring service
     • Roll back if needed
     • Apply temporary fixes

  5. RESOLVE
     • Confirm service restored
     • Monitor for stability
     • Communicate resolution

  6. LEARN
     • Schedule postmortem
     • Document timeline
     • Create action items

  SEVERITY LEVELS:

     SEV1 (Critical):
       Complete outage, all users affected
       Response: Immediate, all hands

     SEV2 (Major):
       Degraded service, many users affected
       Response: Within 30 min, primary on-call

     SEV3 (Minor):
       Limited impact, workaround exists
       Response: Next business day

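The severity table above can be turned into a small triage helper. The sketch below is illustrative only: the `Impact` fields and the 25% cutoff for "many users" are assumptions, not GitScrum features or thresholds from this guide. The response targets are copied from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV1"
    SEV2 = "SEV2"
    SEV3 = "SEV3"


# Response expectations copied from the severity table.
RESPONSE = {
    Severity.SEV1: "Immediate, all hands",
    Severity.SEV2: "Within 30 min, primary on-call",
    Severity.SEV3: "Next business day",
}


@dataclass(frozen=True)
class Impact:
    """Hypothetical triage inputs; adapt to what your monitoring reports."""
    complete_outage: bool
    affected_fraction: float  # share of users affected, 0.0 to 1.0


def classify(impact: Impact, major_threshold: float = 0.25) -> Severity:
    """Map observed impact to a severity level.

    major_threshold is an assumed cutoff for "many users affected".
    """
    if impact.complete_outage:
        return Severity.SEV1
    if impact.affected_fraction >= major_threshold:
        return Severity.SEV2
    return Severity.SEV3


if __name__ == "__main__":
    severity = classify(Impact(complete_outage=False, affected_fraction=0.4))
    print(severity.value, "->", RESPONSE[severity])
```

Encoding the rule, even roughly, gives the on-call engineer a default answer during triage; the incident commander can still override it.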
Incident Roles
INCIDENT TEAM ROLES:

  INCIDENT COMMANDER (IC):
     • Coordinates response
     • Makes decisions
     • Assigns tasks
     • Communicates with stakeholders
     • Calls for additional help

  TECHNICAL LEAD:
     • Leads investigation
     • Directs debugging
     • Proposes mitigations
     • Implements fixes

  COMMUNICATIONS:
     • Posts status updates
     • Updates status page
     • Communicates with customers
     • Keeps stakeholders informed

  SCRIBE:
     • Documents timeline
     • Records decisions
     • Captures what was tried
     • Prepares for postmortem

  ROLE ASSIGNMENT:

     INCIDENT: Payment Processing Down
     SEVERITY: SEV1
     TIME:     2025-01-15 14:30 UTC

     IC:        @jordan
     Tech Lead: @alex
     Comms:     @sam
     Scribe:    @pat

     Channel:     #incident-2025-01-15-payments
     Status Page: https://status.acme.co

  Clear roles prevent confusion during chaos.

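The role assignment above can be checked mechanically before the incident channel opens. This is a minimal sketch with assumed names: `IncidentDeclaration`, its fields, and the `service` slug are hypothetical. The channel format mirrors the `#incident-2025-01-15-payments` example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# The four roles described above; the key names are an assumption.
REQUIRED_ROLES = ("ic", "tech_lead", "comms", "scribe")


@dataclass(frozen=True)
class IncidentDeclaration:
    title: str
    severity: str
    started_at: datetime
    service: str  # short slug used in the channel name, e.g. "payments"
    roles: dict   # role key -> handle, e.g. {"ic": "@jordan"}

    def missing_roles(self) -> list:
        """Return the required roles that nobody has been assigned to."""
        return [role for role in REQUIRED_ROLES if not self.roles.get(role)]

    def channel_name(self) -> str:
        """Build a channel name in the #incident-YYYY-MM-DD-service style."""
        return f"#incident-{self.started_at:%Y-%m-%d}-{self.service}"


if __name__ == "__main__":
    incident = IncidentDeclaration(
        title="Payment Processing Down",
        severity="SEV1",
        started_at=datetime(2025, 1, 15, 14, 30, tzinfo=timezone.utc),
        service="payments",
        roles={"ic": "@jordan", "tech_lead": "@alex",
               "comms": "@sam", "scribe": "@pat"},
    )
    print(incident.channel_name())
    print("Unassigned roles:", incident.missing_roles())
```

A check like `missing_roles()` is cheap insurance that the scribe or communications role is not forgotten in the first minutes of a SEV1.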
Postmortem Process
Blameless Postmortems
BLAMELESS POSTMORTEM PRINCIPLES:

  CORE PRINCIPLE:
     People don't cause incidents; systems do.
     Focus on: What allowed this to happen?

  BLAMELESS LANGUAGE:

     Avoid:  "John pushed bad code"
     Prefer: "The deployment pipeline didn't catch the bug"

     Avoid:  "Sarah didn't notice the alert"
     Prefer: "The alert wasn't configured to page"

     Avoid:  "The team was careless"
     Prefer: "The process didn't have sufficient safeguards"

  WHY BLAMELESS:

     BLAME:
       → People hide mistakes
       → Less information shared
       → Root causes stay hidden
       → Incidents recur
       → Culture of fear

     BLAMELESS:
       → People speak openly
       → Full context available
       → True root causes found
       → Systemic fixes implemented
       → Culture of learning

  "If someone can make a mistake, the system is fragile"

Postmortem Template
POSTMORTEM DOCUMENT:

  INCIDENT: Payment Processing Outage
  DATE:     January 15, 2025
  DURATION: 47 minutes
  SEVERITY: SEV1
  AUTHOR:   @jordan

  SUMMARY:
     Payment processing was unavailable for 47 minutes
     due to database connection pool exhaustion following
     a traffic spike.

  IMPACT:
     • ~2,500 failed payment attempts
     • Customer support tickets: 143
     • Estimated revenue impact: $45,000

  TIMELINE:
     14:23 - Traffic spike begins (marketing campaign)
     14:28 - Connection pool warnings in logs
     14:32 - First customer reports failed payment
     14:35 - PagerDuty alert triggers
     14:38 - Incident channel created, IC assigned
     14:45 - Root cause identified (connection exhaustion)
     14:52 - Decision to increase pool size
     15:05 - Configuration deployed
     15:12 - Payments recovering
     15:19 - All systems normal, monitoring

  ROOT CAUSE:
     The database connection pool was sized for normal traffic.
     A marketing campaign drove 4x normal traffic without
     advance notice to engineering.

  CONTRIBUTING FACTORS:
     • No auto-scaling for connection pools
     • Alert threshold too high (triggered late)
     • Marketing/engineering communication gap
     • No load testing for high-traffic scenarios

  WHAT WENT WELL:
     • Incident response was quick once triggered
     • Root cause identified in under 10 minutes
     • Fix was straightforward
     • Customer communication was timely

  WHAT COULD BE IMPROVED:
     • Earlier detection (alerts didn't fire soon enough)
     • Cross-team communication about traffic-driving events
     • Auto-scaling configuration

  ACTION ITEMS:
     1. Implement auto-scaling for connection pools
        Owner: @alex     Due: Jan 22

     2. Lower alert threshold from 80% to 60%
        Owner: @pat      Due: Jan 17

     3. Share the marketing calendar with engineering
        Owner: @jordan   Due: Jan 19

     4. Load test at 5x normal traffic
        Owner: @sam      Due: Jan 31

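The durations quoted in a postmortem are easier to trust when they are derived from the timeline rather than estimated. The sketch below recomputes a few intervals from the timeline above. The event keys are hypothetical labels, and it assumes all events fall on the same day.

```python
from datetime import datetime

# Times copied from the postmortem timeline; the keys are illustrative labels.
TIMELINE = {
    "traffic_spike": "14:23",
    "pool_warnings": "14:28",
    "first_customer_report": "14:32",
    "alert_fired": "14:35",
    "incident_channel_opened": "14:38",
    "root_cause_identified": "14:45",
    "config_deployed": "15:05",
    "all_normal": "15:19",
}


def minutes_between(start_key: str, end_key: str, timeline: dict = TIMELINE) -> int:
    """Whole minutes between two timeline events (same-day times assumed)."""
    fmt = "%H:%M"
    start = datetime.strptime(timeline[start_key], fmt)
    end = datetime.strptime(timeline[end_key], fmt)
    return int((end - start).total_seconds() // 60)


if __name__ == "__main__":
    # Customer-visible outage: first failed payment report until all normal.
    print("Outage duration:", minutes_between("first_customer_report", "all_normal"))
    # Detection gap: traffic spike until the alert paged someone.
    print("Spike to alert:", minutes_between("traffic_spike", "alert_fired"))
    # Diagnosis: incident channel opened until root cause identified.
    print("Channel to root cause:", minutes_between("incident_channel_opened", "root_cause_identified"))
```

Measured this way, the 47-minute duration corresponds to the span from the first customer report to full recovery, and the 12 minutes between the traffic spike and the alert quantify the "earlier detection" improvement item.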
Action Item Tracking
Following Through
POSTMORTEM ACTION TRACKING:

  COMMON FAILURE:
     Postmortems happen, action items get lost
     Same incident recurs 3 months later

  TRACKING IN GITSCRUM:

     PROJECT: Reliability

     POSTMORTEM ACTION ITEMS

     [INC-456] Auto-scaling for connection pools
        Status:   In Progress
        Source:   Postmortem 2025-01-15
        Priority: P1
        Due:      Jan 22
        Owner:    @alex

     [INC-457] Lower connection pool alert threshold
        Status:    Done
        Source:    Postmortem 2025-01-15
        Completed: Jan 16

     [INC-458] Marketing calendar integration
        Status: Not Started
        Source: Postmortem 2025-01-15
        Due:    Jan 19
        Owner:  @jordan

  WEEKLY REVIEW:
     • Track open postmortem actions
     • Report on completion rate
     • Escalate overdue items
     • Celebrate completed improvements

  METRICS:
     • Postmortem action completion rate
     • Time to complete actions
     • Recurring incident rate

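The weekly review boils down to two questions: what share of actions is done, and which open ones are past due. Here is a minimal sketch of both calculations over the three items above. The `ActionItem` structure is hypothetical and does not use the GitScrum API; in practice the data would come from an export of the Reliability project. The due date and owner for INC-457 are taken from the postmortem action list.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ActionItem:
    key: str
    title: str
    status: str  # "Not Started", "In Progress", or "Done"
    due: date
    owner: str


# Items copied from the tracking example above (year taken from the postmortem date).
ITEMS = [
    ActionItem("INC-456", "Auto-scaling for connection pools",
               "In Progress", date(2025, 1, 22), "@alex"),
    ActionItem("INC-457", "Lower connection pool alert threshold",
               "Done", date(2025, 1, 17), "@pat"),
    ActionItem("INC-458", "Marketing calendar integration",
               "Not Started", date(2025, 1, 19), "@jordan"),
]


def completion_rate(items: list) -> float:
    """Fraction of action items marked Done (0.0 when there are none)."""
    if not items:
        return 0.0
    return sum(1 for item in items if item.status == "Done") / len(items)


def overdue(items: list, today: date) -> list:
    """Open items whose due date has already passed."""
    return [item for item in items if item.status != "Done" and item.due < today]


if __name__ == "__main__":
    review_day = date(2025, 1, 20)  # assumed date of the weekly review
    print(f"Completion rate: {completion_rate(ITEMS):.0%}")
    for item in overdue(ITEMS, review_day):
        print(f"Escalate {item.key} ({item.owner}), due {item.due:%b %d}")
```

Run on an assumed review date of January 20, this flags INC-458 for escalation while INC-456 is still within its due date.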
Building Resilience
Learning Culture
INCIDENT LEARNING CULTURE:

  SHARE LEARNINGS:

     INCIDENT REVIEW MEETINGS:
       Monthly session to share postmortems across teams
       Learn from others' incidents

     INCIDENT NEWSLETTER:
       Weekly summary of recent incidents
       Key learnings and action items

     INCIDENT WIKI:
       Searchable archive of postmortems
       Patterns and common issues

  GAME DAYS:
     Practice incident response before real incidents

     • Simulate outage scenarios
     • Practice incident roles
     • Test runbooks
     • Find gaps in monitoring

  CELEBRATE NEAR-MISSES:
     "We caught this before customers noticed"
     Just as valuable as postmortems
     Document what prevented the outage

  METRICS TO TRACK:
     • Mean Time to Detect (MTTD)
     • Mean Time to Resolve (MTTR)
     • Incident frequency
     • Recurring incident rate
     • Postmortem action completion rate
     • Near-miss to incident ratio

  GOAL: Get better at detecting, responding, and learning

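MTTD and MTTR are simple averages once each incident records when it started, when it was detected, and when it was resolved. The sketch below assumes both are measured from the start of the incident; teams define these clocks differently, so treat that as an assumption. The first record reuses times from the payment outage timeline, and the second is invented purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class IncidentRecord:
    started: datetime   # when the problem began
    detected: datetime  # when an alert or report surfaced it
    resolved: datetime  # when service was confirmed normal


def _minutes(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def mttd_minutes(incidents: list) -> float:
    """Mean Time to Detect: average of (detected - started), in minutes."""
    return mean([_minutes(i.started, i.detected) for i in incidents])


def mttr_minutes(incidents: list) -> float:
    """Mean Time to Resolve: average of (resolved - started), in minutes."""
    return mean([_minutes(i.started, i.resolved) for i in incidents])


INCIDENTS = [
    # Payment outage: spike 14:23, alert 14:35, all normal 15:19.
    IncidentRecord(datetime(2025, 1, 15, 14, 23),
                   datetime(2025, 1, 15, 14, 35),
                   datetime(2025, 1, 15, 15, 19)),
    # Hypothetical second incident, included only to make the average meaningful.
    IncidentRecord(datetime(2025, 2, 3, 9, 0),
                   datetime(2025, 2, 3, 9, 4),
                   datetime(2025, 2, 3, 9, 30)),
]

if __name__ == "__main__":
    print(f"MTTD: {mttd_minutes(INCIDENTS):.1f} min")
    print(f"MTTR: {mttr_minutes(INCIDENTS):.1f} min")
```

Tracking the trend of these numbers over time matters more than any single value, since a handful of incidents makes for a noisy average.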