Post-Mortem Analysis for Teams | Blameless Incident Review
Run blameless post-mortems that find systemic issues. Use 5 whys analysis, document contributing factors, and track action items to prevent repeat incidents.
Incidents are inevitable; failing to learn from them isn't. GitScrum helps teams document, analyze, and track post-mortem outcomes to build more resilient systems and processes.
Post-Mortem Principles
Blameless Culture
BLAMELESS VS BLAME CULTURE:
─────────────────────────────────────────────────────────────

BLAME CULTURE:

"Who pushed the bad code?"
"Why didn't you test this?"
"This is your responsibility"

RESULT:
• People hide mistakes
• Incidents go unreported
• Root causes stay hidden
• Problems repeat
• Culture of fear

BLAMELESS CULTURE:

"What in our system allowed this to happen?"
"How can we make the right thing easy?"
"What did we learn?"

RESULT:
• Incidents reported early
• Real root causes found
• Systems improve
• Problems don't repeat
• Culture of learning

KEY INSIGHT:
"The person who made the 'mistake' is often the person
closest to the problem and best positioned to fix it."
─────────────────────────────────────────────────────────────
When to Run Post-Mortems
POST-MORTEM TRIGGERS:
─────────────────────────────────────────────────────────────

DEFINITELY RUN POST-MORTEM:
• Production outage
• Data loss or breach
• Significant user impact
• Missed SLA
• Security incident

CONSIDER POST-MORTEM:
• Near miss (almost had incident)
• Minor incident with learning value
• Process failure
• Significant missed deadline

LIGHTER REVIEW:
• Bug that reached production
• Small process issues
• Learning opportunity

SEVERITY DEFINITIONS:

SEV-1: Critical - Full post-mortem required
       System down, data loss, major impact

SEV-2: Major - Post-mortem recommended
       Partial outage, significant user impact

SEV-3: Minor - Brief review
       Small impact, quick recovery

SEV-4: Low - Optional
       Minimal impact, mostly internal
─────────────────────────────────────────────────────────────
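The severity definitions above can be encoded as a small lookup so that tooling and people agree on which review a given incident calls for. The following is a minimal sketch; the names `Severity`, `REVIEW_POLICY`, `review_for`, and `postmortem_expected` are illustrative assumptions, not part of GitScrum or any other tool.

```python
from enum import Enum


class Severity(Enum):
    """Severity levels as defined in the table above."""
    SEV1 = 1  # Critical: system down, data loss, major impact
    SEV2 = 2  # Major: partial outage, significant user impact
    SEV3 = 3  # Minor: small impact, quick recovery
    SEV4 = 4  # Low: minimal impact, mostly internal


# Review type per severity, copied from the definitions above.
REVIEW_POLICY = {
    Severity.SEV1: "Full post-mortem required",
    Severity.SEV2: "Post-mortem recommended",
    Severity.SEV3: "Brief review",
    Severity.SEV4: "Optional",
}


def review_for(severity: Severity) -> str:
    """Return the review type that the policy assigns to a severity."""
    return REVIEW_POLICY[severity]


def postmortem_expected(severity: Severity) -> bool:
    """True when a post-mortem is required or recommended (SEV-1, SEV-2)."""
    return severity in (Severity.SEV1, Severity.SEV2)


if __name__ == "__main__":
    for sev in Severity:
        print(f"{sev.name}: {review_for(sev)}")
```

Keeping the policy in one table means a change to the definitions only has to be made in one place.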
Post-Mortem Process
Timeline
POST-MORTEM TIMELINE:
─────────────────────────────────────────────────────────────

INCIDENT
   │
   ▼
RESPOND (Minutes to Hours)
• Mitigate impact
• Communicate status
• Document as you go
   │
   ▼
RESOLVE (Hours to Days)
• Fix the immediate issue
• Verify resolution
• Declare incident closed
   │
   ▼
COOL DOWN (1-3 Days)
• Let emotions settle
• Gather data and logs
• Draft timeline
   │
   ▼
POST-MORTEM (Day 3-5)
• Facilitate meeting
• Analyze root causes
• Document learnings
• Assign action items
   │
   ▼
FOLLOW UP (Ongoing)
• Track action items
• Verify fixes deployed
• Close when complete
─────────────────────────────────────────────────────────────
Meeting Structure
POST-MORTEM MEETING AGENDA:
─────────────────────────────────────────────────────────────

DURATION: 45-90 minutes (based on severity)

ATTENDEES:
• People involved in incident response
• Engineering leadership (for major incidents)
• Facilitator (neutral party preferred)

AGENDA:

1. SET THE STAGE (5 min)
   • Read blameless culture reminder
   • State the purpose: to learn and improve

2. TIMELINE REVIEW (15 min)
   • Walk through what happened
   • Fill in gaps, correct errors
   • Note key decision points

3. CONTRIBUTING FACTORS (20 min)
   • What conditions led to this?
   • Why did each factor exist?
   • Keep asking "why" (5 whys technique)

4. WHAT WENT WELL (10 min)
   • What worked in our response?
   • What do we want to keep doing?

5. ACTION ITEMS (15 min)
   • What will prevent recurrence?
   • What will improve detection?
   • What will speed up recovery?
   • Assign owners and deadlines

6. CLOSE
   • Confirm action items
   • Schedule follow-up if needed
─────────────────────────────────────────────────────────────
Analysis Techniques
5 Whys
5 WHYS TECHNIQUE:
─────────────────────────────────────────────────────────────

INCIDENT: Production database crashed

WHY 1: Why did the database crash?
→ Disk space ran out

WHY 2: Why did disk space run out?
→ Log files grew unexpectedly large

WHY 3: Why did log files grow unexpectedly?
→ A new feature was logging too much data

WHY 4: Why was the feature logging too much?
→ Debug logging was left enabled in production

WHY 5: Why was debug logging enabled in production?
→ No check in CI/CD to prevent it

ROOT CAUSE: Missing deployment safeguards

ACTION ITEMS:
1. Add CI check to fail on debug logging in prod
2. Set up disk space monitoring with alerts
3. Implement log rotation

TIPS:
• You may need fewer or more than 5 whys
• There may be multiple branches of causes
• Stop when you reach actionable root causes
• Focus on systems, not people
─────────────────────────────────────────────────────────────
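Action item 1 above ("Add CI check to fail on debug logging in prod") could look roughly like the script below. It is a sketch under stated assumptions: it assumes production config files set the log level with a line such as `LOG_LEVEL=debug` or `log_level: "DEBUG"`, which is a hypothetical convention. The key name, file format, and how the script is wired into a pipeline will differ per project.

```python
import re
import sys

# Assumed convention (hypothetical): the log level is set on its own line,
# e.g. `LOG_LEVEL=debug` or `log_level: "DEBUG"`. Commented-out lines that
# start with `#` do not match because the key must begin the line.
DEBUG_PATTERN = re.compile(
    r"""^\s*LOG_LEVEL\s*[:=]\s*["']?debug["']?\s*$""",
    re.IGNORECASE | re.MULTILINE,
)


def has_debug_logging(config_text: str) -> bool:
    """Return True if the config text enables debug-level logging."""
    return DEBUG_PATTERN.search(config_text) is not None


def main(paths: list[str]) -> int:
    """Check each config file; return a non-zero exit code on any offender."""
    offenders = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            if has_debug_logging(handle.read()):
                offenders.append(path)
    for path in offenders:
        print(f"debug logging enabled in production config: {path}")
    return 1 if offenders else 0


if __name__ == "__main__":
    # Usage in a CI step (illustrative): python check_debug.py config/prod.env
    sys.exit(main(sys.argv[1:]))
```

A non-zero exit code fails the CI step, which turns the lesson from the 5 whys into a safeguard that does not depend on anyone remembering it.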
Contributing Factors
CONTRIBUTING FACTORS ANALYSIS:
─────────────────────────────────────────────────────────────

FACTOR CATEGORIES:

TECHNICAL:
• Missing monitoring
• No automatic recovery
• Single point of failure
• Technical debt

PROCESS:
• Insufficient testing
• Missing runbook
• Unclear ownership
• Communication gaps

ENVIRONMENTAL:
• Time pressure
• Resource constraints
• Organizational change
• External dependencies

EXAMPLE MAPPING:

Incident: 2-hour checkout outage

Technical: No circuit breaker when payment API failed
Process: No runbook for payment failures
Environmental: Launch deadline pressure skipped testing

Multiple factors usually combine to cause incidents
Swiss cheese model: Multiple holes had to align
─────────────────────────────────────────────────────────────
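The technical factor in the example mapping is a missing circuit breaker around the payment API. For readers unfamiliar with the pattern, here is a minimal sketch of one. The class names, thresholds, and the injectable `clock` parameter are assumptions chosen for illustration; a production service would more likely use an established resilience library than this hand-rolled version.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Fail fast after repeated failures, then retry after a cool-off."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: reject immediately instead of calling the dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                # Trip (or re-trip) the breaker and restart the cool-off.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping the payment call as `breaker.call(charge_card, order)` (both names hypothetical) lets checkout degrade quickly when the dependency is down instead of piling up slow, failing requests.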
Documentation
Post-Mortem Template
POST-MORTEM DOCUMENT TEMPLATE:
─────────────────────────────────────────────────────────────

INCIDENT: [Title]
DATE: [When it occurred]
SEVERITY: [SEV-1/2/3/4]
DURATION: [Total incident time]
IMPACT: [Users affected, revenue lost, etc.]
AUTHORS: [Who wrote this]
STATUS: [Draft/Reviewed/Closed]

## Summary
[2-3 sentence overview of what happened]

## Timeline
14:32 - Alert fired for high error rate
14:35 - On-call acknowledged
14:40 - Root cause identified
15:15 - Fix deployed
15:20 - Error rate returned to normal

## Contributing Factors
[What conditions led to this incident]

## Detection
[How was it discovered? Could we detect faster?]

## Response
[What did we do? What went well? What was difficult?]

## Root Cause
[Underlying cause(s)]

## Action Items
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| ...    | ...   | ... | ...    |

## Lessons Learned
[Key takeaways for the organization]
─────────────────────────────────────────────────────────────
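The DURATION field and the Detection and Response sections are easier to fill in when the intervals are computed from the timeline rather than estimated. The sketch below uses the sample timestamps from the template. The event keys and the `minutes_between` helper are illustrative names, and the code assumes all events fall on the same day.

```python
from datetime import datetime

# Sample timeline from the template above; the keys are illustrative labels.
TIMELINE = {
    "alert": "14:32",         # Alert fired for high error rate
    "acknowledged": "14:35",  # On-call acknowledged
    "root_cause": "14:40",    # Root cause identified
    "fix_deployed": "15:15",  # Fix deployed
    "recovered": "15:20",     # Error rate returned to normal
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day (assumption)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def incident_metrics(timeline: dict) -> dict:
    """Derive response intervals, each measured from the first alert."""
    alert = timeline["alert"]
    return {
        "time_to_acknowledge": minutes_between(alert, timeline["acknowledged"]),
        "time_to_root_cause": minutes_between(alert, timeline["root_cause"]),
        "time_to_fix": minutes_between(alert, timeline["fix_deployed"]),
        "total_duration": minutes_between(alert, timeline["recovered"]),
    }


if __name__ == "__main__":
    for name, minutes in incident_metrics(TIMELINE).items():
        print(f"{name}: {minutes} min")
```

For the sample timeline this gives 3 minutes to acknowledge, 8 to identify the root cause, 43 to deploy the fix, and 48 in total. Note that these are measured from the alert, not from when the problem actually began, so they say nothing about detection delay.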
Action Tracking
Following Through
POST-MORTEM ACTION TRACKING:
─────────────────────────────────────────────────────────────

INCIDENT: Checkout Outage (Jan 15)

ACTION ITEMS:

✔️ Add circuit breaker to payment service
   Owner: @alex | Due: Jan 22 | Status: Done
   Link: PR #1234

🔄 Create runbook for payment failures
   Owner: @maria | Due: Jan 25 | Status: In Progress
   Notes: 60% complete, will finish tomorrow

☐ Set up payment API monitoring
   Owner: @jordan | Due: Jan 29 | Status: Not Started

☐ Review test coverage for checkout flow
   Owner: @alex | Due: Feb 5 | Status: Not Started

TRACKING RULES:
• Review action items weekly
• Escalate if overdue
• Don't close post-mortem until all actions done
• Link actions to implementation

METRICS:
• Average time to close all actions: 14 days
• Action completion rate: 92%
• Repeat incidents: 2 (from uncompleted actions)
─────────────────────────────────────────────────────────────
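The tracking rules above (escalate if overdue, keep the post-mortem open until every action is done) are simple enough to automate in a weekly review script. The sketch below models the four sample action items. The `ActionItem` type and function names are hypothetical, and the year is an assumption because the source gives only month and day.

```python
from dataclasses import dataclass
from datetime import date

YEAR = 2024  # Assumption: the example above lists only month and day.


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    status: str  # "Done", "In Progress", or "Not Started"


ITEMS = [
    ActionItem("Add circuit breaker to payment service", "@alex",
               date(YEAR, 1, 22), "Done"),
    ActionItem("Create runbook for payment failures", "@maria",
               date(YEAR, 1, 25), "In Progress"),
    ActionItem("Set up payment API monitoring", "@jordan",
               date(YEAR, 1, 29), "Not Started"),
    ActionItem("Review test coverage for checkout flow", "@alex",
               date(YEAR, 2, 5), "Not Started"),
]


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Items past their due date that are not done: candidates to escalate."""
    return [i for i in items if i.status != "Done" and i.due < today]


def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of action items marked Done (0.0 for an empty list)."""
    if not items:
        return 0.0
    return sum(1 for i in items if i.status == "Done") / len(items)


def can_close_postmortem(items: list[ActionItem]) -> bool:
    """Per the tracking rules, close only when every action is done."""
    return all(i.status == "Done" for i in items)


if __name__ == "__main__":
    today = date(YEAR, 1, 26)
    for item in overdue(ITEMS, today):
        print(f"OVERDUE: {item.title} ({item.owner}, due {item.due})")
    print(f"Completion rate: {completion_rate(ITEMS):.0%}")
    print(f"Can close post-mortem: {can_close_postmortem(ITEMS)}")
```

Run against the sample data with a review date of Jan 26, only the runbook item is flagged as overdue, the completion rate is 25%, and the post-mortem stays open.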