Post-Mortem Analysis for Teams | Blameless Incident Review

Run blameless post-mortems that find systemic issues. Use 5 whys analysis, document contributing factors, and track action items to prevent repeat incidents.

Incidents are inevitable; failing to learn from them isn't. GitScrum helps teams document, analyze, and track post-mortem outcomes to build more resilient systems and processes.

Post-Mortem Principles

Blameless Culture

BLAMELESS VS BLAME CULTURE:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ BLAME CULTURE:                                              │
│                                                             │
│ "Who pushed the bad code?"                                  │
│ "Why didn't you test this?"                                 │
│ "This is your responsibility"                               │
│                                                             │
│ RESULT:                                                     │
│ • People hide mistakes                                      │
│ • Incidents go unreported                                   │
│ • Root causes stay hidden                                   │
│ • Problems repeat                                           │
│ • Culture of fear                                           │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ BLAMELESS CULTURE:                                          │
│                                                             │
│ "What in our system allowed this to happen?"                │
│ "How can we make the right thing easy?"                     │
│ "What did we learn?"                                        │
│                                                             │
│ RESULT:                                                     │
│ • Incidents reported early                                  │
│ • Real root causes found                                    │
│ • Systems improve                                           │
│ • Problems don't repeat                                     │
│ • Culture of learning                                       │
│                                                             │
│ KEY INSIGHT:                                                │
│ "The person who made the 'mistake' is often the person      │
│ closest to the problem and best positioned to fix it."      │
└─────────────────────────────────────────────────────────────┘

When to Run Post-Mortems

POST-MORTEM TRIGGERS:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ DEFINITELY RUN POST-MORTEM:                                 │
│ • Production outage                                         │
│ • Data loss or breach                                       │
│ • Significant user impact                                   │
│ • Missed SLA                                                │
│ • Security incident                                         │
│                                                             │
│ CONSIDER POST-MORTEM:                                       │
│ • Near miss (almost had an incident)                        │
│ • Minor incident with learning value                        │
│ • Process failure                                           │
│ • Significant missed deadline                               │
│                                                             │
│ LIGHTER REVIEW:                                             │
│ • Bug that reached production                               │
│ • Small process issues                                      │
│ • Learning opportunity                                      │
│                                                             │
│ SEVERITY DEFINITIONS:                                       │
│                                                             │
│ SEV-1: Critical - Full post-mortem required                 │
│        System down, data loss, major impact                 │
│                                                             │
│ SEV-2: Major - Post-mortem recommended                      │
│        Partial outage, significant user impact              │
│                                                             │
│ SEV-3: Minor - Brief review                                 │
│        Small impact, quick recovery                         │
│                                                             │
│ SEV-4: Low - Optional                                       │
│        Minimal impact, mostly internal                      │
└─────────────────────────────────────────────────────────────┘
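
The severity definitions above can be encoded so that incident tooling suggests the matching review level. The following is a minimal Python sketch; the `Severity` enum, `REVIEW_POLICY` table, and `review_for` function are hypothetical names chosen for illustration and are not part of GitScrum.

```python
from enum import Enum


class Severity(Enum):
    """Severity levels as listed in the definitions above."""
    SEV1 = 1  # Critical: system down, data loss, major impact
    SEV2 = 2  # Major: partial outage, significant user impact
    SEV3 = 3  # Minor: small impact, quick recovery
    SEV4 = 4  # Low: minimal impact, mostly internal


# Review level per severity, copied from the definitions above.
REVIEW_POLICY = {
    Severity.SEV1: "Full post-mortem required",
    Severity.SEV2: "Post-mortem recommended",
    Severity.SEV3: "Brief review",
    Severity.SEV4: "Optional",
}


def review_for(severity: Severity) -> str:
    """Return the review level the policy assigns to a severity."""
    return REVIEW_POLICY[severity]


print(review_for(Severity.SEV2))  # prints: Post-mortem recommended
```

Keeping the policy in one table makes it easy to adjust when a team decides, for example, that SEV-2 incidents should always get a full post-mortem.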

Post-Mortem Process

Timeline

POST-MORTEM TIMELINE:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ INCIDENT                                                    │
│    │                                                        │
│    ▼                                                        │
│ RESPOND (Minutes to Hours)                                  │
│    • Mitigate impact                                        │
│    • Communicate status                                     │
│    • Document as you go                                     │
│    │                                                        │
│    ▼                                                        │
│ RESOLVE (Hours to Days)                                     │
│    • Fix the immediate issue                                │
│    • Verify resolution                                      │
│    • Declare incident closed                                │
│    │                                                        │
│    ▼                                                        │
│ COOL DOWN (1-3 Days)                                        │
│    • Let emotions settle                                    │
│    • Gather data and logs                                   │
│    • Draft timeline                                         │
│    │                                                        │
│    ▼                                                        │
│ POST-MORTEM (Day 3-5)                                       │
│    • Facilitate meeting                                     │
│    • Analyze root causes                                    │
│    • Document learnings                                     │
│    • Assign action items                                    │
│    │                                                        │
│    ▼                                                        │
│ FOLLOW UP (Ongoing)                                         │
│    • Track action items                                     │
│    • Verify fixes deployed                                  │
│    • Close when complete                                    │
└─────────────────────────────────────────────────────────────┘
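
The "Day 3-5" window in the timeline above can be turned into concrete calendar dates. This is a rough Python sketch under two assumptions that the timeline does not state: days are counted from the date the incident was declared closed, and they are calendar days rather than business days. The function name `postmortem_window` is hypothetical.

```python
from datetime import date, timedelta


def postmortem_window(closed_on: date) -> tuple[date, date]:
    """Return the earliest and latest suggested meeting dates.

    Assumption: "Day 3-5" is counted in calendar days from the date
    the incident was declared closed.
    """
    return closed_on + timedelta(days=3), closed_on + timedelta(days=5)


# Example date chosen for illustration only.
start, end = postmortem_window(date(2024, 1, 15))
print(f"Hold the post-mortem between {start} and {end}")
```

A team that counts from the start of the incident, or that skips weekends, would need to adjust the offsets accordingly.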

Meeting Structure

POST-MORTEM MEETING AGENDA:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ DURATION: 45-90 minutes (based on severity)                 │
│                                                             │
│ ATTENDEES:                                                  │
│ • People involved in incident response                      │
│ • Engineering leadership (for major incidents)              │
│ • Facilitator (neutral party preferred)                     │
│                                                             │
│ AGENDA:                                                     │
│                                                             │
│ 1. SET THE STAGE (5 min)                                    │
│    • Read blameless culture reminder                        │
│    • State the purpose: to learn and improve                │
│                                                             │
│ 2. TIMELINE REVIEW (15 min)                                 │
│    • Walk through what happened                             │
│    • Fill in gaps, correct errors                           │
│    • Note key decision points                               │
│                                                             │
│ 3. CONTRIBUTING FACTORS (20 min)                            │
│    • What conditions led to this?                           │
│    • Why did each factor exist?                             │
│    • Keep asking "why" (5 whys technique)                   │
│                                                             │
│ 4. WHAT WENT WELL (10 min)                                  │
│    • What worked in our response?                           │
│    • What do we want to keep doing?                         │
│                                                             │
│ 5. ACTION ITEMS (15 min)                                    │
│    • What will prevent recurrence?                          │
│    • What will improve detection?                           │
│    • What will speed up recovery?                           │
│    • Assign owners and deadlines                            │
│                                                             │
│ 6. CLOSE                                                    │
│    • Confirm action items                                   │
│    • Schedule follow-up if needed                           │
└─────────────────────────────────────────────────────────────┘
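
As a quick sanity check, the timeboxes in the agenda above can be summed and compared with the stated 45-90 minute duration. The sketch below is illustrative Python; the names `AGENDA`, `total_minutes`, and `fits_duration` are hypothetical, and the closing step is recorded as 0 minutes because the agenda gives it no timebox.

```python
# Timeboxes in minutes, copied from the agenda above. The closing step
# has no stated timebox, so it is recorded as 0 here (an assumption).
AGENDA = [
    ("Set the stage", 5),
    ("Timeline review", 15),
    ("Contributing factors", 20),
    ("What went well", 10),
    ("Action items", 15),
    ("Close", 0),
]

MIN_DURATION, MAX_DURATION = 45, 90  # stated meeting duration range


def total_minutes(agenda):
    """Sum the timeboxes of all agenda items."""
    return sum(minutes for _, minutes in agenda)


def fits_duration(agenda):
    """Check that the agenda fits inside the stated duration range."""
    return MIN_DURATION <= total_minutes(agenda) <= MAX_DURATION


print(total_minutes(AGENDA), fits_duration(AGENDA))  # prints: 65 True
```

The default timeboxes add up to 65 minutes, which leaves up to 25 minutes of slack for a severe incident before the meeting exceeds the 90-minute ceiling.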

Analysis Techniques

5 Whys

5 WHYS TECHNIQUE:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ INCIDENT: Production database crashed                       │
│                                                             │
│ WHY 1: Why did the database crash?                          │
│ → Disk space ran out                                        │
│                                                             │
│ WHY 2: Why did disk space run out?                          │
│ → Log files grew unexpectedly large                         │
│                                                             │
│ WHY 3: Why did log files grow unexpectedly?                 │
│ → A new feature was logging too much data                   │
│                                                             │
│ WHY 4: Why was the feature logging too much?                │
│ → Debug logging was left enabled in production              │
│                                                             │
│ WHY 5: Why was debug logging enabled in production?         │
│ → No check in CI/CD to prevent it                           │
│                                                             │
│ ROOT CAUSE: Missing deployment safeguards                   │
│                                                             │
│ ACTION ITEMS:                                               │
│ 1. Add CI check to fail on debug logging in prod            │
│ 2. Set up disk space monitoring with alerts                 │
│ 3. Implement log rotation                                   │
│                                                             │
│ TIPS:                                                       │
│ • May need fewer or more than 5 whys                        │
│ • May have multiple branches of causes                      │
│ • Stop when you reach actionable root causes                │
│ • Focus on systems, not people                              │
└─────────────────────────────────────────────────────────────┘
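
The chain above can also be recorded as structured data, which makes it easier to store next to the incident and revisit later. This is a minimal Python sketch, not a GitScrum feature: the `Why` dataclass and `deepest_cause` function are hypothetical names, and the sketch models only a single linear chain, not the branching causes mentioned in the tips.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One step in a 5 Whys chain: a question and its answer."""
    question: str
    answer: str


def deepest_cause(chain: list[Why]) -> str:
    """Return the last answer in the chain, the deepest cause reached so far."""
    if not chain:
        raise ValueError("the chain needs at least one why")
    return chain[-1].answer


# Questions and answers copied from the database crash example above.
chain = [
    Why("Why did the database crash?", "Disk space ran out"),
    Why("Why did disk space run out?", "Log files grew unexpectedly large"),
    Why("Why did log files grow unexpectedly?",
        "A new feature was logging too much data"),
    Why("Why was the feature logging too much?",
        "Debug logging was left enabled in production"),
    Why("Why was debug logging enabled in production?",
        "No check in CI/CD to prevent it"),
]

for depth, why in enumerate(chain, start=1):
    print(f"WHY {depth}: {why.question} -> {why.answer}")
print("Deepest cause:", deepest_cause(chain))
```

The last answer is only a candidate root cause. The team still decides whether it is actionable or whether another "why" is needed.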

Contributing Factors

CONTRIBUTING FACTORS ANALYSIS:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ FACTOR CATEGORIES:                                          │
│                                                             │
│ TECHNICAL:                                                  │
│ • Missing monitoring                                        │
│ • No automatic recovery                                     │
│ • Single point of failure                                   │
│ • Technical debt                                            │
│                                                             │
│ PROCESS:                                                    │
│ • Insufficient testing                                      │
│ • Missing runbook                                           │
│ • Unclear ownership                                         │
│ • Communication gaps                                        │
│                                                             │
│ ENVIRONMENTAL:                                              │
│ • Time pressure                                             │
│ • Resource constraints                                      │
│ • Organizational change                                     │
│ • External dependencies                                     │
│                                                             │
│ EXAMPLE MAPPING:                                            │
│                                                             │
│ Incident: 2-hour checkout outage                            │
│                                                             │
│ Technical: No circuit breaker when payment API failed       │
│ Process: No runbook for payment failures                    │
│ Environmental: Deadline pressure led to skipped testing     │
│                                                             │
│ Multiple factors usually combine to cause incidents         │
│ Swiss cheese model: Multiple holes had to align             │
└─────────────────────────────────────────────────────────────┘
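
One way to apply the category mapping above is to group the factors found during analysis and flag any category the team has not yet examined. The sketch below is illustrative Python; `CATEGORIES`, `group_factors`, and `uncovered_categories` are hypothetical names, and the example data is the checkout outage from the box above.

```python
from collections import defaultdict

# The three factor categories named above.
CATEGORIES = ("Technical", "Process", "Environmental")


def group_factors(factors):
    """Group (category, description) pairs by category."""
    grouped = defaultdict(list)
    for category, description in factors:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        grouped[category].append(description)
    return dict(grouped)


def uncovered_categories(grouped):
    """List categories with no recorded factor, as a prompt to look again."""
    return [c for c in CATEGORIES if c not in grouped]


# Factors from the 2-hour checkout outage example above.
factors = [
    ("Technical", "No circuit breaker when payment API failed"),
    ("Process", "No runbook for payment failures"),
    ("Environmental", "Deadline pressure led to skipped testing"),
]

grouped = group_factors(factors)
print(grouped)
print("Not yet examined:", uncovered_categories(grouped))
```

An empty category does not prove that nothing went wrong there. It only signals that the analysis has not looked in that direction yet.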

Documentation

Post-Mortem Template

POST-MORTEM DOCUMENT TEMPLATE:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ INCIDENT: [Title]                                           │
│ DATE: [When it occurred]                                    │
│ SEVERITY: [SEV-1/2/3/4]                                     │
│ DURATION: [Total incident time]                             │
│ IMPACT: [Users affected, revenue lost, etc.]                │
│ AUTHORS: [Who wrote this]                                   │
│ STATUS: [Draft/Reviewed/Closed]                             │
│                                                             │
│ ═══════════════════════════════════════════════════════════ │
│                                                             │
│ ## Summary                                                  │
│ [2-3 sentence overview of what happened]                    │
│                                                             │
│ ## Timeline                                                 │
│ 14:32 - Alert fired for high error rate                     │
│ 14:35 - On-call acknowledged                                │
│ 14:40 - Root cause identified                               │
│ 15:15 - Fix deployed                                        │
│ 15:20 - Error rate returned to normal                       │
│                                                             │
│ ## Contributing Factors                                     │
│ [What conditions led to this incident]                      │
│                                                             │
│ ## Detection                                                │
│ [How was it discovered? Could we detect faster?]            │
│                                                             │
│ ## Response                                                 │
│ [What did we do? What went well? What was difficult?]       │
│                                                             │
│ ## Root Cause                                               │
│ [Underlying cause(s)]                                       │
│                                                             │
│ ## Action Items                                             │
│ | Action | Owner | Due | Status |                           │
│ |--------|-------|-----|--------|                           │
│ | ...    | ...   | ... | ...    |                           │
│                                                             │
│ ## Lessons Learned                                          │
│ [Key takeaways for the organization]                        │
└─────────────────────────────────────────────────────────────┘
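
The DURATION field and the Detection section of the template can be filled in directly from the timeline entries. The following Python sketch derives two figures from the sample timeline above. It assumes all timestamps fall on the same day and use a 24-hour clock; `TIMELINE` and `minutes_between` are hypothetical names.

```python
from datetime import datetime

# Timeline entries copied from the template above (24-hour clock, same day).
TIMELINE = [
    ("14:32", "Alert fired for high error rate"),
    ("14:35", "On-call acknowledged"),
    ("14:40", "Root cause identified"),
    ("15:15", "Fix deployed"),
    ("15:20", "Error rate returned to normal"),
]


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, assuming both fall on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


first, last = TIMELINE[0][0], TIMELINE[-1][0]
print("Time to acknowledge:", minutes_between(first, TIMELINE[1][0]), "min")  # 3
print("Total duration:", minutes_between(first, last), "min")  # 48
```

For incidents that cross midnight, full dates would be needed in the timeline entries.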

Action Tracking

Following Through

POST-MORTEM ACTION TRACKING:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ INCIDENT: Checkout Outage (Jan 15)                          │
│                                                             │
│ ACTION ITEMS:                                               │
│                                                             │
│ ☑️ Add circuit breaker to payment service                   │
│    Owner: @alex | Due: Jan 22 | Status: Done                │
│    Link: PR #1234                                           │
│                                                             │
│ 🔄 Create runbook for payment failures                      │
│    Owner: @maria | Due: Jan 25 | Status: In Progress        │
│    Notes: 60% complete, will finish tomorrow                │
│                                                             │
│ ☐ Set up payment API monitoring                             │
│    Owner: @jordan | Due: Jan 29 | Status: Not Started       │
│                                                             │
│ ☐ Review test coverage for checkout flow                    │
│    Owner: @alex | Due: Feb 5 | Status: Not Started          │
│                                                             │
│ ═══════════════════════════════════════════════════════════ │
│                                                             │
│ TRACKING RULES:                                             │
│ • Review action items weekly                                │
│ • Escalate if overdue                                       │
│ • Don't close post-mortem until all actions done            │
│ • Link actions to implementation                            │
│                                                             │
│ METRICS:                                                    │
│ • Average time to close all actions: 14 days                │
│ • Action completion rate: 92%                               │
│ • Repeat incidents: 2 (from uncompleted actions)            │
└─────────────────────────────────────────────────────────────┘
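
The tracking rules above (escalate overdue items, keep the post-mortem open until every action is done) are straightforward to automate. Below is a minimal Python sketch of those rules, not a GitScrum API. The `ActionItem` class and the three helper functions are hypothetical, and the year 2024 is an assumption because the example gives only month and day.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    status: str  # "Done", "In Progress", or "Not Started"


def completion_rate(items):
    """Fraction of action items marked Done (0.0 for an empty list)."""
    if not items:
        return 0.0
    return sum(1 for item in items if item.status == "Done") / len(items)


def overdue(items, today):
    """Unfinished items whose due date has passed: candidates to escalate."""
    return [i for i in items if i.status != "Done" and i.due < today]


def can_close(items):
    """The post-mortem may close only when every action is Done."""
    return all(item.status == "Done" for item in items)


# Items from the checkout outage example above; the year is assumed.
items = [
    ActionItem("Add circuit breaker to payment service", "@alex",
               date(2024, 1, 22), "Done"),
    ActionItem("Create runbook for payment failures", "@maria",
               date(2024, 1, 25), "In Progress"),
    ActionItem("Set up payment API monitoring", "@jordan",
               date(2024, 1, 29), "Not Started"),
    ActionItem("Review test coverage for checkout flow", "@alex",
               date(2024, 2, 5), "Not Started"),
]

today = date(2024, 1, 27)  # hypothetical review date
print("Completion rate:", completion_rate(items))
print("Escalate:", [i.title for i in overdue(items, today)])
print("Can close:", can_close(items))
```

Running a check like this during the weekly action-item review gives the team the same completion and overdue figures that the METRICS block summarizes.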
