Post-Mortem Analysis for Development Teams

Incidents are inevitable; failing to learn from them isn't. GitScrum helps teams document, analyze, and track post-mortem outcomes to build more resilient systems and processes.

Post-Mortem Principles

Blameless Culture

BLAMELESS VS BLAME CULTURE:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ BLAME CULTURE:                                              │
│                                                             │
│ "Who pushed the bad code?"                                 │
│ "Why didn't you test this?"                                │
│ "This is your responsibility"                              │
│                                                             │
│ RESULT:                                                     │
│ • People hide mistakes                                     │
│ • Incidents go unreported                                  │
│ • Root causes stay hidden                                  │
│ • Problems repeat                                          │
│ • Culture of fear                                          │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ BLAMELESS CULTURE:                                          │
│                                                             │
│ "What in our system allowed this to happen?"               │
│ "How can we make the right thing easy?"                    │
│ "What did we learn?"                                       │
│                                                             │
│ RESULT:                                                     │
│ • Incidents reported early                                 │
│ • Real root causes found                                   │
│ • Systems improve                                          │
│ • Problems don't repeat                                    │
│ • Culture of learning                                      │
│                                                             │
│ KEY INSIGHT:                                                │
│ "The person who made the 'mistake' is often the person    │
│ closest to the problem and best positioned to fix it."     │
└─────────────────────────────────────────────────────────────┘

When to Run Post-Mortems

POST-MORTEM TRIGGERS:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ DEFINITELY RUN POST-MORTEM:                                 │
│ • Production outage                                        │
│ • Data loss or breach                                      │
│ • Significant user impact                                  │
│ • Missed SLA                                               │
│ • Security incident                                        │
│                                                             │
│ CONSIDER POST-MORTEM:                                       │
│ • Near miss (incident narrowly avoided)                    │
│ • Minor incident with learning value                       │
│ • Process failure                                          │
│ • Significant missed deadline                              │
│                                                             │
│ LIGHTER REVIEW:                                             │
│ • Bug that reached production                              │
│ • Small process issues                                     │
│ • Learning opportunity                                     │
│                                                             │
│ SEVERITY DEFINITIONS:                                       │
│                                                             │
│ SEV-1: Critical - Full post-mortem required               │
│        System down, data loss, major impact                │
│                                                             │
│ SEV-2: Major - Post-mortem recommended                    │
│        Partial outage, significant user impact             │
│                                                             │
│ SEV-3: Minor - Brief review                               │
│        Small impact, quick recovery                        │
│                                                             │
│ SEV-4: Low - Optional                                      │
│        Minimal impact, mostly internal                     │
└─────────────────────────────────────────────────────────────┘
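
The severity definitions above can be encoded so that incident tooling applies the same review policy every time. The following is a minimal sketch, not a GitScrum feature: the `Severity` enum, the `REVIEW_POLICY` table, and `review_for` are hypothetical names chosen for illustration.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # Critical: system down, data loss, major impact
    SEV2 = 2  # Major: partial outage, significant user impact
    SEV3 = 3  # Minor: small impact, quick recovery
    SEV4 = 4  # Low: minimal impact, mostly internal


# Review level per severity, copied from the definitions above.
REVIEW_POLICY = {
    Severity.SEV1: "full post-mortem required",
    Severity.SEV2: "post-mortem recommended",
    Severity.SEV3: "brief review",
    Severity.SEV4: "optional",
}


def review_for(severity: Severity) -> str:
    """Return the review level the policy table assigns to a severity."""
    return REVIEW_POLICY[severity]
```

Keeping the policy in one table means a change to the severity definitions is a one-line edit rather than a hunt through scripts.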

Post-Mortem Process

Timeline

POST-MORTEM TIMELINE:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ INCIDENT                                                    │
│    │                                                        │
│    ▼                                                        │
│ RESPOND (Minutes to Hours)                                  │
│    • Mitigate impact                                       │
│    • Communicate status                                    │
│    • Document as you go                                    │
│    │                                                        │
│    ▼                                                        │
│ RESOLVE (Hours to Days)                                     │
│    • Fix the immediate issue                               │
│    • Verify resolution                                     │
│    • Declare incident closed                               │
│    │                                                        │
│    ▼                                                        │
│ COOL DOWN (1-3 Days)                                        │
│    • Let emotions settle                                   │
│    • Gather data and logs                                  │
│    • Draft timeline                                        │
│    │                                                        │
│    ▼                                                        │
│ POST-MORTEM (Day 3-5)                                       │
│    • Facilitate meeting                                    │
│    • Analyze root causes                                   │
│    • Document learnings                                    │
│    • Assign action items                                   │
│    │                                                        │
│    ▼                                                        │
│ FOLLOW UP (Ongoing)                                         │
│    • Track action items                                    │
│    • Verify fixes deployed                                 │
│    • Close when complete                                   │
└─────────────────────────────────────────────────────────────┘

Meeting Structure

POST-MORTEM MEETING AGENDA:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ DURATION: 45-90 minutes (based on severity)                │
│                                                             │
│ ATTENDEES:                                                  │
│ • People involved in incident response                     │
│ • Engineering leadership (for major incidents)             │
│ • Facilitator (neutral party preferred)                    │
│                                                             │
│ AGENDA:                                                     │
│                                                             │
│ 1. SET THE STAGE (5 min)                                    │
│    • Read blameless culture reminder                       │
│    • State the purpose: to learn and improve               │
│                                                             │
│ 2. TIMELINE REVIEW (15 min)                                 │
│    • Walk through what happened                            │
│    • Fill in gaps, correct errors                          │
│    • Note key decision points                              │
│                                                             │
│ 3. CONTRIBUTING FACTORS (20 min)                            │
│    • What conditions led to this?                          │
│    • Why did each factor exist?                            │
│    • Keep asking "why" (5 whys technique)                  │
│                                                             │
│ 4. WHAT WENT WELL (10 min)                                  │
│    • What worked in our response?                          │
│    • What do we want to keep doing?                        │
│                                                             │
│ 5. ACTION ITEMS (15 min)                                    │
│    • What will prevent recurrence?                         │
│    • What will improve detection?                          │
│    • What will speed up recovery?                          │
│    • Assign owners and deadlines                           │
│                                                             │
│ 6. CLOSE                                                    │
│    • Confirm action items                                  │
│    • Schedule follow-up if needed                          │
└─────────────────────────────────────────────────────────────┘

Analysis Techniques

5 Whys

5 WHYS TECHNIQUE:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ INCIDENT: Production database crashed                      │
│                                                             │
│ WHY 1: Why did the database crash?                         │
│ → Disk space ran out                                       │
│                                                             │
│ WHY 2: Why did disk space run out?                         │
│ → Log files grew unexpectedly large                        │
│                                                             │
│ WHY 3: Why did log files grow unexpectedly?               │
│ → A new feature was logging too much data                  │
│                                                             │
│ WHY 4: Why was the feature logging too much?               │
│ → Debug logging was left enabled in production             │
│                                                             │
│ WHY 5: Why was debug logging enabled in production?        │
│ → No check in CI/CD to prevent it                          │
│                                                             │
│ ROOT CAUSE: Missing deployment safeguards                  │
│                                                             │
│ ACTION ITEMS:                                               │
│ 1. Add CI check to fail on debug logging in prod          │
│ 2. Set up disk space monitoring with alerts               │
│ 3. Implement log rotation                                  │
│                                                             │
│ TIPS:                                                       │
│ • May need fewer or more than 5 whys                       │
│ • May have multiple branches of causes                     │
│ • Stop when you reach actionable root causes               │
│ • Focus on systems, not people                             │
└─────────────────────────────────────────────────────────────┘
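
Action item 1 from the example above (a CI check that fails on debug logging in production) could look something like this. It is a sketch under stated assumptions: the production config is a plain `KEY=value` file, the log level lives under a key named `LOG_LEVEL`, and the function names are hypothetical. Adapt the parsing to whatever config format your project actually uses.

```python
from pathlib import Path

# Assumption: these are the levels considered too verbose for production.
FORBIDDEN_LEVELS = {"DEBUG", "TRACE"}


def debug_violations(config_text: str, key: str = "LOG_LEVEL") -> list[str]:
    """Return the config lines that set `key` to a forbidden log level."""
    violations = []
    for line in config_text.splitlines():
        stripped = line.strip()
        # Skip blank lines, comments, and lines that are not KEY=value pairs.
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        name, _, value = stripped.partition("=")
        level = value.strip().strip("\"'").upper()
        if name.strip() == key and level in FORBIDDEN_LEVELS:
            violations.append(stripped)
    return violations


def main(path: str) -> int:
    """Print any violations and return a CI-style exit code (0 = pass)."""
    violations = debug_violations(Path(path).read_text())
    for line in violations:
        print(f"{path}: debug logging enabled in production config: {line}")
    return 1 if violations else 0
```

A CI step would call `sys.exit(main("path/to/production.env"))` (the path is a placeholder), so a non-zero exit code blocks the deployment.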

Contributing Factors

CONTRIBUTING FACTORS ANALYSIS:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ FACTOR CATEGORIES:                                          │
│                                                             │
│ TECHNICAL:                                                  │
│ • Missing monitoring                                       │
│ • No automatic recovery                                    │
│ • Single point of failure                                  │
│ • Technical debt                                           │
│                                                             │
│ PROCESS:                                                    │
│ • Insufficient testing                                     │
│ • Missing runbook                                          │
│ • Unclear ownership                                        │
│ • Communication gaps                                       │
│                                                             │
│ ENVIRONMENTAL:                                              │
│ • Time pressure                                            │
│ • Resource constraints                                     │
│ • Organizational change                                    │
│ • External dependencies                                    │
│                                                             │
│ EXAMPLE MAPPING:                                            │
│                                                             │
│ Incident: 2-hour checkout outage                           │
│                                                             │
│ Technical: No circuit breaker when payment API failed     │
│ Process: No runbook for payment failures                  │
│ Environmental: Launch deadline pressure skipped testing   │
│                                                             │
│ Multiple factors usually combine to cause incidents        │
│ Swiss cheese model: holes in several layers align          │
└─────────────────────────────────────────────────────────────┘
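
The technical factor in the example mapping above is a missing circuit breaker. As a rough illustration of the pattern (not the code behind any real payment integration), the sketch below stops calling a failing dependency after a few consecutive errors and allows a single trial call once a cool-down has passed. The class name, thresholds, and injectable `clock` parameter are all assumptions made for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are being rejected."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping the payment call as `breaker.call(charge_card, order)` (both names hypothetical) lets checkout fail fast or fall back instead of hanging while the payment API is down.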

Documentation

Post-Mortem Template

POST-MORTEM DOCUMENT TEMPLATE:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ INCIDENT: [Title]                                          │
│ DATE: [When it occurred]                                   │
│ SEVERITY: [SEV-1/2/3/4]                                    │
│ DURATION: [Total incident time]                            │
│ IMPACT: [Users affected, revenue lost, etc.]              │
│ AUTHORS: [Who wrote this]                                  │
│ STATUS: [Draft/Reviewed/Closed]                            │
│                                                             │
│ ═══════════════════════════════════════════════════════════ │
│                                                             │
│ ## Summary                                                  │
│ [2-3 sentence overview of what happened]                   │
│                                                             │
│ ## Timeline                                                 │
│ 14:32 - Alert fired for high error rate                   │
│ 14:35 - On-call acknowledged                               │
│ 14:40 - Root cause identified                              │
│ 15:15 - Fix deployed                                       │
│ 15:20 - Error rate returned to normal                      │
│                                                             │
│ ## Contributing Factors                                     │
│ [What conditions led to this incident]                     │
│                                                             │
│ ## Detection                                                │
│ [How was it discovered? Could we detect faster?]           │
│                                                             │
│ ## Response                                                 │
│ [What did we do? What went well? What was difficult?]      │
│                                                             │
│ ## Root Cause                                               │
│ [Underlying cause(s)]                                      │
│                                                             │
│ ## Action Items                                             │
│ | Action | Owner | Due | Status |                          │
│ |--------|-------|-----|--------|                          │
│ | ...    | ...   | ... | ...    |                          │
│                                                             │
│ ## Lessons Learned                                          │
│ [Key takeaways for the organization]                       │
└─────────────────────────────────────────────────────────────┘
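
The sample timeline in the template is enough to derive basic response metrics for the Detection and Response sections. The sketch below parses the `HH:MM - event` lines and computes the minutes between key points. It assumes the whole incident falls within one calendar day; an incident that crosses midnight would need full timestamps. The function and variable names are illustrative.

```python
from datetime import datetime

TIMELINE = """\
14:32 - Alert fired for high error rate
14:35 - On-call acknowledged
14:40 - Root cause identified
15:15 - Fix deployed
15:20 - Error rate returned to normal
"""


def parse_timeline(text):
    """Turn 'HH:MM - event' lines into (time, event) pairs."""
    entries = []
    for line in text.strip().splitlines():
        stamp, _, event = line.partition(" - ")
        entries.append((datetime.strptime(stamp.strip(), "%H:%M"), event.strip()))
    return entries


def minutes_between(entries, start_index, end_index):
    """Whole minutes elapsed between two timeline entries."""
    delta = entries[end_index][0] - entries[start_index][0]
    return int(delta.total_seconds() // 60)


entries = parse_timeline(TIMELINE)
time_to_acknowledge = minutes_between(entries, 0, 1)  # alert -> acknowledged: 3
time_to_fix = minutes_between(entries, 0, 3)          # alert -> fix deployed: 43
time_to_recover = minutes_between(entries, 0, -1)     # alert -> back to normal: 48
```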

Action Tracking

Following Through

POST-MORTEM ACTION TRACKING:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ INCIDENT: Checkout Outage (Jan 15)                         │
│                                                             │
│ ACTION ITEMS:                                               │
│                                                             │
│ ☑️ Add circuit breaker to payment service                  │
│    Owner: @alex | Due: Jan 22 | Status: Done              │
│    Link: PR #1234                                          │
│                                                             │
│ 🔄 Create runbook for payment failures                     │
│    Owner: @maria | Due: Jan 25 | Status: In Progress      │
│    Notes: 60% complete, will finish tomorrow               │
│                                                             │
│ ☐ Set up payment API monitoring                            │
│    Owner: @jordan | Due: Jan 29 | Status: Not Started     │
│                                                             │
│ ☐ Review test coverage for checkout flow                   │
│    Owner: @alex | Due: Feb 5 | Status: Not Started        │
│                                                             │
│ ═══════════════════════════════════════════════════════════ │
│                                                             │
│ TRACKING RULES:                                             │
│ • Review action items weekly                               │
│ • Escalate if overdue                                      │
│ • Don't close post-mortem until all actions done          │
│ • Link actions to implementation                           │
│                                                             │
│ METRICS:                                                    │
│ • Average time to close all actions: 14 days              │
│ • Action completion rate: 92%                              │
│ • Repeat incidents: 2 (from uncompleted actions)          │
└─────────────────────────────────────────────────────────────┘
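
The tracking rules above are simple enough to check automatically during the weekly review. The sketch below models the example action items and flags overdue work. It is an illustration only and does not use any GitScrum API: the `ActionItem` class and helper names are hypothetical, and the year is an assumption because the example gives only month and day.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    status: str  # "Done", "In Progress", or "Not Started"


YEAR = 2024  # assumption: the example above gives only month and day

items = [
    ActionItem("Add circuit breaker to payment service", "@alex", date(YEAR, 1, 22), "Done"),
    ActionItem("Create runbook for payment failures", "@maria", date(YEAR, 1, 25), "In Progress"),
    ActionItem("Set up payment API monitoring", "@jordan", date(YEAR, 1, 29), "Not Started"),
    ActionItem("Review test coverage for checkout flow", "@alex", date(YEAR, 2, 5), "Not Started"),
]


def overdue(items, today):
    """Items past their due date that are not done: candidates for escalation."""
    return [item for item in items if item.status != "Done" and item.due < today]


def completion_rate(items):
    """Fraction of action items marked Done."""
    return sum(item.status == "Done" for item in items) / len(items)


def can_close_post_mortem(items):
    """Rule from above: close the post-mortem only when every action is done."""
    return all(item.status == "Done" for item in items)
```

Running `overdue(items, date.today())` before each weekly review gives the escalation list without anyone scanning the board by hand.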