Post-Mortem Analysis for Development Teams
Incidents are inevitable; failing to learn from them isn't. GitScrum helps teams document, analyze, and track post-mortem outcomes to build more resilient systems and processes.
Post-Mortem Principles
Blameless Culture
BLAMELESS VS BLAME CULTURE:
┌─────────────────────────────────────────────────────────────┐
│ │
│ BLAME CULTURE: │
│ │
│ "Who pushed the bad code?" │
│ "Why didn't you test this?" │
│ "This is your responsibility" │
│ │
│ RESULT: │
│ • People hide mistakes │
│ • Incidents go unreported │
│ • Root causes stay hidden │
│ • Problems repeat │
│ • Culture of fear │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ BLAMELESS CULTURE: │
│ │
│ "What in our system allowed this to happen?" │
│ "How can we make the right thing easy?" │
│ "What did we learn?" │
│ │
│ RESULT: │
│ • Incidents reported early │
│ • Real root causes found │
│ • Systems improve │
│ • Problems don't repeat │
│ • Culture of learning │
│ │
│ KEY INSIGHT: │
│ "The person who made the 'mistake' is often the person │
│ closest to the problem and best positioned to fix it." │
└─────────────────────────────────────────────────────────────┘
When to Run Post-Mortems
POST-MORTEM TRIGGERS:
┌─────────────────────────────────────────────────────────────┐
│ │
│ DEFINITELY RUN POST-MORTEM: │
│ • Production outage │
│ • Data loss or breach │
│ • Significant user impact │
│ • Missed SLA │
│ • Security incident │
│ │
│ CONSIDER POST-MORTEM: │
│ • Near miss (an incident was narrowly avoided) │
│ • Minor incident with learning value │
│ • Process failure │
│ • Significant missed deadline │
│ │
│ LIGHTER REVIEW: │
│ • Bug that reached production │
│ • Small process issues │
│ • Learning opportunity │
│ │
│ SEVERITY DEFINITIONS: │
│ │
│ SEV-1: Critical - Full post-mortem required │
│ System down, data loss, major impact │
│ │
│ SEV-2: Major - Post-mortem recommended │
│ Partial outage, significant user impact │
│ │
│ SEV-3: Minor - Brief review │
│ Small impact, quick recovery │
│ │
│ SEV-4: Low - Optional │
│ Minimal impact, mostly internal │
└─────────────────────────────────────────────────────────────┘
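The severity definitions above amount to a small lookup table. A minimal sketch of how a team might encode them follows; the names `Severity`, `REVIEW_POLICY`, and `review_for` are illustrative assumptions, not part of GitScrum.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # Critical: system down, data loss, major impact
    SEV2 = 2  # Major: partial outage, significant user impact
    SEV3 = 3  # Minor: small impact, quick recovery
    SEV4 = 4  # Low: minimal impact, mostly internal


# Review level per severity, copied from the definitions above.
REVIEW_POLICY = {
    Severity.SEV1: "full post-mortem required",
    Severity.SEV2: "post-mortem recommended",
    Severity.SEV3: "brief review",
    Severity.SEV4: "optional",
}


def review_for(severity: Severity) -> str:
    """Return the review level the guide assigns to a severity."""
    return REVIEW_POLICY[severity]


print(review_for(Severity.SEV2))  # post-mortem recommended
```

Keeping the policy in one table makes it easy to adjust if your team draws the lines differently.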
Post-Mortem Process
Timeline
POST-MORTEM TIMELINE:
┌─────────────────────────────────────────────────────────────┐
│ │
│ INCIDENT │
│ │ │
│ ▼ │
│ RESPOND (Minutes to Hours) │
│ • Mitigate impact │
│ • Communicate status │
│ • Document as you go │
│ │ │
│ ▼ │
│ RESOLVE (Hours to Days) │
│ • Fix the immediate issue │
│ • Verify resolution │
│ • Declare incident closed │
│ │ │
│ ▼ │
│ COOL DOWN (1-3 Days) │
│ • Let emotions settle │
│ • Gather data and logs │
│ • Draft timeline │
│ │ │
│ ▼ │
│ POST-MORTEM (Day 3-5) │
│ • Facilitate meeting │
│ • Analyze root causes │
│ • Document learnings │
│ • Assign action items │
│ │ │
│ ▼ │
│ FOLLOW UP (Ongoing) │
│ • Track action items │
│ • Verify fixes deployed │
│ • Close when complete │
└─────────────────────────────────────────────────────────────┘
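To turn the "Day 3-5" guideline into calendar dates, a rough sketch is shown below. It assumes the days are counted as calendar days from the date the incident was declared closed, which the timeline does not state explicitly. The function name `post_mortem_window` and the year in the example are hypothetical.

```python
from datetime import date, timedelta


def post_mortem_window(closed_on: date) -> tuple[date, date]:
    """Earliest and latest meeting dates: day 3 to day 5 after closure.

    Assumption: calendar days counted from the closure date; weekends
    and holidays are not skipped.
    """
    return closed_on + timedelta(days=3), closed_on + timedelta(days=5)


# Hypothetical example: incident closed on Jan 15 of an arbitrary year.
earliest, latest = post_mortem_window(date(2024, 1, 15))
print(earliest, latest)  # 2024-01-18 2024-01-20
```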
Meeting Structure
POST-MORTEM MEETING AGENDA:
┌─────────────────────────────────────────────────────────────┐
│ │
│ DURATION: 45-90 minutes (based on severity) │
│ │
│ ATTENDEES: │
│ • People involved in incident response │
│ • Engineering leadership (for major incidents) │
│ • Facilitator (neutral party preferred) │
│ │
│ AGENDA: │
│ │
│ 1. SET THE STAGE (5 min) │
│ • Read blameless culture reminder │
│ • State the purpose: to learn and improve │
│ │
│ 2. TIMELINE REVIEW (15 min) │
│ • Walk through what happened │
│ • Fill in gaps, correct errors │
│ • Note key decision points │
│ │
│ 3. CONTRIBUTING FACTORS (20 min) │
│ • What conditions led to this? │
│ • Why did each factor exist? │
│ • Keep asking "why" (5 whys technique) │
│ │
│ 4. WHAT WENT WELL (10 min) │
│ • What worked in our response? │
│ • What do we want to keep doing? │
│ │
│ 5. ACTION ITEMS (15 min) │
│ • What will prevent recurrence? │
│ • What will improve detection? │
│ • What will speed up recovery? │
│ • Assign owners and deadlines │
│ │
│ 6. CLOSE │
│ • Confirm action items │
│ • Schedule follow-up if needed │
└─────────────────────────────────────────────────────────────┘
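As a quick sanity check, the timeboxes in the agenda above can be summed and compared with the stated 45-90 minute range. This is only an arithmetic sketch. The `AGENDA` list mirrors the five timed items, and the untimed "Close" step is left out.

```python
# Timed agenda items (minutes), taken from the agenda above.
AGENDA = [
    ("Set the stage", 5),
    ("Timeline review", 15),
    ("Contributing factors", 20),
    ("What went well", 10),
    ("Action items", 15),
]

MIN_DURATION, MAX_DURATION = 45, 90  # stated meeting range in minutes

total = sum(minutes for _, minutes in AGENDA)
fits = MIN_DURATION <= total <= MAX_DURATION

print(total)  # 65
print(fits)   # True
```

The default agenda uses 65 minutes, which leaves up to 25 minutes of slack for the close and for longer discussion in a severe incident.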
Analysis Techniques
5 Whys
5 WHYS TECHNIQUE:
┌─────────────────────────────────────────────────────────────┐
│ │
│ INCIDENT: Production database crashed │
│ │
│ WHY 1: Why did the database crash? │
│ → Disk space ran out │
│ │
│ WHY 2: Why did disk space run out? │
│ → Log files grew unexpectedly large │
│ │
│ WHY 3: Why did log files grow unexpectedly? │
│ → A new feature was logging too much data │
│ │
│ WHY 4: Why was the feature logging too much? │
│ → Debug logging was left enabled in production │
│ │
│ WHY 5: Why was debug logging enabled in production? │
│ → No check in CI/CD to prevent it │
│ │
│ ROOT CAUSE: Missing deployment safeguards │
│ │
│ ACTION ITEMS: │
│ 1. Add CI check to fail on debug logging in prod │
│ 2. Set up disk space monitoring with alerts │
│ 3. Implement log rotation │
│ │
│ TIPS: │
│ • May need fewer or more than 5 whys │
│ • May have multiple branches of causes │
│ • Stop when you reach actionable root causes │
│ • Focus on systems, not people │
└─────────────────────────────────────────────────────────────┘
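The chain of questions above can be recorded as a simple data structure so that it lands in the post-mortem document in a consistent shape. The sketch below is one possible representation; the class names `Why` and `FiveWhys` are assumptions made for illustration, and it models a single linear chain, not the branching case mentioned in the tips.

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    question: str
    answer: str


@dataclass
class FiveWhys:
    incident: str
    chain: list[Why] = field(default_factory=list)

    def ask(self, question: str, answer: str) -> "FiveWhys":
        """Append one question/answer pair and return self for chaining."""
        self.chain.append(Why(question, answer))
        return self

    def deepest_answer(self) -> str:
        """The last answer in the chain, i.e. the candidate root cause."""
        return self.chain[-1].answer


analysis = (
    FiveWhys("Production database crashed")
    .ask("Why did the database crash?", "Disk space ran out")
    .ask("Why did disk space run out?", "Log files grew unexpectedly large")
    .ask("Why did log files grow unexpectedly?",
         "A new feature was logging too much data")
    .ask("Why was the feature logging too much?",
         "Debug logging was left enabled in production")
    .ask("Why was debug logging enabled in production?",
         "No check in CI/CD to prevent it")
)

print(analysis.deepest_answer())  # No check in CI/CD to prevent it
```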
Contributing Factors
CONTRIBUTING FACTORS ANALYSIS:
┌─────────────────────────────────────────────────────────────┐
│ │
│ FACTOR CATEGORIES: │
│ │
│ TECHNICAL: │
│ • Missing monitoring │
│ • No automatic recovery │
│ • Single point of failure │
│ • Technical debt │
│ │
│ PROCESS: │
│ • Insufficient testing │
│ • Missing runbook │
│ • Unclear ownership │
│ • Communication gaps │
│ │
│ ENVIRONMENTAL: │
│ • Time pressure │
│ • Resource constraints │
│ • Organizational change │
│ • External dependencies │
│ │
│ EXAMPLE MAPPING: │
│ │
│ Incident: 2-hour checkout outage │
│ │
│ Technical: No circuit breaker when payment API failed │
│ Process: No runbook for payment failures │
│ Environmental: Testing skipped under launch deadline pressure │
│ │
│ Multiple factors usually combine to cause incidents │
│ Swiss cheese model: Multiple holes had to align │
└─────────────────────────────────────────────────────────────┘
Documentation
Post-Mortem Template
POST-MORTEM DOCUMENT TEMPLATE:
┌─────────────────────────────────────────────────────────────┐
│ │
│ INCIDENT: [Title] │
│ DATE: [When it occurred] │
│ SEVERITY: [SEV-1/2/3/4] │
│ DURATION: [Total incident time] │
│ IMPACT: [Users affected, revenue lost, etc.] │
│ AUTHORS: [Who wrote this] │
│ STATUS: [Draft/Reviewed/Closed] │
│ │
│ ═══════════════════════════════════════════════════════════ │
│ │
│ ## Summary │
│ [2-3 sentence overview of what happened] │
│ │
│ ## Timeline │
│ 14:32 - Alert fired for high error rate │
│ 14:35 - On-call acknowledged │
│ 14:40 - Root cause identified │
│ 15:15 - Fix deployed │
│ 15:20 - Error rate returned to normal │
│ │
│ ## Contributing Factors │
│ [What conditions led to this incident] │
│ │
│ ## Detection │
│ [How was it discovered? Could we detect faster?] │
│ │
│ ## Response │
│ [What did we do? What went well? What was difficult?] │
│ │
│ ## Root Cause │
│ [Underlying cause(s)] │
│ │
│ ## Action Items │
│ | Action | Owner | Due | Status | │
│ |--------|-------|-----|--------| │
│ | ... | ... | ... | ... | │
│ │
│ ## Lessons Learned │
│ [Key takeaways for the organization] │
└─────────────────────────────────────────────────────────────┘
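The sample timeline in the template is enough to derive basic response metrics such as time to acknowledge and time to recovery. A minimal sketch follows, assuming all timestamps fall on the same day and use the `HH:MM` format shown in the template. The helper name `minutes_since_first` is hypothetical.

```python
from datetime import datetime

# Events copied from the sample timeline in the template.
TIMELINE = [
    ("14:32", "Alert fired for high error rate"),
    ("14:35", "On-call acknowledged"),
    ("14:40", "Root cause identified"),
    ("15:15", "Fix deployed"),
    ("15:20", "Error rate returned to normal"),
]


def minutes_since_first(timeline):
    """Minutes elapsed from the first event to each event (same-day times)."""
    start = datetime.strptime(timeline[0][0], "%H:%M")
    result = []
    for timestamp, event in timeline:
        delta = datetime.strptime(timestamp, "%H:%M") - start
        result.append((int(delta.total_seconds() // 60), event))
    return result


elapsed = minutes_since_first(TIMELINE)
print([minutes for minutes, _ in elapsed])  # [0, 3, 8, 43, 48]
```

In this example the alert was acknowledged after 3 minutes, the root cause was identified after 8, and the error rate was back to normal 48 minutes after the alert fired.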
Action Tracking
Following Through
POST-MORTEM ACTION TRACKING:
┌─────────────────────────────────────────────────────────────┐
│ │
│ INCIDENT: Checkout Outage (Jan 15) │
│ │
│ ACTION ITEMS: │
│ │
│ ☑️ Add circuit breaker to payment service │
│ Owner: @alex | Due: Jan 22 | Status: Done │
│ Link: PR #1234 │
│ │
│ 🔄 Create runbook for payment failures │
│ Owner: @maria | Due: Jan 25 | Status: In Progress │
│ Notes: 60% complete, will finish tomorrow │
│ │
│ ☐ Set up payment API monitoring │
│ Owner: @jordan | Due: Jan 29 | Status: Not Started │
│ │
│ ☐ Review test coverage for checkout flow │
│ Owner: @alex | Due: Feb 5 | Status: Not Started │
│ │
│ ═══════════════════════════════════════════════════════════ │
│ │
│ TRACKING RULES: │
│ • Review action items weekly │
│ • Escalate if overdue │
│ • Don't close the post-mortem until all actions are done │
│ • Link actions to implementation │
│ │
│ METRICS: │
│ • Average time to close all actions: 14 days │
│ • Action completion rate: 92% │
│ • Repeat incidents: 2 (from uncompleted actions) │
└─────────────────────────────────────────────────────────────┘
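The tracking rules above (escalate overdue items, keep the post-mortem open until every action is done) are easy to check mechanically. The sketch below models the four example action items. The `ActionItem` class, the helper functions, and the year are assumptions for illustration, not a GitScrum API.

```python
from dataclasses import dataclass
from datetime import date

YEAR = 2024  # hypothetical; the example gives only month and day


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    status: str  # "Done", "In Progress", or "Not Started"


ACTIONS = [
    ActionItem("Add circuit breaker to payment service", "@alex",
               date(YEAR, 1, 22), "Done"),
    ActionItem("Create runbook for payment failures", "@maria",
               date(YEAR, 1, 25), "In Progress"),
    ActionItem("Set up payment API monitoring", "@jordan",
               date(YEAR, 1, 29), "Not Started"),
    ActionItem("Review test coverage for checkout flow", "@alex",
               date(YEAR, 2, 5), "Not Started"),
]


def overdue(actions, today):
    """Unfinished items whose due date has passed: candidates to escalate."""
    return [a for a in actions if a.status != "Done" and a.due < today]


def completion_rate(actions):
    """Fraction of action items marked Done."""
    return sum(a.status == "Done" for a in actions) / len(actions)


def can_close(actions):
    """A post-mortem may close only when every action item is Done."""
    return all(a.status == "Done" for a in actions)


today = date(YEAR, 1, 26)  # hypothetical review date
print([a.title for a in overdue(ACTIONS, today)])
# ['Create runbook for payment failures']
print(f"{completion_rate(ACTIONS):.0%}")  # 25%
print(can_close(ACTIONS))                 # False
```

Running a check like this during the weekly review gives the facilitator a concrete list of items to escalate.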