Incident Management with GitScrum
When things break, fast response matters. GitScrum helps teams coordinate incident response and capture lessons learned so the same failures are less likely to recur.
Incident Categories
Severity Levels
INCIDENT SEVERITY CLASSIFICATION:
┌─────────────────────────────────────────────────────────────┐
│ │
│ SEV 1 - CRITICAL │
│ 🔴 Complete outage or data loss │
│ • All users affected │
│ • Core functionality broken │
│ • Security breach │
│ Response: All hands, immediate escalation │
│ SLA: Acknowledge < 15 min, resolve ASAP │
│ │
│ SEV 2 - HIGH │
│ 🟠 Major feature unavailable │
│ • Many users affected │
│ • Significant functionality broken │
│ • Workaround exists but painful │
│ Response: On-call + relevant team │
│ SLA: Acknowledge < 1 hour, resolve < 4 hours │
│ │
│ SEV 3 - MEDIUM │
│ 🟡 Feature degraded │
│ • Some users affected │
│ • Workaround exists │
│ • Non-critical feature │
│ Response: On-call, escalate if needed │
│ SLA: Acknowledge < 4 hours, resolve < 24 hours │
│ │
│ SEV 4 - LOW │
│ 🟢 Minor issue │
│ • Few users affected │
│ • Easy workaround │
│ Response: Normal bug process │
│ SLA: Triage < 24 hours │
└─────────────────────────────────────────────────────────────┘
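These thresholds are easy to encode so breached SLAs can be flagged automatically. The sketch below mirrors the severity table above; the data structure and function names are illustrative, not part of GitScrum.

    from datetime import datetime, timedelta, timezone

    # SLA targets from the severity table above.
    # SEV 1 has no fixed resolution target ("resolve ASAP"); SEV 4 is triage-only.
    SLA = {
        "SEV1": {"acknowledge": timedelta(minutes=15), "resolve": None},
        "SEV2": {"acknowledge": timedelta(hours=1), "resolve": timedelta(hours=4)},
        "SEV3": {"acknowledge": timedelta(hours=4), "resolve": timedelta(hours=24)},
        "SEV4": {"acknowledge": timedelta(hours=24), "resolve": None},
    }

    def ack_breached(severity: str, opened_at: datetime, acked_at=None) -> bool:
        """True if the acknowledgement window for this severity has been missed."""
        deadline = opened_at + SLA[severity]["acknowledge"]
        checked_at = acked_at or datetime.now(timezone.utc)
        return checked_at > deadline

    # A SEV 1 opened 20 minutes ago and still unacknowledged is already in breach.
    opened = datetime.now(timezone.utc) - timedelta(minutes=20)
    print(ack_breached("SEV1", opened))  # True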
Incident Response
Response Flow
INCIDENT RESPONSE PROCESS:
┌─────────────────────────────────────────────────────────────┐
│ │
│ 1. DETECTION │
│ ↓ │
│ • Monitoring alert fires │
│ • User reports issue │
│ • Team member notices problem │
│ │
│ 2. TRIAGE (< 15 min for SEV 1) │
│ ↓ │
│ • Assess severity │
│ • Assign incident commander │
│ • Create incident task in GitScrum │
│ • Notify stakeholders │
│ │
│ 3. INVESTIGATION │
│ ↓ │
│ • Gather team (war room if needed) │
│ • Review logs, metrics, recent changes │
│ • Identify root cause │
│ • Document findings in real-time │
│ │
│ 4. MITIGATION │
│ ↓ │
│ • Implement fix or workaround │
│ • Rollback if necessary │
│ • Verify resolution │
│ • Monitor for recurrence │
│ │
│ 5. COMMUNICATION │
│ ↓ │
│ • Update status page │
│ • Notify affected users │
│ • Internal stakeholder update │
│ │
│ 6. FOLLOW-UP │
│ • Post-mortem within 48 hours │
│ • Action items tracked in GitScrum │
│ • Process improvements identified │
└─────────────────────────────────────────────────────────────┘
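As a rough sketch of how steps 1 and 2 of the flow above can be automated, the snippet below turns a monitoring alert into an incident record with an initial severity guess. The alert payload shape and the severity heuristic are assumptions made for illustration; this is not a GitScrum or PagerDuty API.

    from datetime import datetime, timezone

    def triage(alert: dict) -> dict:
        """Steps 1-2: assess severity and open an incident record.
        `alert` is a hypothetical monitoring payload, e.g.
        {"service": "payment-api", "error_rate": 0.62, "scope": "all"}."""
        # Crude severity heuristic mirroring the classification table.
        if alert["scope"] == "all" or alert["error_rate"] > 0.5:
            severity = "SEV1"
        elif alert["error_rate"] > 0.1:
            severity = "SEV2"
        else:
            severity = "SEV3"

        return {
            "title": f"{alert['service']} degraded ({alert['error_rate']:.0%} errors)",
            "severity": severity,
            "status": "active",
            "commander": None,  # assigned during triage
            "started": datetime.now(timezone.utc).isoformat(),
            "timeline": [],
        }

    # From here: create the task in GitScrum, page the on-call rotation,
    # and open a dedicated channel such as #incident-047.
    print(triage({"service": "payment-api", "error_rate": 0.62, "scope": "all"}))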
Incident Task Structure
INCIDENT TASK IN GITSCRUM:
┌─────────────────────────────────────────────────────────────┐
│ 🔴 INC-047: API 503 Errors - Payment Service │
├─────────────────────────────────────────────────────────────┤
│ │
│ Severity: SEV 1 - CRITICAL │
│ Status: 🔥 Active Incident │
│ Commander: @alex │
│ Started: 2024-01-15 14:32 UTC │
│ │
│ ═══════════════════════════════════════════════════════════ │
│ │
│ IMPACT: │
│ • All payment processing failing │
│ • ~100% of checkout attempts affected │
│ • Started: 14:32 UTC │
│ • Duration: 45 minutes (ongoing) │
│ │
│ TIMELINE: │
│ 14:32 - Alert: Payment API error rate > 50% │
│ 14:35 - Incident declared, @alex on point │
│ 14:40 - Identified: Database connection exhausted │
│ 14:45 - Root cause: Recent deploy increased connections │
│ 14:50 - Mitigation: Rolling back deploy │
│ 15:00 - Rollback complete, verifying │
│ │
│ ROOT CAUSE: │
│ Deploy at 14:15 introduced connection leak in new code │
│ Connection pool exhausted after ~15 minutes │
│ │
│ RESOLUTION: │
│   ☑ Rollback completed                                      │
│ ☐ Services recovering │
│ ☐ Monitoring for stability │
│ │
│ LINKED: │
│ • Alert: #12345 (PagerDuty) │
│ • Rollback PR: #789 │
│ • Follow-up: INC-047-PM (post-mortem) │
└─────────────────────────────────────────────────────────────┘
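Keeping the timeline accurate is easiest when entries are appended as things happen rather than reconstructed afterwards. A minimal sketch of that idea, with fields mirroring the task above (an illustration, not GitScrum's data model):

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class Incident:
        key: str          # e.g. "INC-047"
        title: str
        severity: str     # "SEV1" .. "SEV4"
        commander: str
        timeline: list = field(default_factory=list)

        def log(self, entry: str) -> None:
            """Append a timestamped line in the TIMELINE format shown above."""
            stamp = datetime.now(timezone.utc).strftime("%H:%M")
            self.timeline.append(f"{stamp} - {entry}")

    inc = Incident("INC-047", "API 503 Errors - Payment Service", "SEV1", "@alex")
    inc.log("Alert: Payment API error rate > 50%")
    inc.log("Incident declared, @alex on point")
    print("\n".join(inc.timeline))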
Communication
Stakeholder Updates
INCIDENT COMMUNICATION:
┌─────────────────────────────────────────────────────────────┐
│ │
│ INTERNAL COMMUNICATION: │
│ │
│ INITIAL NOTIFICATION (< 15 min): │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 🔴 INCIDENT: Payment processing down ││
│ │ ││
│ │ Severity: SEV 1 ││
│ │ Impact: All checkouts failing ││
│ │ Started: 14:32 UTC ││
│ │ Commander: @alex ││
│ │ Channel: #incident-047 ││
│ │ ││
│ │ Updates will follow every 15 minutes. ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ UPDATE (Every 15-30 min): │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 🟡 UPDATE: Payment incident ││
│ │ ││
│ │ Status: Mitigation in progress ││
│ │ Root cause: Database connection issue ││
│ │ Action: Rolling back recent deploy ││
│ │ ETA to resolution: 15 minutes ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ RESOLUTION: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ ✅ RESOLVED: Payment incident ││
│ │ ││
│ │ Duration: 45 minutes ││
│ │ Resolution: Rolled back problematic deploy ││
│ │ Post-mortem scheduled: Tomorrow 10am ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
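Because these updates follow a fixed shape, they can be generated from the incident fields so nobody has to compose them from scratch under pressure. A small sketch; the layout copies the update template above and the helper name is invented for illustration:

    def format_update(title: str, status: str, root_cause: str,
                      action: str, eta: str) -> str:
        """Render the periodic internal update as plain text."""
        return (
            f"🟡 UPDATE: {title}\n"
            f"Status: {status}\n"
            f"Root cause: {root_cause}\n"
            f"Action: {action}\n"
            f"ETA to resolution: {eta}"
        )

    print(format_update("Payment incident", "Mitigation in progress",
                        "Database connection issue",
                        "Rolling back recent deploy", "15 minutes"))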
Status Page
EXTERNAL STATUS UPDATES:
┌─────────────────────────────────────────────────────────────┐
│ │
│ CUSTOMER-FACING STATUS PAGE: │
│ │
│ INITIAL: │
│ "We are investigating issues with payment processing. │
│ We will provide updates as we learn more." │
│ │
│ PROGRESS: │
│ "We have identified the cause and are implementing │
│ a fix. We expect to resolve this within 30 minutes." │
│ │
│ RESOLVED: │
│ "Payment processing has been restored. All services │
│ are operating normally. We apologize for any │
│ inconvenience caused." │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ PRINCIPLES: │
│ │
│ ✅ DO: │
│ • Acknowledge quickly │
│ • Update regularly │
│ • Be honest about impact │
│ • Provide ETAs when known │
│ • Apologize sincerely │
│ │
│ ❌ DON'T: │
│ • Blame (people, vendors) │
│ • Get too technical │
│ • Over-promise resolution time │
│ • Go silent during incident │
└─────────────────────────────────────────────────────────────┘
Post-Mortem
Blameless Analysis
POST-MORTEM PROCESS:
┌─────────────────────────────────────────────────────────────┐
│ │
│ TIMING: Within 48 hours of incident │
│ │
│ ATTENDEES: │
│ • Incident commander │
│ • Responders involved │
│ • Relevant team leads │
│ • (Optional) Stakeholders │
│ │
│ AGENDA: │
│ │
│ 1. TIMELINE REVIEW (15 min) │
│ • What happened, in sequence │
│ • When did we detect, respond, resolve │
│ │
│ 2. ROOT CAUSE ANALYSIS (20 min) │
│ • Why did this happen? │
│ • 5 Whys technique │
│ • Contributing factors │
│ │
│ 3. WHAT WENT WELL (10 min) │
│ • Fast detection? │
│ • Good communication? │
│ • Effective mitigation? │
│ │
│ 4. WHAT COULD IMPROVE (10 min) │
│ • Where did we struggle? │
│ • What was missing? │
│ • What would help next time? │
│ │
│ 5. ACTION ITEMS (15 min) │
│ • Specific, assigned, time-bound │
│ • Track in GitScrum │
│ │
│ BLAMELESS: │
│ Focus on systems and processes, not people │
│ "The deploy process allowed..." not "@alex broke..." │
└─────────────────────────────────────────────────────────────┘
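To make the root cause step concrete, here is a 5 Whys chain for INC-047 written out as data, built from the root cause and contributing factors documented during the incident (the wording is illustrative):

    # "5 Whys" for INC-047: each answer becomes the subject of the next question.
    five_whys = [
        ("Why did payments fail?", "The payment API returned 503s."),
        ("Why did the API return 503s?", "The database connection pool was exhausted."),
        ("Why was the pool exhausted?", "New retry logic opened a connection per retry."),
        ("Why wasn't that caught before release?", "Load tests didn't exercise retry paths."),
        ("Why was there no early warning?", "No alarm on connection pool utilization."),
    ]

    for question, answer in five_whys:
        print(f"{question}\n  -> {answer}")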
Post-Mortem Document
POST-MORTEM TEMPLATE:
┌─────────────────────────────────────────────────────────────┐
│ POST-MORTEM: INC-047 Payment Service Outage │
├─────────────────────────────────────────────────────────────┤
│ │
│ DATE: 2024-01-15 │
│ DURATION: 45 minutes │
│ SEVERITY: SEV 1 │
│ AUTHOR: @alex │
│ │
│ ═══════════════════════════════════════════════════════════ │
│ │
│ SUMMARY: │
│   A deploy at 14:15 introduced a database connection leak.  │
│   The connection pool was exhausted by 14:32, causing all   │
│   payment API calls to fail with 503 errors until the       │
│   rollback; full recovery was confirmed at 15:17.           │
│ │
│ IMPACT: │
│ • ~500 failed checkout attempts │
│ • $X estimated lost revenue │
│ • Customer support spike: 47 tickets │
│ │
│ ROOT CAUSE: │
│   New retry logic in payment-service opened a new database  │
│   connection on each retry instead of reusing the existing  │
│   one. Under load, this exhausted the 100-connection pool.  │
│ │
│ CONTRIBUTING FACTORS: │
│ • No connection pool monitoring alarm │
│ • Load testing didn't simulate retry scenarios │
│ • Code review didn't catch connection handling │
│ │
│ ACTION ITEMS: │
│ ☐ [P1] Add connection pool utilization alarm │
│ Owner: @jordan, Due: 2024-01-17 │
│ ☐ [P2] Update load tests to include retry scenarios │
│ Owner: @maria, Due: 2024-01-22 │
│ ☐ [P2] Add connection handling to review checklist │
│ Owner: @alex, Due: 2024-01-19 │
│ │
│ LESSONS LEARNED: │
│ Retry logic needs careful review for resource leaks. │
│ Connection pool monitoring should be standard for all │
│ database-connected services. │
└─────────────────────────────────────────────────────────────┘
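Action items only pay off if they are tracked to completion, so it is worth flagging overdue ones automatically. A sketch using the items above; the tuple layout and completion flags are assumptions for the example, not a GitScrum feature:

    from datetime import date

    # (id, priority, owner, due date, done?) for the post-mortem above.
    action_items = [
        ("PM-047-1", "P1", "@jordan", date(2024, 1, 17), False),
        ("PM-047-2", "P2", "@maria", date(2024, 1, 22), False),
        ("PM-047-3", "P2", "@alex", date(2024, 1, 19), True),
    ]

    def overdue(items, today):
        """Open action items whose due date has passed."""
        return [i for i in items if not i[4] and i[3] < today]

    for item_id, prio, owner, due, _ in overdue(action_items, date(2024, 1, 20)):
        print(f"OVERDUE {item_id} [{prio}] {owner} was due {due}")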
Prevention
Learning from Incidents
TURNING INCIDENTS INTO IMPROVEMENTS:
┌─────────────────────────────────────────────────────────────┐
│ │
│ FROM POST-MORTEM TO PREVENTION: │
│ │
│ INCIDENT PATTERN SYSTEMIC FIX │
│ ───────────────────────────── ──────────────────── │
│ Deploy caused issue Canary deployments │
│ Missed in code review Updated checklist │
│ Slow to detect Better monitoring │
│ Took long to diagnose Improved logging │
│ No runbook existed Create runbooks │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ TRACK IN GITSCRUM: │
│ │
│ Labels: post-mortem, prevention │
│ │
│ Example tasks: │
│ PM-047-1: Add connection pool alarms │
│ PM-047-2: Update load test scenarios │
│ PM-047-3: Add to code review checklist │
│ │
│ REVIEW QUARTERLY: │
│ • What incidents occurred? │
│ • What patterns do we see? │
│ • Are action items completed? │
│ • Has recurrence reduced? │
│ │
│ METRICS: │
│ • MTTR (Mean Time To Resolve): trending down ✅ │
│ • Incident frequency: stable │
│ • Repeat incidents: 0 this quarter ✅ │
└─────────────────────────────────────────────────────────────┘
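These metrics fall straight out of the incident records if each one stores its start and resolution timestamps. A sketch; the second incident's data is invented for the example:

    from datetime import datetime

    # (started, resolved) pairs for the quarter.
    incidents = [
        (datetime(2024, 1, 15, 14, 32), datetime(2024, 1, 15, 15, 17)),  # INC-047, 45 min
        (datetime(2024, 2, 3, 9, 10), datetime(2024, 2, 3, 10, 40)),     # 90 min (invented)
    ]

    def mttr_minutes(pairs):
        """Mean Time To Resolve, in minutes, over the given incidents."""
        durations = [(end - start).total_seconds() / 60 for start, end in pairs]
        return sum(durations) / len(durations)

    print(f"MTTR: {mttr_minutes(incidents):.1f} min over {len(incidents)} incidents")
    # -> MTTR: 67.5 min over 2 incidents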