Incident Management in GitScrum | Track, Respond, Learn
Track incidents with severity levels, coordinate response teams, and run blameless post-mortems. GitScrum connects fixes to triggering events for prevention.
When things break, fast response matters. GitScrum helps teams coordinate incident response and document learnings for future prevention.
Incident Categories
Severity Levels
INCIDENT SEVERITY CLASSIFICATION:

SEV 1 - CRITICAL
🔴 Complete outage or data loss
  • All users affected
  • Core functionality broken
  • Security breach
  Response: All hands, immediate escalation
  SLA: Acknowledge < 15 min, resolve ASAP

SEV 2 - HIGH
🟠 Major feature unavailable
  • Many users affected
  • Significant functionality broken
  • Workaround exists but painful
  Response: On-call + relevant team
  SLA: Acknowledge < 1 hour, resolve < 4 hours

SEV 3 - MEDIUM
🟡 Feature degraded
  • Some users affected
  • Workaround exists
  • Non-critical feature
  Response: On-call, escalate if needed
  SLA: Acknowledge < 4 hours, resolve < 24 hours

SEV 4 - LOW
🟢 Minor issue
  • Few users affected
  • Easy workaround
  Response: Normal bug process
  SLA: Triage < 24 hours
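The severity table above can be encoded as data so that tooling can check acknowledgement targets automatically. The following is a minimal sketch, not a GitScrum feature: the names `SeverityPolicy`, `POLICIES`, and `acknowledgement_breached` are hypothetical, and treating the SEV 4 "Triage < 24 hours" target as an acknowledgement target is an assumption made for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    """One row of the severity table (hypothetical structure)."""
    name: str
    response: str
    acknowledge_within: timedelta
    # None means the table gives no fixed target (SEV 1 "ASAP", SEV 4 unspecified).
    resolve_within: Optional[timedelta]


POLICIES = {
    1: SeverityPolicy("SEV 1 - CRITICAL", "All hands, immediate escalation",
                      timedelta(minutes=15), None),
    2: SeverityPolicy("SEV 2 - HIGH", "On-call + relevant team",
                      timedelta(hours=1), timedelta(hours=4)),
    3: SeverityPolicy("SEV 3 - MEDIUM", "On-call, escalate if needed",
                      timedelta(hours=4), timedelta(hours=24)),
    # Assumption: the SEV 4 triage target is stored as the acknowledgement target.
    4: SeverityPolicy("SEV 4 - LOW", "Normal bug process",
                      timedelta(hours=24), None),
}


def acknowledgement_breached(severity: int, elapsed: timedelta) -> bool:
    """Return True once the unacknowledged time reaches the target for this severity."""
    return elapsed >= POLICIES[severity].acknowledge_within
```

For example, `acknowledgement_breached(1, timedelta(minutes=20))` is true, while the same 20 minutes is still within target for a SEV 2 incident.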
Incident Response
Response Flow
INCIDENT RESPONSE PROCESS:

1. DETECTION
  • Monitoring alert fires
  • User reports issue
  • Team member notices problem

2. TRIAGE (< 15 min for SEV 1)
  • Assess severity
  • Assign incident commander
  • Create incident task in GitScrum
  • Notify stakeholders

3. INVESTIGATION
  • Gather team (war room if needed)
  • Review logs, metrics, recent changes
  • Identify root cause
  • Document findings in real time

4. MITIGATION
  • Implement fix or workaround
  • Roll back if necessary
  • Verify resolution
  • Monitor for recurrence

5. COMMUNICATION
  • Update status page
  • Notify affected users
  • Internal stakeholder update

6. FOLLOW-UP
  • Post-mortem within 48 hours
  • Action items tracked in GitScrum
  • Process improvements identified
Incident Task Structure
INCIDENT TASK IN GITSCRUM:

🔴 INC-047: API 503 Errors - Payment Service

Severity:  SEV 1 - CRITICAL
Status:    🔥 Active Incident
Commander: @alex
Started:   2024-01-15 14:32 UTC

IMPACT:
  • All payment processing failing
  • ~100% of checkout attempts affected
  • Started: 14:32 UTC
  • Duration: 45 minutes (ongoing)

TIMELINE:
  14:32 - Alert: Payment API error rate > 50%
  14:35 - Incident declared, @alex on point
  14:40 - Identified: Database connections exhausted
  14:45 - Root cause: Recent deploy increased connections
  14:50 - Mitigation: Rolling back deploy
  15:00 - Rollback complete, verifying

ROOT CAUSE:
  Deploy at 14:15 introduced connection leak in new code
  Connection pool exhausted after ~15 minutes

RESOLUTION:
  ✓ Rollback completed
  ✓ Services recovering
  ✓ Monitoring for stability

LINKED:
  • Alert: #12345 (PagerDuty)
  • Rollback PR: #789
  • Follow-up: INC-047-PM (post-mortem)
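The incident task above is essentially a small record with a time-ordered log. As an illustration only, it could be modelled like this; the `Incident` class and the `utc` helper are hypothetical names, not part of any GitScrum API, and the sample values are taken from the INC-047 example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def utc(hour: int, minute: int) -> datetime:
    """Build a UTC timestamp on the example incident's date (2024-01-15)."""
    return datetime(2024, 1, 15, hour, minute, tzinfo=timezone.utc)


@dataclass
class Incident:
    """Hypothetical in-memory model of the incident task shown above."""
    key: str
    title: str
    severity: int
    commander: str
    started: datetime
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)

    def log(self, at: datetime, note: str) -> None:
        """Add a timeline entry and keep the timeline in chronological order."""
        self.timeline.append((at, note))
        self.timeline.sort(key=lambda entry: entry[0])

    def elapsed(self, now: datetime) -> timedelta:
        """Time since the incident started, as of `now`."""
        return now - self.started


incident = Incident("INC-047", "API 503 Errors - Payment Service", 1, "@alex", utc(14, 32))
# Entries may be logged out of order during a busy incident; log() re-sorts them.
incident.log(utc(14, 35), "Incident declared, @alex on point")
incident.log(utc(14, 32), "Alert: Payment API error rate > 50%")
incident.log(utc(15, 0), "Rollback complete, verifying")
```

Keeping timestamps timezone-aware (UTC here) avoids ambiguity when responders in different regions add entries to the same timeline.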
Communication
Stakeholder Updates
INCIDENT COMMUNICATION:

INTERNAL COMMUNICATION:

INITIAL NOTIFICATION (< 15 min):
    🔴 INCIDENT: Payment processing down

    Severity: SEV 1
    Impact: All checkouts failing
    Started: 14:32 UTC
    Commander: @alex
    Channel: #incident-047

    Updates will follow every 15 minutes.

UPDATE (Every 15-30 min):
    🟡 UPDATE: Payment incident

    Status: Mitigation in progress
    Root cause: Database connection issue
    Action: Rolling back recent deploy
    ETA to resolution: 15 minutes

RESOLUTION:
    ✅ RESOLVED: Payment incident

    Duration: 45 minutes
    Resolution: Rolled back problematic deploy
    Post-mortem scheduled: Tomorrow 10am
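Because the internal messages above follow a fixed shape, they are easy to generate from incident fields so that no line is forgotten under pressure. A minimal sketch, assuming a hypothetical helper `initial_notification` (this is not a GitScrum function, and how the text is delivered to chat is left out):

```python
def initial_notification(summary: str, severity: int, impact: str, started: str,
                         commander: str, channel: str, update_minutes: int = 15) -> str:
    """Render the initial internal notification in the layout shown above."""
    lines = [
        f"🔴 INCIDENT: {summary}",
        "",
        f"Severity: SEV {severity}",
        f"Impact: {impact}",
        f"Started: {started}",
        f"Commander: {commander}",
        f"Channel: {channel}",
        "",
        f"Updates will follow every {update_minutes} minutes.",
    ]
    return "\n".join(lines)


message = initial_notification(
    summary="Payment processing down",
    severity=1,
    impact="All checkouts failing",
    started="14:32 UTC",
    commander="@alex",
    channel="#incident-047",
)
print(message)
```

The same pattern extends to the update and resolution messages by swapping the header line and fields.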
Status Page
EXTERNAL STATUS UPDATES:

CUSTOMER-FACING STATUS PAGE:

INITIAL:
  "We are investigating issues with payment processing.
  We will provide updates as we learn more."

PROGRESS:
  "We have identified the cause and are implementing
  a fix. We expect to resolve this within 30 minutes."

RESOLVED:
  "Payment processing has been restored. All services
  are operating normally. We apologize for any
  inconvenience caused."

PRINCIPLES:

✅ DO:
  • Acknowledge quickly
  • Update regularly
  • Be honest about impact
  • Provide ETAs when known
  • Apologize sincerely

❌ DON'T:
  • Blame (people, vendors)
  • Get too technical
  • Over-promise resolution time
  • Go silent during incident
Post-Mortem
Blameless Analysis
POST-MORTEM PROCESS:

TIMING: Within 48 hours of incident

ATTENDEES:
  • Incident commander
  • Responders involved
  • Relevant team leads
  • (Optional) Stakeholders

AGENDA:

1. TIMELINE REVIEW (15 min)
  • What happened, in sequence
  • When did we detect, respond, resolve

2. ROOT CAUSE ANALYSIS (20 min)
  • Why did this happen?
  • 5 Whys technique
  • Contributing factors

3. WHAT WENT WELL (10 min)
  • Fast detection?
  • Good communication?
  • Effective mitigation?

4. WHAT COULD IMPROVE (10 min)
  • Where did we struggle?
  • What was missing?
  • What would help next time?

5. ACTION ITEMS (15 min)
  • Specific, assigned, time-bound
  • Track in GitScrum

BLAMELESS:
  Focus on systems and processes, not people
  "The deploy process allowed..." not "@alex broke..."
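The timing rule and agenda above reduce to simple date arithmetic, which helps when booking the meeting. A small sketch, assuming the 48-hour window is counted from the moment the incident is resolved (the text only says "within 48 hours of incident"); `AGENDA`, `postmortem_deadline`, and `meeting_length` are hypothetical names:

```python
from datetime import datetime, timedelta, timezone

# Agenda items and their time boxes in minutes, as listed above.
AGENDA = [
    ("Timeline review", 15),
    ("Root cause analysis", 20),
    ("What went well", 10),
    ("What could improve", 10),
    ("Action items", 15),
]


def postmortem_deadline(resolved_at: datetime) -> datetime:
    """Latest start for the post-mortem (assumption: 48 hours after resolution)."""
    return resolved_at + timedelta(hours=48)


def meeting_length(agenda) -> timedelta:
    """Total time to book, summed from the agenda time boxes."""
    return timedelta(minutes=sum(minutes for _, minutes in agenda))


resolved = datetime(2024, 1, 15, 15, 17, tzinfo=timezone.utc)
deadline = postmortem_deadline(resolved)
length = meeting_length(AGENDA)
```

With the time boxes listed above, the meeting needs a 70-minute slot.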
Post-Mortem Document
POST-MORTEM TEMPLATE:

POST-MORTEM: INC-047 Payment Service Outage

DATE:     2024-01-15
DURATION: 45 minutes
SEVERITY: SEV 1
AUTHOR:   @alex

SUMMARY:
  A deploy at 14:15 introduced a database connection leak.
  Connection pool exhausted by 14:32, causing all payment
  API calls to fail with 503 errors until rollback at
  15:17.

IMPACT:
  • ~500 failed checkout attempts
  • $X estimated lost revenue
  • Customer support spike: 47 tickets

ROOT CAUSE:
  New retry logic in payment-service opened a new database
  connection on each retry instead of reusing the existing
  connection. Under load, this exhausted the 100-connection pool.

CONTRIBUTING FACTORS:
  • No connection pool monitoring alarm
  • Load testing didn't simulate retry scenarios
  • Code review didn't catch connection handling

ACTION ITEMS:
  ☐ [P1] Add connection pool utilization alarm
         Owner: @jordan, Due: 2024-01-17
  ☐ [P2] Update load tests to include retry scenarios
         Owner: @maria, Due: 2024-01-22
  ☐ [P2] Add connection handling to review checklist
         Owner: @alex, Due: 2024-01-19

LESSONS LEARNED:
  Retry logic needs careful review for resource leaks.
  Connection pool monitoring should be standard for all
  database-connected services.
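Action items are only useful if someone notices when they slip. The sketch below models the three action items from the template above and lists the ones that are open past their due date; `ActionItem` and `overdue` are hypothetical names for illustration, not a GitScrum API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    """One post-mortem action item (hypothetical structure)."""
    priority: str
    title: str
    owner: str
    due: date
    done: bool = False


ITEMS = [
    ActionItem("P1", "Add connection pool utilization alarm", "@jordan", date(2024, 1, 17)),
    ActionItem("P2", "Update load tests to include retry scenarios", "@maria", date(2024, 1, 22)),
    ActionItem("P2", "Add connection handling to review checklist", "@alex", date(2024, 1, 19)),
]


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Open items whose due date has passed, earliest due date first."""
    late = [item for item in items if not item.done and item.due < today]
    return sorted(late, key=lambda item: item.due)


late_items = overdue(ITEMS, today=date(2024, 1, 20))
```

Checked on 2024-01-20 with nothing marked done, this returns the items owned by @jordan and @alex, while @maria's item is still within its due date.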
Prevention
Learning from Incidents
TURNING INCIDENTS INTO IMPROVEMENTS:

FROM POST-MORTEM TO PREVENTION:

  INCIDENT PATTERN          SYSTEMIC FIX
  ------------------------  ------------------
  Deploy caused issue       Canary deployments
  Missed in code review     Updated checklist
  Slow to detect            Better monitoring
  Took long to diagnose     Improved logging
  No runbook existed        Create runbooks

TRACK IN GITSCRUM:

  Labels: post-mortem, prevention

  Example tasks:
  PM-047-1: Add connection pool alarms
  PM-047-2: Update load test scenarios
  PM-047-3: Add to code review checklist

REVIEW QUARTERLY:
  • What incidents occurred?
  • What patterns do we see?
  • Are action items completed?
  • Has recurrence reduced?

METRICS:
  • MTTR (Mean Time To Resolve): trending down ✅
  • Incident frequency: stable
  • Repeat incidents: 0 this quarter ✅
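The quarterly metrics above can be computed from a plain list of resolved incidents. This is a minimal sketch: `mttr` and `repeat_causes` are hypothetical helpers, and apart from the 45-minute INC-047 outage, the durations and root-cause labels are invented sample values used only to show the arithmetic.

```python
from collections import Counter
from datetime import timedelta
from typing import Dict, List


def mttr(durations: List[timedelta]) -> timedelta:
    """Mean Time To Resolve: the arithmetic mean of incident durations."""
    if not durations:
        raise ValueError("no incidents to average")
    return sum(durations, timedelta()) / len(durations)


def repeat_causes(causes: List[str]) -> Dict[str, int]:
    """Root causes that appear more than once, with their counts."""
    counts = Counter(causes)
    return {cause: n for cause, n in counts.items() if n > 1}


# 45 minutes is INC-047 from this article; the other two values are made up.
quarter_durations = [timedelta(minutes=45), timedelta(minutes=30), timedelta(minutes=90)]
# Hypothetical root-cause labels, one per incident in the quarter.
quarter_causes = ["connection-leak", "expired-certificate", "disk-full"]

quarter_mttr = mttr(quarter_durations)
quarter_repeats = repeat_causes(quarter_causes)
```

Comparing `quarter_mttr` across quarters shows whether resolution time is trending down, and an empty `quarter_repeats` corresponds to "Repeat incidents: 0 this quarter".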