Incident Response for Development Teams

When things break, a clear response process helps a team restore service faster and learn from the failure. GitScrum helps teams track incidents alongside development work, linking each fix to the incident that triggered it.

Incident Basics

What Is an Incident

INCIDENT DEFINITION:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ AN INCIDENT IS:                                             │
│ ───────────────                                             │
│ Unplanned interruption to service                         │
│ Significant degradation of service quality                │
│ Breach of SLA or SLO                                      │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ SEVERITY LEVELS:                                            │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SEV-1 (CRITICAL):                                       ││
│ │ • Complete service outage                              ││
│ │ • Major feature broken for all users                  ││
│ │ • Data loss or security breach                        ││
│ │ Response: All hands, 24/7                             ││
│ │                                                         ││
│ │ SEV-2 (HIGH):                                           ││
│ │ • Partial outage or degradation                       ││
│ │ • Major feature broken for some users                 ││
│ │ • Workaround exists but painful                       ││
│ │ Response: Immediate during work hours                 ││
│ │                                                         ││
│ │ SEV-3 (MEDIUM):                                         ││
│ │ • Minor feature broken                                ││
│ │ • Performance degraded but usable                     ││
│ │ • Small subset of users affected                      ││
│ │ Response: Next business day                           ││
│ │                                                         ││
│ │ SEV-4 (LOW):                                            ││
│ │ • Cosmetic issues                                     ││
│ │ • Minor inconvenience                                 ││
│ │ Response: Scheduled work                              ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ GOAL: Restore service ASAP, learn after                   │
└─────────────────────────────────────────────────────────────┘
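The severity table above can be turned into a small triage helper so that responders apply the same rules each time. The sketch below is illustrative only: the `Impact` fields and the `classify` function are hypothetical names, and the mapping is one reading of the table, not a GitScrum feature.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical: all hands, 24/7
    SEV2 = 2  # High: immediate during work hours
    SEV3 = 3  # Medium: next business day
    SEV4 = 4  # Low: scheduled work


@dataclass
class Impact:
    """Hypothetical triage inputs; adapt the fields to your own service."""
    full_outage: bool = False
    data_loss_or_breach: bool = False
    partial_outage: bool = False
    major_feature_broken: bool = False
    all_users_affected: bool = False
    minor_feature_broken: bool = False
    performance_degraded: bool = False


def classify(impact: Impact) -> Severity:
    """Map observed impact to a severity level, checking the worst case first."""
    if impact.full_outage or impact.data_loss_or_breach:
        return Severity.SEV1
    if impact.major_feature_broken:
        # Broken for everyone is SEV-1; broken for only some users is SEV-2.
        return Severity.SEV1 if impact.all_users_affected else Severity.SEV2
    if impact.partial_outage:
        return Severity.SEV2
    if impact.minor_feature_broken or impact.performance_degraded:
        return Severity.SEV3
    # Cosmetic issues and minor inconveniences fall through to SEV-4.
    return Severity.SEV4
```

For example, `classify(Impact(major_feature_broken=True))` yields `Severity.SEV2`. Checking the worst conditions first means an incident that matches several rows gets the highest applicable severity.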

Incident Phases

Response Workflow

INCIDENT LIFECYCLE:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ PHASE 1: DETECT                                             │
│ ───────────────                                             │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SOURCES:                                                ││
│ │ • Automated monitoring alerts                         ││
│ │ • Customer reports                                    ││
│ │ • Internal team notices                               ││
│ │                                                         ││
│ │ ACTION:                                                  ││
│ │ Create incident record immediately                    ││
│ │ Don't wait to confirm                                 ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ PHASE 2: TRIAGE                                             │
│ ───────────────                                             │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ QUESTIONS:                                              ││
│ │ • What's the impact?                                  ││
│ │ • Who's affected?                                     ││
│ │ • What's the severity?                                ││
│ │                                                         ││
│ │ ACTION:                                                  ││
│ │ Assign severity                                       ││
│ │ Page appropriate responders                           ││
│ │ Open incident channel                                 ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ PHASE 3: RESPOND                                            │
│ ────────────────                                            │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ INVESTIGATE:                                            ││
│ │ • Check logs, metrics, recent changes                 ││
│ │ • Form hypotheses                                     ││
│ │ • Test and validate                                   ││
│ │                                                         ││
│ │ FIX:                                                     ││
│ │ • Apply remediation (restart, rollback, patch)        ││
│ │ • Verify service restored                             ││
│ │ • Monitor for recurrence                              ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ PHASE 4: RECOVER                                            │
│ ────────────────                                            │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ • Confirm service stable                              ││
│ │ • Close incident                                      ││
│ │ • Schedule postmortem                                 ││
│ │ • Communicate resolution                              ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ PHASE 5: LEARN                                              │
│ ──────────────                                              │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ • Conduct postmortem                                  ││
│ │ • Document findings                                   ││
│ │ • Create follow-up tasks                              ││
│ │ • Share learnings                                     ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
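The five phases above form a simple ordered workflow, which can be modeled as a small state machine. This is a minimal sketch under assumptions: the `Incident` class and `advance` method are hypothetical names, and the strictly linear order is a simplification.

```python
from enum import Enum


class Phase(Enum):
    DETECT = "detect"
    TRIAGE = "triage"
    RESPOND = "respond"
    RECOVER = "recover"
    LEARN = "learn"


# Each phase advances only to the one after it; LEARN has no successor.
NEXT_PHASE = {
    Phase.DETECT: Phase.TRIAGE,
    Phase.TRIAGE: Phase.RESPOND,
    Phase.RESPOND: Phase.RECOVER,
    Phase.RECOVER: Phase.LEARN,
}


class Incident:
    """Hypothetical incident record that tracks its current phase."""

    def __init__(self, title: str) -> None:
        self.title = title
        # The record is created as soon as something is detected.
        self.phase = Phase.DETECT
        self.history = [Phase.DETECT]

    def advance(self) -> Phase:
        """Move to the next phase, refusing to go past LEARN."""
        if self.phase not in NEXT_PHASE:
            raise ValueError(f"{self.title!r} is already in the final phase")
        self.phase = NEXT_PHASE[self.phase]
        self.history.append(self.phase)
        return self.phase
```

A real workflow may need to step back from RECOVER to RESPOND if the problem recurs while the service is being monitored. The sketch leaves that transition out to keep the ordering easy to follow.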

Roles During Incidents

Incident Team

INCIDENT ROLES:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ INCIDENT COMMANDER (IC):                                   │
│ ────────────────────────                                    │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ RESPONSIBILITIES:                                       ││
│ │ • Coordinates response                                ││
│ │ • Makes decisions when unclear                        ││
│ │ • Ensures communication happens                       ││
│ │ • Delegates tasks                                     ││
│ │                                                         ││
│ │ DOES NOT:                                                ││
│ │ • Debug the issue (usually)                           ││
│ │ • Write code                                          ││
│ │                                                         ││
│ │ SAYS:                                                    ││
│ │ "Alex, investigate database connections"              ││
│ │ "Jordan, update status page"                          ││
│ │ "What's our current theory?"                          ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ TECHNICAL RESPONDERS:                                       │
│ ─────────────────────                                       │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ RESPONSIBILITIES:                                       ││
│ │ • Investigate root cause                              ││
│ │ • Propose and implement fixes                         ││
│ │ • Report findings to IC                               ││
│ │                                                         ││
│ │ SAYS:                                                    ││
│ │ "I see connection pool exhausted in logs"             ││
│ │ "Restarting service now"                              ││
│ │ "Fix deployed, monitoring"                            ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ COMMUNICATIONS LEAD:                                        │
│ ────────────────────                                        │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ RESPONSIBILITIES:                                       ││
│ │ • Update status page                                  ││
│ │ • Communicate with customers                          ││
│ │ • Keep stakeholders informed                          ││
│ │                                                         ││
│ │ For smaller teams: IC handles this                    ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ SCRIBE:                                                     │
│ ───────                                                     │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ RESPONSIBILITIES:                                       ││
│ │ • Document timeline                                   ││
│ │ • Record actions taken                                ││
│ │ • Note key findings                                   ││
│ │                                                         ││
│ │ Essential for postmortem accuracy                     ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘

Communication

During Incidents

INCIDENT COMMUNICATION:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ INTERNAL COMMUNICATION:                                     │
│ ───────────────────────                                     │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ #incident-2025-01-21-api-outage                        ││
│ │                                                         ││
│ │ 14:32 [IC] Incident declared - API returning 503s     ││
│ │       Severity: SEV-1                                 ││
│ │       Impact: All users affected                      ││
│ │                                                         ││
│ │ 14:35 [Alex] Investigating. DB connections look high. ││
│ │                                                         ││
│ │ 14:38 [IC] Alex continue DB. Jordan check app logs.   ││
│ │                                                         ││
│ │ 14:42 [Alex] Connection pool exhausted. Recent        ││
│ │       deploy added query without closing connections.││
│ │                                                         ││
│ │ 14:45 [IC] Decision: Rollback to previous version.    ││
│ │       Alex proceed with rollback.                     ││
│ │                                                         ││
│ │ 14:48 [Alex] Rollback complete. Monitoring.           ││
│ │                                                         ││
│ │ 14:55 [IC] Service restored. Error rate back to       ││
│ │       normal. Keeping incident open for 15 min.      ││
│ │                                                         ││
│ │ 15:10 [IC] Incident resolved. Postmortem scheduled    ││
│ │       for tomorrow 10 AM.                             ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ EXTERNAL COMMUNICATION:                                     │
│ ────────────────────────                                    │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ STATUS PAGE UPDATES:                                    ││
│ │                                                         ││
│ │ 14:35 - Investigating                                  ││
│ │ "We are investigating reports of API errors.          ││
│ │  Some users may experience issues accessing the       ││
│ │  service. We will provide updates as we learn more." ││
│ │                                                         ││
│ │ 14:50 - Identified                                     ││
│ │ "We have identified the cause and are deploying       ││
│ │  a fix. Service should be restored within 15 minutes."││
│ │                                                         ││
│ │ 15:10 - Resolved                                       ││
│ │ "The issue has been resolved. API is fully            ││
│ │  operational. We apologize for any inconvenience."   ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ UPDATE FREQUENCY:                                           │
│ Every 15-30 min during active incident                    │
│ Even if just "Still investigating, no new info"           │
│ Silence is worse than a "no news" update                  │
└─────────────────────────────────────────────────────────────┘
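The update cadence above is easy to forget in the middle of an incident, so some teams automate a reminder. The following sketch assumes a 15-minute threshold (the lower bound of the range above). `update_due` and `holding_message` are hypothetical helpers, not part of any status page API.

```python
from datetime import datetime, timedelta

# Assumed threshold: the lower bound of the 15-30 minute range.
MAX_SILENCE = timedelta(minutes=15)


def update_due(last_update: datetime, now: datetime,
               max_silence: timedelta = MAX_SILENCE) -> bool:
    """Return True once the time since the last update reaches max_silence."""
    return now - last_update >= max_silence


def holding_message(now: datetime) -> str:
    """Build a 'no new info' update so the status page never goes quiet."""
    return f"{now:%H:%M} - Investigating: still investigating, no new information yet."
```

With the last update posted at 14:35, `update_due` returns `False` at 14:45 and `True` at 14:50, at which point the communications lead can post either real news or the holding message.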

Blameless Postmortems

Learning from Incidents

BLAMELESS POSTMORTEM:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ CORE PRINCIPLE:                                             │
│ ───────────────                                             │
│ Focus on WHAT happened, not WHO caused it                 │
│ People did the best they could with information they had  │
│ Blame creates fear, fear hides problems                   │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ POSTMORTEM STRUCTURE:                                       │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ INCIDENT POSTMORTEM                                    ││
│ │ Date: January 21, 2025                                 ││
│ │ Duration: 25 minutes                                   ││
│ │ Severity: SEV-1                                        ││
│ │ Impact: 100% of users, ~$15K revenue loss             ││
│ │                                                         ││
│ │ SUMMARY:                                                ││
│ │ API returned 503 errors for 25 minutes due to         ││
│ │ database connection pool exhaustion caused by         ││
│ │ a query that didn't close connections properly.       ││
│ │                                                         ││
│ │ TIMELINE:                                               ││
│ │ 14:30 - Deploy with new query goes out                ││
│ │ 14:32 - Alerts fire for 503 errors                    ││
│ │ 14:32 - Incident declared                              ││
│ │ 14:42 - Root cause identified                         ││
│ │ 14:48 - Rollback completed                             ││
│ │ 14:55 - Service restored                               ││
│ │ 15:10 - Incident closed                                ││
│ │                                                         ││
│ │ ROOT CAUSE:                                             ││
│ │ New database query in user service didn't release    ││
│ │ connections after use. Under load, pool exhausted.   ││
│ │                                                         ││
│ │ CONTRIBUTING FACTORS:                                   ││
│ │ • No load testing for new query                      ││
│ │ • Connection pool monitoring didn't alert            ││
│ │ • Code review didn't catch missing connection close  ││
│ │                                                         ││
│ │ WHAT WENT WELL:                                         ││
│ │ • Quick detection (2 min to alert)                   ││
│ │ • Fast rollback capability                           ││
│ │ • Clear incident communication                       ││
│ │                                                         ││
│ │ WHAT COULD BE IMPROVED:                                 ││
│ │ • Catch this class of bug before production         ││
│ │ • Better connection pool visibility                  ││
│ │                                                         ││
│ │ ACTION ITEMS:                                           ││
│ │ ☐ Add linter rule for connection handling (Alex)     ││
│ │ ☐ Add connection pool alert threshold (Jordan)       ││
│ │ ☐ Load test for queries > 10 rows (Team)             ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ ACTION ITEMS → BACKLOG → SCHEDULED                        │
│ Don't let learnings get lost                              │
└─────────────────────────────────────────────────────────────┘
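The root cause in the sample postmortem, a query that never released its connection, is a common class of bug. The sketch below reproduces it with a toy pool. `TinyPool`, `leaky_query`, and `safe_query` are hypothetical stand-ins, not the API of any real database driver, and the pool size of 2 is chosen only to make exhaustion quick to trigger.

```python
from contextlib import contextmanager


class PoolExhausted(RuntimeError):
    """Raised when every connection in the pool is already checked out."""


class TinyPool:
    """Toy stand-in for a database connection pool (not a real driver API)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0

    def acquire(self) -> object:
        if self.in_use >= self.size:
            raise PoolExhausted(f"all {self.size} connections are in use")
        self.in_use += 1
        return object()  # placeholder for a real connection

    def release(self, conn: object) -> None:
        self.in_use -= 1

    @contextmanager
    def connection(self):
        conn = self.acquire()
        try:
            yield conn
        finally:
            # Runs even if the query raises, so the slot is always returned.
            self.release(conn)


def leaky_query(pool: TinyPool) -> None:
    pool.acquire()  # bug: the connection is never released


def safe_query(pool: TinyPool) -> None:
    with pool.connection():
        pass  # run the query here
```

Calling `leaky_query` on a pool of size 2 succeeds twice and raises `PoolExhausted` on the third call, while `safe_query` can run any number of times because the context manager returns the connection on every exit path. A linter rule like the one in the action items would aim to flag the bare `acquire()` pattern.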

On-Call Practices

Sustainable Response

ON-CALL BEST PRACTICES:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ ON-CALL ROTATION:                                           │
│ ─────────────────                                           │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SCHEDULE:                                               ││
│ │ Week 1: Alex (primary), Jordan (secondary)            ││
│ │ Week 2: Jordan (primary), Sam (secondary)             ││
│ │ Week 3: Sam (primary), Taylor (secondary)             ││
│ │ Week 4: Taylor (primary), Alex (secondary)            ││
│ │                                                         ││
│ │ RULES:                                                   ││
│ │ • Week-long rotations                                 ││
│ │ • Secondary is backup, not full on-call              ││
│ │ • Handoff meeting at rotation change                  ││
│ │ • On-call gets comp time after heavy week            ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ SUSTAINABLE ON-CALL:                                        │
│ ────────────────────                                        │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ HEALTHY TARGETS:                                        ││
│ │ • <2 pages per week average                           ││
│ │ • <1 night page per month                             ││
│ │ • 4+ people in rotation                               ││
│ │                                                         ││
│ │ IF EXCEEDING:                                           ││
│ │ • Fix noisy alerts                                    ││
│ │ • Improve system reliability                          ││
│ │ • Add more people to rotation                        ││
│ │                                                         ││
│ │ BURNOUT WARNING SIGNS:                                  ││
│ │ • Same person always gets paged                       ││
│ │ • People dreading on-call                             ││
│ │ • High turnover on on-call team                       ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ RUNBOOKS:                                                   │
│ ─────────                                                   │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Every alert should have a runbook link:               ││
│ │                                                         ││
│ │ ALERT: High error rate on API                         ││
│ │ RUNBOOK: docs/runbooks/api-high-errors.md            ││
│ │                                                         ││
│ │ Runbook contains:                                      ││
│ │ • What to check first                                 ││
│ │ • Common causes and fixes                             ││
│ │ • When to escalate                                    ││
│ │ • Who to contact                                      ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
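The rotation schedule above follows a simple pattern: each week's secondary becomes the next week's primary. A minimal sketch of generating such a schedule is shown below. `build_rotation` is a hypothetical helper, and it assumes week-long shifts with a fixed ordering of people.

```python
def build_rotation(people: list[str], weeks: int) -> list[tuple[int, str, str]]:
    """Return (week, primary, secondary) rows for a round-robin rotation."""
    if len(people) < 2:
        raise ValueError("need at least two people for primary and secondary")
    rows = []
    for week in range(weeks):
        primary = people[week % len(people)]
        # The secondary is the next person in line, wrapping at the end.
        secondary = people[(week + 1) % len(people)]
        rows.append((week + 1, primary, secondary))
    return rows


if __name__ == "__main__":
    for week, primary, secondary in build_rotation(["Alex", "Jordan", "Sam", "Taylor"], 4):
        print(f"Week {week}: {primary} (primary), {secondary} (secondary)")
```

Run with the four names from the schedule above, this reproduces the same four weeks. Having the secondary roll into the primary slot means the incoming primary has already seen the previous week's incidents, which makes the handoff meeting shorter.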
