
Handling Urgent Production Issues

Production incidents demand an immediate, coordinated response. GitScrum's incident management capabilities help teams respond rapidly through dedicated escalation workflows, real-time communication integration, and structured incident tracking. The key is having rehearsed playbooks so teams act from practice rather than improvising under pressure, followed by thorough post-mortems to prevent recurrence.

Incident Response Framework

Severity Classification

INCIDENT SEVERITY LEVELS:
┌─────────────────────────────────────────────────────────────┐
│ DEFINING SEVERITY                                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ SEV-1: CRITICAL                                             │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Definition:                                             ││
│ │ • Core service completely down                          ││
│ │ • Data loss or security breach                          ││
│ │ • All customers affected                                ││
│ │                                                         ││
│ │ Examples:                                               ││
│ │ • Production database offline                           ││
│ │ • Payment processing failed                             ││
│ │ • Authentication system down                            ││
│ │                                                         ││
│ │ Response:                                               ││
│ │ • All hands on deck immediately                         ││
│ │ • Leadership notified within 5 minutes                  ││
│ │ • Customer communication within 15 minutes              ││
│ │ • Continuous updates until resolved                     ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ SEV-2: HIGH                                                 │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Definition:                                             ││
│ │ • Major feature unavailable                             ││
│ │ • Significant performance degradation                   ││
│ │ • Many customers affected                               ││
│ │                                                         ││
│ │ Examples:                                               ││
│ │ • Search functionality broken                           ││
│ │ • Reports not generating                                ││
│ │ • API response times > 30 seconds                       ││
│ │                                                         ││
│ │ Response:                                               ││
│ │ • On-call team engaged within 15 minutes                ││
│ │ • Business hours: team lead aware                       ││
│ │ • Updates every 30 minutes                              ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ SEV-3: MEDIUM                                               │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Definition:                                             ││
│ │ • Minor feature issues                                  ││
│ │ • Workaround available                                  ││
│ │ • Some customers affected                               ││
│ │                                                         ││
│ │ Response:                                               ││
│ │ • Address within current business day                   ││
│ │ • Track in normal workflow                              ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ SEV-4: LOW                                                  │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Definition:                                             ││
│ │ • Cosmetic issues                                       ││
│ │ • Edge cases                                            ││
│ │ • Documentation errors                                  ││
│ │                                                         ││
│ │ Response:                                               ││
│ │ • Normal backlog prioritization                         ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
└─────────────────────────────────────────────────────────────┘
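The severity table above maps cleanly onto a small data structure, which lets alerting scripts apply the same response targets every time instead of relying on memory. A minimal Python sketch; the class and field names are illustrative rather than any GitScrum feature, and None marks targets the table leaves to normal workflow.

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = "sev-1"  # core service down, data loss/breach, all customers affected
    SEV2 = "sev-2"  # major feature unavailable, many customers affected
    SEV3 = "sev-3"  # minor issue with a workaround, some customers affected
    SEV4 = "sev-4"  # cosmetic issues, edge cases, documentation errors


@dataclass(frozen=True)
class ResponseTargets:
    page_immediately: bool                 # all hands / on-call engaged right away
    leadership_notice_min: Optional[int]   # minutes until leadership is told
    customer_notice_min: Optional[int]     # minutes until customers are told
    update_interval_min: Optional[int]     # cadence of status updates


# Targets copied from the severity table above.
RESPONSE_TARGETS = {
    Severity.SEV1: ResponseTargets(True, 5, 15, 15),
    Severity.SEV2: ResponseTargets(True, None, None, 30),
    Severity.SEV3: ResponseTargets(False, None, None, None),  # same business day
    Severity.SEV4: ResponseTargets(False, None, None, None),  # normal backlog
}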

Incident Roles

RESPONSE TEAM STRUCTURE:
┌─────────────────────────────────────────────────────────────┐
│ ROLE ASSIGNMENTS                                            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ INCIDENT COMMANDER (IC):                                    │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ WHO: On-call engineer or designated leader              ││
│ │                                                         ││
│ │ RESPONSIBILITIES:                                       ││
│ │ • Coordinates all response activities                   ││
│ │ • Makes decisions on escalation                         ││
│ │ • Assigns tasks to team members                         ││
│ │ • Decides when to declare "resolved"                    ││
│ │                                                         ││
│ │ NOT RESPONSIBLE FOR:                                    ││
│ │ • Debugging code (delegates to others)                  ││
│ │ • Writing customer communications                       ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ TECHNICAL LEAD:                                             │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ WHO: Most experienced engineer available                ││
│ │                                                         ││
│ │ RESPONSIBILITIES:                                       ││
│ │ • Investigates root cause                               ││
│ │ • Implements fixes                                      ││
│ │ • Advises IC on technical decisions                     ││
│ │ • Coordinates with other engineers                      ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ COMMUNICATIONS LEAD:                                        │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ WHO: PM, support lead, or designated communicator       ││
│ │                                                         ││
│ │ RESPONSIBILITIES:                                       ││
│ │ • Updates status page                                   ││
│ │ • Notifies stakeholders                                 ││
│ │ • Responds to customer inquiries                        ││
│ │ • Maintains incident timeline                           ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ SCRIBE:                                                     │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ WHO: Any available team member                          ││
│ │                                                         ││
│ │ RESPONSIBILITIES:                                       ││
│ │ • Documents everything in real-time                     ││
│ │ • Records decisions and reasons                         ││
│ │ • Captures timeline of events                           ││
│ │ • Creates foundation for post-mortem                    ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
└─────────────────────────────────────────────────────────────┘
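Recording who holds each role in a small structure keeps the assignment unambiguous once the war room opens. A sketch only; nothing below is a GitScrum feature, and the warnings simply restate the "not responsible for" notes above.

from dataclasses import dataclass


@dataclass
class IncidentRoles:
    incident_commander: str   # coordinates response, escalates, declares "resolved"
    technical_lead: str       # investigates root cause, implements fixes
    communications_lead: str  # status page, stakeholders, customer inquiries
    scribe: str               # real-time record that feeds the post-mortem

    def warnings(self) -> list[str]:
        """Flag assignments that blur the role split described above."""
        issues = []
        if self.incident_commander == self.technical_lead:
            issues.append("IC is also Tech Lead: the IC should delegate debugging.")
        if self.incident_commander == self.communications_lead:
            issues.append("IC is also writing customer comms: hand this off if possible.")
        return issues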

Response Workflow

Initial Response

FIRST 15 MINUTES:
┌─────────────────────────────────────────────────────────────┐
│ IMMEDIATE ACTIONS                                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ MINUTE 0-5: DETECTION & DECLARATION                         │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 1. Alert triggered (monitoring) or reported (user)      ││
│ │ 2. On-call receives notification                        ││
│ │ 3. Verify incident is real (not false alarm)            ││
│ │ 4. Declare severity level                               ││
│ │ 5. Create incident channel/task                         ││
│ │                                                         ││
│ │ IN GITSCRUM:                                            ││
│ │ Create task: "INCIDENT: [Brief description]"            ││
│ │ Add labels: incident, sev-1 (or appropriate level)      ││
│ │ Assign: Incident Commander                              ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ MINUTE 5-10: MOBILIZATION                                   │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 1. Page relevant team members (if needed)               ││
│ │ 2. Open war room (video call or chat channel)           ││
│ │ 3. Assign roles: IC, Tech Lead, Comms, Scribe           ││
│ │ 4. Brief everyone on known facts                        ││
│ │                                                         ││
│ │ IC SAYS:                                                ││
│ │ "We have a SEV-1: [description]. I'm IC. [Name] is      ││
│ │  Tech Lead. [Name] is handling comms. [Name] is scribe. ││
│ │  Here's what we know so far..."                         ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ MINUTE 10-15: INITIAL INVESTIGATION                         │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 1. Check recent deployments (last 24 hours)             ││
│ │ 2. Review monitoring dashboards                         ││
│ │ 3. Check external dependencies status                   ││
│ │ 4. Gather initial hypotheses                            ││
│ │                                                         ││
│ │ TECH LEAD ASKS:                                         ││
│ │ "Did anything deploy recently?"                         ││
│ │ "When did metrics start degrading?"                     ││
│ │ "Is this isolated or widespread?"                       ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
└─────────────────────────────────────────────────────────────┘
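The "IN GITSCRUM" step in minute 0-5 can be scripted so the incident task is created with the right title, labels, and assignee before anyone has to think about it. The sketch below assumes a generic REST endpoint; the URL, payload shape, and token variable are placeholders for illustration, not GitScrum's documented API, so adapt them to however your project actually exposes task creation.

import os

import requests  # third-party HTTP client, assumed to be installed


# Placeholders: point these at your real task-creation endpoint and credentials.
API_BASE = os.environ.get("TASK_API_BASE", "https://example.invalid/api")
API_TOKEN = os.environ["TASK_API_TOKEN"]


def declare_incident(description: str, severity_label: str, commander: str) -> dict:
    """Create the 'INCIDENT: ...' task with labels and the Incident Commander assigned."""
    payload = {
        "title": f"INCIDENT: {description}",
        "labels": ["incident", severity_label],  # e.g. "sev-1"
        "assignee": commander,                   # the Incident Commander
    }
    response = requests.post(
        f"{API_BASE}/tasks",
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()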

Diagnosis and Resolution

INVESTIGATION PHASE:
┌─────────────────────────────────────────────────────────────┐
│ FINDING THE CAUSE                                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ STRUCTURED DIAGNOSIS:                                       │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 1. WHAT CHANGED?                                        ││
│ │    • Recent deployments                                 ││
│ │    • Configuration changes                              ││
│ │    • Traffic patterns                                   ││
│ │    • External service updates                           ││
│ │                                                         ││
│ │ 2. WHERE IS IT BREAKING?                                ││
│ │    • Which service/component                            ││
│ │    • Which region/datacenter                            ││
│ │    • Which customer segment                             ││
│ │                                                         ││
│ │ 3. WHAT ARE THE SYMPTOMS?                               ││
│ │    • Error messages in logs                             ││
│ │    • Metric anomalies                                   ││
│ │    • User-reported behavior                             ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ MITIGATION VS FIX:                                          │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ MITIGATION (DO FIRST):                                  ││
│ │ • Rollback recent deployment                            ││
│ │ • Restart service                                       ││
│ │ • Scale up resources                                    ││
│ │ • Enable feature flag bypass                            ││
│ │ • Redirect traffic                                      ││
│ │                                                         ││
│ │ GOAL: Restore service ASAP, even temporarily            ││
│ │                                                         ││
│ │ FIX (AFTER STABILIZED):                                 ││
│ │ • Identify and fix root cause                           ││
│ │ • Test fix properly                                     ││
│ │ • Deploy with monitoring                                ││
│ │                                                         ││
│ │ "It's okay to rollback now, fix properly tomorrow"      ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
└─────────────────────────────────────────────────────────────┘
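The "mitigate first, fix later" rule lends itself to a small decision helper that nudges responders toward rollback whenever something shipped recently. This is a sketch under the assumption that your deploy tooling can list recent deployments with a finish time; the function name and error-rate threshold are illustrative.

from datetime import datetime, timedelta, timezone


def suggest_mitigation(recent_deploys: list[dict], error_rate: float) -> str:
    """Suggest a first mitigation per the 'mitigation vs fix' guidance above.

    recent_deploys: newest-first dicts with a 'finished_at' datetime (assumed shape).
    error_rate: current fraction of failing requests.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
    recent = [d for d in recent_deploys if d["finished_at"] >= cutoff]

    if recent:
        # Something shipped in the last 24 hours: rolling back is usually the
        # fastest way to restore service, even if the proper fix waits until tomorrow.
        return f"Roll back the deploy finished at {recent[0]['finished_at']:%H:%M}"
    if error_rate > 0.5:
        return "Restart the affected service and scale up while investigating"
    return "No recent change found: check external dependencies and traffic patterns"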

Communication

Status Updates

COMMUNICATION PATTERNS:
┌─────────────────────────────────────────────────────────────┐
│ KEEPING STAKEHOLDERS INFORMED                               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ INTERNAL UPDATES (in GitScrum Discussions):                 │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ TEMPLATE:                                               ││
│ │                                                         ││
│ │ **[TIME] INCIDENT UPDATE - [INCIDENT NAME]**            ││
│ │                                                         ││
│ │ **Status:** Investigating / Identified / Mitigating     ││
│ │ **Impact:** [Who/what is affected]                      ││
│ │ **Current action:** [What we're doing]                  ││
│ │ **Next update:** [Time of next update]                  ││
│ │                                                         ││
│ │ Example:                                                ││
│ │ "14:35 INCIDENT UPDATE - API Latency                    ││
│ │  Status: Identified                                     ││
│ │  Impact: 50% of API requests timing out                 ││
│ │  Current: Rolling back deploy from 13:45                ││
│ │  Next update: 14:45 or when resolved"                   ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ CUSTOMER-FACING (status page):                              │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ INVESTIGATING:                                          ││
│ │ "We're investigating reports of slow response times.    ││
│ │  Our team is actively looking into the issue."          ││
│ │                                                         ││
│ │ IDENTIFIED:                                             ││
│ │ "We've identified the cause and are working on a fix.   ││
│ │  Some users may experience delays."                     ││
│ │                                                         ││
│ │ RESOLVED:                                               ││
│ │ "The issue has been resolved. All services are          ││
│ │  operating normally. We'll share details in our         ││
│ │  post-incident report."                                 ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ UPDATE FREQUENCY:                                           │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SEV-1: Every 15 minutes or on status change             ││
│ │ SEV-2: Every 30 minutes                                 ││
│ │ SEV-3: On status change                                 ││
│ │                                                         ││
│ │ Even if no progress: "Still investigating. No new info."││
│ │ Silence is worse than "no update"                       ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
└─────────────────────────────────────────────────────────────┘
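Filling the internal update template mechanically keeps updates consistent under pressure and makes the "next update" promise explicit. A minimal sketch; the helper name is made up, and the cadence argument should follow the frequency table above.

from datetime import datetime, timedelta

UPDATE_TEMPLATE = """\
**{time} INCIDENT UPDATE - {name}**

**Status:** {status}
**Impact:** {impact}
**Current action:** {action}
**Next update:** {next_update}
"""


def format_update(name: str, status: str, impact: str, action: str,
                  update_interval_min: int) -> str:
    """Render the internal update template posted in GitScrum Discussions above."""
    now = datetime.now()
    return UPDATE_TEMPLATE.format(
        time=now.strftime("%H:%M"),
        name=name,
        status=status,   # Investigating / Identified / Mitigating
        impact=impact,
        action=action,
        next_update=(now + timedelta(minutes=update_interval_min)).strftime("%H:%M"),
    )

Called with the values from the example above (name "API Latency", status "Identified", a 15-minute cadence), it produces an update equivalent to the 14:35 message.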

Post-Incident

Immediate Follow-up

AFTER RESOLUTION:
┌─────────────────────────────────────────────────────────────┐
│ CLOSING THE INCIDENT                                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ BEFORE DECLARING RESOLVED:                                  │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ ☑ Service metrics back to normal                        ││
│ │ ☑ Error rates returned to baseline                      ││
│ │ ☑ No customer reports of ongoing issues                 ││
│ │ ☑ Monitoring confirms stability (15+ minutes)           ││
│ │ ☑ Rollback/fix verified working                         ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ IMMEDIATE ACTIONS:                                          │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 1. Announce resolution internally                       ││
│ │ 2. Update status page to "Resolved"                     ││
│ │ 3. Thank the team                                       ││
│ │ 4. Schedule post-mortem (within 48 hours)               ││
│ │ 5. Update GitScrum task with resolution details         ││
│ │ 6. Create follow-up tasks for permanent fixes           ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ IN GITSCRUM:                                                │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Update incident task:                                   ││
│ │ • Move to "Done"                                        ││
│ │ • Add comment with timeline summary                     ││
│ │ • Link to post-mortem document                          ││
│ │                                                         ││
│ │ Create follow-up tasks:                                 ││
│ │ • "Fix: [Root cause]" - type/bug, priority/high         ││
│ │ • "Improve: [Prevention measure]"                       ││
│ │ • "Post-mortem: [Incident name]"                        ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
└─────────────────────────────────────────────────────────────┘
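The "before declaring resolved" checklist is easy to encode, which keeps the 15-minute stability window from being skipped in the relief of seeing graphs recover. A sketch under the assumption that your monitoring can report the current error rate and when metrics returned to baseline.

from datetime import datetime, timedelta, timezone


def ready_to_resolve(error_rate: float, baseline_error_rate: float,
                     stable_since: datetime, open_customer_reports: int) -> bool:
    """Mirror the 'before declaring resolved' checklist above."""
    stable_for = datetime.now(timezone.utc) - stable_since
    return (
        error_rate <= baseline_error_rate * 1.1    # error rates back to baseline
        and open_customer_reports == 0             # no reports of ongoing issues
        and stable_for >= timedelta(minutes=15)    # monitoring confirms stability
    )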

Blameless Post-Mortem

LEARNING FROM INCIDENTS:
┌─────────────────────────────────────────────────────────────┐
│ POST-MORTEM STRUCTURE                                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ DOCUMENT IN NOTEVAULT:                                      │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ # Post-Mortem: [Incident Name] - [Date]                 ││
│ │                                                         ││
│ │ ## Summary                                              ││
│ │ • Duration: 2h 15m                                      ││
│ │ • Impact: 45% of users affected                         ││
│ │ • Severity: SEV-1                                       ││
│ │                                                         ││
│ │ ## Timeline                                             ││
│ │ | Time  | Event                              |          ││
│ │ |-------|-------------------------------------|          ││
│ │ | 13:45 | Deployment to production           |          ││
│ │ | 14:02 | First customer report              |          ││
│ │ | 14:15 | Incident declared                  |          ││
│ │ | 14:30 | Root cause identified              |          ││
│ │ | 15:00 | Rollback completed                 |          ││
│ │ | 16:00 | Full recovery confirmed            |          ││
│ │                                                         ││
│ │ ## Root Cause                                           ││
│ │ Database migration in deployment removed index          ││
│ │ needed for primary query path.                          ││
│ │                                                         ││
│ │ ## Contributing Factors                                 ││
│ │ • No load testing on staging with production data       ││
│ │ • Deployment during peak hours                          ││
│ │ • Missing index dependency in migration review          ││
│ │                                                         ││
│ │ ## What Went Well                                       ││
│ │ • Fast detection (17 minutes)                           ││
│ │ • Clear communication throughout                        ││
│ │ • Rollback procedure worked as designed                 ││
│ │                                                         ││
│ │ ## Action Items                                         ││
│ │ • Add query performance tests to CI                     ││
│ │ • Change deploy window to low-traffic hours             ││
│ │ • Add index dependency check to migration review        ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ BLAMELESS PRINCIPLES:                                       │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ ✅ "The system allowed this to happen"                   ││
│ │ ❌ "John caused the outage"                              ││
│ │                                                         ││
│ │ ✅ "The migration review process didn't catch this"      ││
│ │ ❌ "The reviewer should have noticed"                    ││
│ │                                                         ││
│ │ Focus: How do we prevent ANY engineer from making       ││
│ │        this mistake, not "who made the mistake"         ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
└─────────────────────────────────────────────────────────────┘
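Because the scribe has already captured a timeline, most of the post-mortem skeleton can be generated rather than retyped. A sketch that emits the NoteVault outline shown above; the section content still has to be written by the team.

def postmortem_skeleton(name: str, date: str, duration: str, impact: str,
                        severity: str, timeline: list[tuple[str, str]]) -> str:
    """Build the post-mortem outline above from the scribe's (time, event) log."""
    rows = "\n".join(f"| {time} | {event} |" for time, event in timeline)
    return (
        f"# Post-Mortem: {name} - {date}\n\n"
        f"## Summary\n"
        f"* Duration: {duration}\n* Impact: {impact}\n* Severity: {severity}\n\n"
        f"## Timeline\n| Time | Event |\n|-------|-------|\n{rows}\n\n"
        "## Root Cause\n\n"
        "## Contributing Factors\n\n"
        "## What Went Well\n\n"
        "## Action Items\n"
    )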

Preparation

On-Call Readiness

BEING PREPARED:
┌─────────────────────────────────────────────────────────────┐
│ ON-CALL SETUP                                               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ ON-CALL CHECKLIST:                                          │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ ☐ Laptop charged and accessible                         ││
│ │ ☐ VPN/access working                                    ││
│ │ ☐ Alert apps installed and notifications on             ││
│ │ ☐ Know how to reach secondary on-call                   ││
│ │ ☐ Runbooks bookmarked and accessible                    ││
│ │ ☐ Recent deployments reviewed                           ││
│ │ ☐ Know current system health                            ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ RUNBOOKS IN NOTEVAULT:                                      │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Create runbooks for common scenarios:                   ││
│ │                                                         ││
│ │ ## Database Issues                                      ││
│ │ 1. Check connection pool status                         ││
│ │ 2. Review slow query log                                ││
│ │ 3. Check disk space                                     ││
│ │ 4. Restart read replicas if needed                      ││
│ │                                                         ││
│ │ ## API Latency                                          ││
│ │ 1. Check service CPU/memory                             ││
│ │ 2. Review recent deployments                            ││
│ │ 3. Check external dependency status                     ││
│ │ 4. Consider scaling or rollback                         ││
│ │                                                         ││
│ │ ## Authentication Failures                              ││
│ │ 1. Check auth service health                            ││
│ │ 2. Verify token signing key                             ││
│ │ 3. Check Redis/session store                            ││
│ │ 4. Review certificate expiry                            ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
└─────────────────────────────────────────────────────────────┘
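Part of the on-call checklist (access working, knowing current system health) can be verified with a quick preflight script at the start of a shift. A sketch with placeholder hostnames; substitute the database, API, and auth endpoints your runbooks actually cover.

import socket

# Placeholder host/port pairs; replace with your own service endpoints.
CHECKS = {
    "database": ("db.internal.example", 5432),
    "api": ("api.example.com", 443),
    "auth": ("auth.example.com", 443),
}


def on_call_preflight(timeout: float = 3.0) -> dict:
    """Quick reachability pass covering part of the on-call checklist above."""
    results = {}
    for name, (host, port) in CHECKS.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = True
        except OSError:
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in on_call_preflight().items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}")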