Handling Urgent Production Issues
Production incidents demand immediate, coordinated response. GitScrum's incident management capabilities help teams respond rapidly through dedicated escalation workflows, real-time communication integration, and structured incident tracking. The key is having rehearsed playbooks so teams act instinctively rather than figuring things out under pressure, then conducting thorough post-mortems to prevent recurrence.
Incident Response Framework
Severity Classification
INCIDENT SEVERITY LEVELS:
┌─────────────────────────────────────────────────────────────┐
│ DEFINING SEVERITY │
├─────────────────────────────────────────────────────────────┤
│ │
│ SEV-1: CRITICAL │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Definition: ││
│ │ • Core service completely down ││
│ │ • Data loss or security breach ││
│ │ • All customers affected ││
│ │ ││
│ │ Examples: ││
│ │ • Production database offline ││
│ │ • Payment processing failed ││
│ │ • Authentication system down ││
│ │ ││
│ │ Response: ││
│ │ • All hands on deck immediately ││
│ │ • Leadership notified within 5 minutes ││
│ │ • Customer communication within 15 minutes ││
│ │ • Continuous updates until resolved ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ SEV-2: HIGH │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Definition: ││
│ │ • Major feature unavailable ││
│ │ • Significant performance degradation ││
│ │ • Many customers affected ││
│ │ ││
│ │ Examples: ││
│ │ • Search functionality broken ││
│ │ • Reports not generating ││
│ │ • API response times > 30 seconds ││
│ │ ││
│ │ Response: ││
│ │ • On-call team engaged within 15 minutes ││
│ │ • During business hours: notify the team lead ││
│ │ • Updates every 30 minutes ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ SEV-3: MEDIUM │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Definition: ││
│ │ • Minor feature issues ││
│ │ • Workaround available ││
│ │ • Some customers affected ││
│ │ ││
│ │ Response: ││
│ │ • Address within current business day ││
│ │ • Track in normal workflow ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ SEV-4: LOW │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Definition: ││
│ │ • Cosmetic issues ││
│ │ • Edge cases ││
│ │ • Documentation errors ││
│ │ ││
│ │ Response: ││
│ │ • Normal backlog prioritization ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
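The table above maps cleanly onto a small data structure, which helps if you want alerting or ticket automation to apply the same thresholds consistently. Below is a minimal Python sketch; the class and field names are illustrative, and the numbers simply mirror the table, so adjust them to your own SLAs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = 1  # core service down, data loss or breach, all customers affected
    SEV2 = 2  # major feature unavailable, many customers affected
    SEV3 = 3  # minor issue, workaround exists, some customers affected
    SEV4 = 4  # cosmetic issues, edge cases, documentation errors


@dataclass(frozen=True)
class ResponsePolicy:
    page_immediately: bool                # engage on-call right away?
    leadership_notice_min: Optional[int]  # minutes until leadership is notified
    customer_notice_min: Optional[int]    # minutes until first customer update
    update_interval_min: Optional[int]    # cadence of ongoing status updates


# Mirrors the severity table above; tune the numbers to your own SLAs.
POLICIES = {
    Severity.SEV1: ResponsePolicy(True, 5, 15, 15),
    Severity.SEV2: ResponsePolicy(True, None, None, 30),
    Severity.SEV3: ResponsePolicy(False, None, None, None),  # same business day
    Severity.SEV4: ResponsePolicy(False, None, None, None),  # normal backlog
}
```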
Incident Roles
RESPONSE TEAM STRUCTURE:
┌─────────────────────────────────────────────────────────────┐
│ ROLE ASSIGNMENTS │
├─────────────────────────────────────────────────────────────┤
│ │
│ INCIDENT COMMANDER (IC): │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ WHO: On-call engineer or designated leader ││
│ │ ││
│ │ RESPONSIBILITIES: ││
│ │ • Coordinates all response activities ││
│ │ • Makes decisions on escalation ││
│ │ • Assigns tasks to team members ││
│ │ • Decides when to declare "resolved" ││
│ │ ││
│ │ NOT RESPONSIBLE FOR: ││
│ │ • Debugging code (delegates to others) ││
│ │ • Writing customer communications ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ TECHNICAL LEAD: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ WHO: Most experienced engineer available ││
│ │ ││
│ │ RESPONSIBILITIES: ││
│ │ • Investigates root cause ││
│ │ • Implements fixes ││
│ │ • Advises IC on technical decisions ││
│ │ • Coordinates with other engineers ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ COMMUNICATIONS LEAD: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ WHO: PM, support lead, or designated communicator ││
│ │ ││
│ │ RESPONSIBILITIES: ││
│ │ • Updates status page ││
│ │ • Notifies stakeholders ││
│ │ • Responds to customer inquiries ││
│ │ • Maintains incident timeline ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ SCRIBE: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ WHO: Any available team member ││
│ │ ││
│ │ RESPONSIBILITIES: ││
│ │ • Documents everything in real-time ││
│ │ • Records decisions and reasons ││
│ │ • Captures timeline of events ││
│ │ • Creates foundation for post-mortem ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
Response Workflow
Initial Response
FIRST 15 MINUTES:
┌─────────────────────────────────────────────────────────────┐
│ IMMEDIATE ACTIONS │
├─────────────────────────────────────────────────────────────┤
│ │
│ MINUTE 0-5: DETECTION & DECLARATION │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 1. Alert triggered (monitoring) or reported (user) ││
│ │ 2. On-call receives notification ││
│ │ 3. Verify incident is real (not false alarm) ││
│ │ 4. Declare severity level ││
│ │ 5. Create incident channel/task ││
│ │ ││
│ │ IN GITSCRUM: ││
│ │ Create task: "INCIDENT: [Brief description]" ││
│ │ Add labels: incident, sev-1 (or appropriate level) ││
│ │ Assign: Incident Commander ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ MINUTE 5-10: MOBILIZATION │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 1. Page relevant team members (if needed) ││
│ │ 2. Open war room (video call or chat channel) ││
│ │ 3. Assign roles: IC, Tech Lead, Comms, Scribe ││
│ │ 4. Brief everyone on known facts ││
│ │ ││
│ │ IC SAYS: ││
│ │ "We have a SEV-1: [description]. I'm IC. [Name] is ││
│ │ Tech Lead. [Name] is handling comms. [Name] is scribe. ││
│ │ Here's what we know so far..." ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ MINUTE 10-15: INITIAL INVESTIGATION │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 1. Check recent deployments (last 24 hours) ││
│ │ 2. Review monitoring dashboards ││
│ │ 3. Check external dependencies status ││
│ │ 4. Gather initial hypotheses ││
│ │ ││
│ │ TECH LEAD ASKS: ││
│ │ "Did anything deploy recently?" ││
│ │ "When did metrics start degrading?" ││
│ │ "Is this isolated or widespread?" ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
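If you automate the "create incident task" step, the declaration can be a single call from your alerting pipeline. The sketch below assumes a generic REST endpoint; the URL, payload fields, and token handling are placeholders for illustration, not GitScrum's documented API.

```python
import json
import os
import urllib.request

# Hypothetical endpoint: GitScrum's real API and schema may differ, so treat
# this as a sketch of the "create incident task + labels + assignee" step only.
API_URL = "https://api.example-gitscrum.test/projects/PROJECT_ID/tasks"


def declare_incident(description: str, severity_label: str) -> None:
    payload = {
        "title": f"INCIDENT: {description}",
        "labels": ["incident", severity_label],  # e.g. "sev-1"
        "assignee": "incident-commander",        # placeholder assignee
    }
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            # Token read from the environment; GITSCRUM_TOKEN is a placeholder name.
            "Authorization": f"Bearer {os.environ['GITSCRUM_TOKEN']}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print("Incident task created:", resp.status)


# declare_incident("Production database offline", "sev-1")
```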
Diagnosis and Resolution
INVESTIGATION PHASE:
┌─────────────────────────────────────────────────────────────┐
│ FINDING THE CAUSE │
├─────────────────────────────────────────────────────────────┤
│ │
│ STRUCTURED DIAGNOSIS: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 1. WHAT CHANGED? ││
│ │ • Recent deployments ││
│ │ • Configuration changes ││
│ │ • Traffic patterns ││
│ │ • External service updates ││
│ │ ││
│ │ 2. WHERE IS IT BREAKING? ││
│ │ • Which service/component ││
│ │ • Which region/datacenter ││
│ │ • Which customer segment ││
│ │ ││
│ │ 3. WHAT ARE THE SYMPTOMS? ││
│ │ • Error messages in logs ││
│ │ • Metric anomalies ││
│ │ • User-reported behavior ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ MITIGATION VS FIX: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ MITIGATION (DO FIRST): ││
│ │ • Roll back recent deployment ││
│ │ • Restart service ││
│ │ • Scale up resources ││
│ │ • Enable feature flag bypass ││
│ │ • Redirect traffic ││
│ │ ││
│ │ GOAL: Restore service ASAP, even temporarily ││
│ │ ││
│ │ FIX (AFTER STABILIZED): ││
│ │ • Identify and fix root cause ││
│ │ • Test fix properly ││
│ │ • Deploy with monitoring ││
│ │ ││
│ │ "It's okay to rollback now, fix properly tomorrow" ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
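One common way to implement the "feature flag bypass" mitigation is a kill switch that routes traffic back to a known-good code path without a deploy. A minimal sketch, assuming the flag is read from an environment variable (the flag name and both code paths are illustrative; most production systems read flags from a flag service):

```python
import os


def legacy_search(query: str) -> list:
    return [f"legacy result for {query!r}"]   # known-good fallback path


def new_search(query: str) -> list:
    return [f"new result for {query!r}"]      # code path suspected in the incident


def search(query: str) -> list:
    # Setting DISABLE_NEW_SEARCH=1 bypasses the suspect path immediately.
    if os.environ.get("DISABLE_NEW_SEARCH") == "1":
        return legacy_search(query)
    return new_search(query)


print(search("timeout errors"))
```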
Communication
Status Updates
COMMUNICATION PATTERNS:
┌─────────────────────────────────────────────────────────────┐
│ KEEPING STAKEHOLDERS INFORMED │
├─────────────────────────────────────────────────────────────┤
│ │
│ INTERNAL UPDATES (in GitScrum Discussions): │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ TEMPLATE: ││
│ │ ││
│ │ **[TIME] INCIDENT UPDATE - [INCIDENT NAME]** ││
│ │ ││
│ │ **Status:** Investigating / Identified / Mitigating ││
│ │ **Impact:** [Who/what is affected] ││
│ │ **Current action:** [What we're doing] ││
│ │ **Next update:** [Time of next update] ││
│ │ ││
│ │ Example: ││
│ │ "14:35 INCIDENT UPDATE - API Latency ││
│ │ Status: Identified ││
│ │ Impact: 50% of API requests timing out ││
│ │ Current: Rolling back deploy from 13:45 ││
│ │ Next update: 14:45 or when resolved" ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ CUSTOMER-FACING (status page): │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ INVESTIGATING: ││
│ │ "We're investigating reports of slow response times. ││
│ │ Our team is actively looking into the issue." ││
│ │ ││
│ │ IDENTIFIED: ││
│ │ "We've identified the cause and are working on a fix. ││
│ │ Some users may experience delays." ││
│ │ ││
│ │ RESOLVED: ││
│ │ "The issue has been resolved. All services are ││
│ │ operating normally. We'll share details in our ││
│ │ post-incident report." ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ UPDATE FREQUENCY: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SEV-1: Every 15 minutes or on status change ││
│ │ SEV-2: Every 30 minutes ││
│ │ SEV-3: On status change ││
│ │ ││
│ │ Even if no progress: "Still investigating. No new info."││
│ │ An update with nothing new still beats silence ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
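The internal update template is easy to generate programmatically, so every update has the same shape regardless of who posts it. A small sketch using the fields from the template above:

```python
from datetime import datetime


def format_update(name: str, status: str, impact: str,
                  action: str, next_update: str) -> str:
    """Render the internal incident update template shown above."""
    stamp = datetime.now().strftime("%H:%M")
    return (
        f"**{stamp} INCIDENT UPDATE - {name}**\n\n"
        f"**Status:** {status}\n"
        f"**Impact:** {impact}\n"
        f"**Current action:** {action}\n"
        f"**Next update:** {next_update}\n"
    )


print(format_update(
    "API Latency", "Identified",
    "50% of API requests timing out",
    "Rolling back deploy from 13:45",
    "14:45 or when resolved",
))
```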
Post-Incident
Immediate Follow-up
AFTER RESOLUTION:
┌─────────────────────────────────────────────────────────────┐
│ CLOSING THE INCIDENT │
├─────────────────────────────────────────────────────────────┤
│ │
│ BEFORE DECLARING RESOLVED: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ ☑ Service metrics back to normal ││
│ │ ☑ Error rates returned to baseline ││
│ │ ☑ No customer reports of ongoing issues ││
│ │ ☑ Monitoring confirms stability (15+ minutes) ││
│ │ ☑ Rollback/fix verified working ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ IMMEDIATE ACTIONS: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ 1. Announce resolution internally ││
│ │ 2. Update status page to "Resolved" ││
│ │ 3. Thank the team ││
│ │ 4. Schedule post-mortem (within 48 hours) ││
│ │ 5. Update GitScrum task with resolution details ││
│ │ 6. Create follow-up tasks for permanent fixes ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ IN GITSCRUM: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Update incident task: ││
│ │ • Move to "Done" ││
│ │ • Add comment with timeline summary ││
│ │ • Link to post-mortem document ││
│ │ ││
│ │ Create follow-up tasks: ││
│ │ • "Fix: [Root cause]" - type/bug, priority/high ││
│ │ • "Improve: [Prevention measure]" ││
│ │ • "Post-mortem: [Incident name]" ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
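The follow-up tasks listed above follow a predictable pattern, so they can be generated from the incident record rather than typed by hand. A sketch with illustrative field names rather than a documented schema:

```python
# Generates the three follow-up tasks the checklist above calls for;
# "title" and "labels" are placeholder field names, not a documented schema.
def follow_up_tasks(incident: str, root_cause: str, prevention: str) -> list:
    return [
        {"title": f"Fix: {root_cause}", "labels": ["type/bug", "priority/high"]},
        {"title": f"Improve: {prevention}", "labels": ["type/improvement"]},
        {"title": f"Post-mortem: {incident}", "labels": ["process"]},
    ]


for task in follow_up_tasks(
    "API Latency",
    "missing index on primary query path",
    "index dependency check in migration review",
):
    print(task["title"], task["labels"])
```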
Blameless Post-Mortem
LEARNING FROM INCIDENTS:
┌─────────────────────────────────────────────────────────────┐
│ POST-MORTEM STRUCTURE │
├─────────────────────────────────────────────────────────────┤
│ │
│ DOCUMENT IN NOTEVAULT: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ # Post-Mortem: [Incident Name] - [Date] ││
│ │ ││
│ │ ## Summary ││
│ │ • Duration: 2h 15m ││
│ │ • Impact: 45% of users affected ││
│ │ • Severity: SEV-1 ││
│ │ ││
│ │ ## Timeline ││
│ │ | Time | Event | ││
│ │ |-------|-------------------------------------| ││
│ │ | 13:45 | Deployment to production | ││
│ │ | 14:02 | First customer report | ││
│ │ | 14:15 | Incident declared | ││
│ │ | 14:30 | Root cause identified | ││
│ │ | 15:00 | Rollback completed | ││
│ │ | 16:00 | Full recovery confirmed | ││
│ │ ││
│ │ ## Root Cause ││
│ │ Database migration in deployment removed index ││
│ │ needed for primary query path. ││
│ │ ││
│ │ ## Contributing Factors ││
│ │ • No load testing on staging with production data ││
│ │ • Deployment during peak hours ││
│ │ • Missing index dependency in migration review ││
│ │ ││
│ │ ## What Went Well ││
│ │ • Fast detection (17 minutes) ││
│ │ • Clear communication throughout ││
│ │ • Rollback procedure worked as designed ││
│ │ ││
│ │ ## Action Items ││
│ │ • Add query performance tests to CI ││
│ │ • Change deploy window to low-traffic hours ││
│ │ • Add index dependency check to migration review ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ BLAMELESS PRINCIPLES: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ ✅ "The system allowed this to happen" ││
│ │ ❌ "John caused the outage" ││
│ │ ││
│ │ ✅ "The migration review process didn't catch this" ││
│ │ ❌ "The reviewer should have noticed" ││
│ │ ││
│ │ Focus: How do we prevent ANY engineer from making ││
│ │ this mistake, not "who made the mistake" ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
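The summary numbers should be derivable from the timeline itself, and a small helper keeps them honest. The sketch below uses the times from the example post-mortem above and reproduces its figures (17 minutes to detection, 2h 15m total duration):

```python
from datetime import datetime

# Timeline from the example post-mortem above (same day, HH:MM).
timeline = {
    "deploy": "13:45",
    "first_report": "14:02",
    "declared": "14:15",
    "recovered": "16:00",
}


def minutes_between(start: str, end: str) -> int:
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


print("Time to detection:",
      minutes_between(timeline["deploy"], timeline["first_report"]), "min")   # 17
print("Total duration:",
      minutes_between(timeline["deploy"], timeline["recovered"]), "min")      # 135 = 2h 15m
```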
Preparation
On-Call Readiness
BEING PREPARED:
┌─────────────────────────────────────────────────────────────┐
│ ON-CALL SETUP │
├─────────────────────────────────────────────────────────────┤
│ │
│ ON-CALL CHECKLIST: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ ☐ Laptop charged and accessible ││
│ │ ☐ VPN/access working ││
│ │ ☐ Alert apps installed and notifications on ││
│ │ ☐ Know how to reach secondary on-call ││
│ │ ☐ Runbooks bookmarked and accessible ││
│ │ ☐ Recent deployments reviewed ││
│ │ ☐ Know current system health ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ RUNBOOKS IN NOTEVAULT: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Create runbooks for common scenarios: ││
│ │ ││
│ │ ## Database Issues ││
│ │ 1. Check connection pool status ││
│ │ 2. Review slow query log ││
│ │ 3. Check disk space ││
│ │ 4. Restart read replicas if needed ││
│ │ ││
│ │ ## API Latency ││
│ │ 1. Check service CPU/memory ││
│ │ 2. Review recent deployments ││
│ │ 3. Check external dependency status ││
│ │ 4. Consider scaling or rollback ││
│ │ ││
│ │ ## Authentication Failures ││
│ │ 1. Check auth service health ││
│ │ 2. Verify token signing key ││
│ │ 3. Check Redis/session store ││
│ │ 4. Review certificate expiry ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
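The first steps of most runbooks (check disk, check the health endpoint) can also be scripted so the on-call engineer runs one command instead of several manual checks. A sketch, with a placeholder health URL and disk threshold:

```python
import shutil
import urllib.request

# Placeholder values: point HEALTH_URL at your own service and tune the threshold.
HEALTH_URL = "https://api.example.test/healthz"
DISK_WARN_PCT = 90


def disk_usage_pct(path: str = "/") -> float:
    total, used, _free = shutil.disk_usage(path)
    return 100 * used / total


def check() -> None:
    pct = disk_usage_pct()
    print(f"Disk usage: {pct:.1f}%" + (" (WARNING)" if pct > DISK_WARN_PCT else ""))
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            print("Health endpoint:", resp.status)
    except OSError as exc:
        print("Health endpoint unreachable:", exc)


if __name__ == "__main__":
    check()
```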