Disaster Recovery Planning
Hope for the best, plan for the worst. GitScrum helps teams document recovery procedures, track DR testing, and ensure business continuity.
DR Fundamentals
Recovery Objectives
DEFINING RECOVERY TARGETS:
┌─────────────────────────────────────────────────────────────┐
│ │
│ KEY TERMS: │
│ │
│ RTO (Recovery Time Objective): │
│ Maximum acceptable downtime │
│ "We must be back online within X hours" │
│ │
│ RPO (Recovery Point Objective): │
│ Maximum acceptable data loss │
│ "We can lose at most X hours of data" │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ SERVICE CLASSIFICATION: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Service Tier RTO RPO ││
│ │ ─────────────── ──── ────── ────── ││
│ │ Payment API 1 15 min 0 (none) ││
│ │ User Database 1 30 min 5 min ││
│ │ Main App 1 1 hour 1 hour ││
│ │ Analytics 2 4 hours 24 hours ││
│ │ Dev Environment 3 24 hours 1 week ││
│ │ Archive Storage 3 48 hours N/A ││
│ │ ││
│ │ Tier 1: Critical (immediate priority) ││
│ │ Tier 2: Important (restore after Tier 1) ││
│ │ Tier 3: Non-critical (can wait) ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ COST vs RTO: │
│ Shorter RTO = More expensive infrastructure │
│ Balance business needs with budget │
└─────────────────────────────────────────────────────────────┘
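One way to keep the classification table usable during an incident is to record each service's tier, RTO, and RPO directly on the infrastructure as tags. A minimal sketch using the AWS CLI for an RDS instance; the ARN, instance name, and tag keys (dr-tier, dr-rto, dr-rpo) are illustrative, not a GitScrum or AWS convention:

  # Record the tier and recovery targets as tags (hypothetical ARN and tag keys)
  aws rds add-tags-to-resource \
    --resource-name arn:aws:rds:us-east-1:123456789012:db:prod-db \
    --tags Key=dr-tier,Value=1 Key=dr-rto,Value=30m Key=dr-rpo,Value=5m

  # During an incident, read the targets back without hunting for the spreadsheet
  aws rds list-tags-for-resource \
    --resource-name arn:aws:rds:us-east-1:123456789012:db:prod-db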
Disaster Scenarios
SCENARIO PLANNING:
┌─────────────────────────────────────────────────────────────┐
│ │
│ COMMON DISASTER SCENARIOS: │
│ │
│ INFRASTRUCTURE: │
│ • Cloud region outage │
│ • Database server failure │
│ • Network connectivity loss │
│ • DNS failure │
│ │
│ DATA: │
│ • Database corruption │
│ • Ransomware encryption │
│ • Accidental data deletion │
│ • Backup failure │
│ │
│ APPLICATION: │
│ • Bad deployment │
│ • Configuration error │
│ • Dependency failure │
│ • Resource exhaustion │
│ │
│ SECURITY: │
│ • Account compromise │
│ • DDoS attack │
│ • Data breach │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ FOR EACH SCENARIO: │
│ • Detection: How will we know? │
│ • Response: What do we do immediately? │
│ • Recovery: How do we restore service? │
│ • Communication: Who do we notify? │
│ • Post-incident: How do we prevent recurrence? │
└─────────────────────────────────────────────────────────────┘
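For the "Detection: How will we know?" question, most of the scenarios above reduce to a health check that fails loudly. A minimal sketch of a synthetic check; the URL is a placeholder, and in practice this would run from your monitoring system (not from the host being checked) with the non-zero exit wired into alerting:

  #!/usr/bin/env bash
  # Synthetic check: fail if the app does not answer within 5 seconds (hypothetical URL)
  if ! curl -fsS --max-time 5 https://app.example.com/healthz > /dev/null; then
    echo "$(date -u +%FT%TZ) health check failed" >&2
    # hook alerting here (PagerDuty, Slack, status page, ...)
    exit 1
  fi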
DR Documentation
Recovery Runbooks
RUNBOOK STRUCTURE:
┌─────────────────────────────────────────────────────────────┐
│ │
│ DATABASE RECOVERY RUNBOOK: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ DR-RUN-001: Primary Database Recovery ││
│ │ ││
│ │ SCENARIO: ││
│ │ Primary database server is unavailable ││
│ │ ││
│ │ DETECTION: ││
│ │ • Monitoring alert: DB connection failures ││
│ │ • Application errors spike ││
│ │ • Health check failures ││
│ │ ││
│ │ IMMEDIATE ACTIONS (< 5 min): ││
│ │ 1. Verify DB is actually down (not false alarm) ││
│ │ 2. Check cloud status page ││
│ │ 3. Declare incident, start comms ││
│ │ ││
│ │ RECOVERY OPTIONS: ││
│ │ ││
│ │ OPTION A: Failover to replica (RTO: 15 min) ││
│ │ 1. Promote read replica to primary ││
│ │       aws rds promote-read-replica \                    ││
│ │         --db-instance-identifier prod-replica           ││
│ │ 2. Update connection string in secrets ││
│ │ 3. Deploy application with new connection ││
│ │ 4. Verify connectivity ││
│ │ ││
│ │ OPTION B: Restore from backup (RTO: 2 hours) ││
│ │ 1. Identify latest good backup ││
│ │ 2. Restore to new instance ││
│ │ aws rds restore-db-instance-to-point-in-time ... ││
│ │ 3. Update connection string ││
│ │ 4. Verify data integrity ││
│ │ ││
│ │ VERIFICATION: ││
│ │ ☐ Application health checks passing ││
│ │ ☐ Critical transactions working ││
│ │ ☐ Monitoring showing healthy ││
│ │ ││
│ │ CONTACTS: ││
│ │ DBA: @db-team (Slack), +1-555-DB-TEAM ││
│ │ On-call: PagerDuty escalation ││
│ │ AWS Support: Case priority Critical ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
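Option A in the runbook above can be scripted end to end so the on-call engineer is not typing commands from memory at 3 AM. A sketch assuming AWS CLI v2, a replica named prod-replica, and a Secrets Manager secret called prod/db-connection; the replica name, secret name, and secret shape are illustrative:

  # 1. Promote the read replica and wait until it is available
  aws rds promote-read-replica --db-instance-identifier prod-replica
  aws rds wait db-instance-available --db-instance-identifier prod-replica

  # 2. Point the application at the promoted instance (hypothetical secret name and JSON shape)
  NEW_HOST=$(aws rds describe-db-instances --db-instance-identifier prod-replica \
    --query 'DBInstances[0].Endpoint.Address' --output text)
  aws secretsmanager put-secret-value --secret-id prod/db-connection \
    --secret-string "{\"host\":\"$NEW_HOST\"}"

  # 3. Redeploy or restart the application so it picks up the new connection,
  #    then work through the verification checklist in the runbook.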
DR Task Tracking
DR DOCUMENTATION TASKS:
┌─────────────────────────────────────────────────────────────┐
│ │
│ DR EPIC: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ DR-001: Disaster Recovery Documentation ││
│ │ ││
│ │ Goal: Complete DR coverage for all Tier 1 systems ││
│ │ Deadline: End of Q2 ││
│ │ ││
│ │ RUNBOOKS: ││
│ │ ├── DR-RUN-001: Database recovery ✅ ││
│ │ ├── DR-RUN-002: Application failover ✅ ││
│ │ ├── DR-RUN-003: CDN/DNS failover ⏳ ││
│ │ ├── DR-RUN-004: Cache recovery ⏳ ││
│ │ ├── DR-RUN-005: Queue recovery ☐ ││
│ │ └── DR-RUN-006: Third-party fallback ☐ ││
│ │ ││
│ │ SUPPORTING DOCS: ││
│ │ ├── DR-DOC-001: Contact list ✅ ││
│ │ ├── DR-DOC-002: Communication templates ⏳ ││
│ │ ├── DR-DOC-003: Service dependency map ✅ ││
│ │ └── DR-DOC-004: Backup verification log ✅ ││
│ │ ││
│ │ TESTS: ││
│ │ ├── DR-TEST-001: Database failover test ✅ ││
│ │ ├── DR-TEST-002: Full DR drill (quarterly) ☐ ││
│ │ └── DR-TEST-003: Tabletop exercise ⏳ ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
Testing DR
Test Types
DR TESTING APPROACHES:
┌─────────────────────────────────────────────────────────────┐
│ │
│ TESTING PROGRESSION: │
│ │
│ DOCUMENTATION REVIEW (Weekly): │
│ Verify runbooks are up to date │
│ Check contact information is current │
│ Low risk, low effort │
│ │
│ BACKUP VERIFICATION (Weekly): │
│ Verify backups completed successfully │
│ Test restore to non-production environment │
│ Verify data integrity │
│ │
│ COMPONENT TESTING (Monthly): │
│ Test individual recovery procedures │
│ Database failover │
│ Application restart │
│ │
│ TABLETOP EXERCISE (Monthly): │
│ Walk through scenario without taking action │
│ Identify gaps in procedures │
│ Team discusses "what would we do if..." │
│ │
│ PARTIAL DR DRILL (Quarterly): │
│ Actually execute recovery procedures │
│ Test in production-like environment │
│ Measure actual recovery time │
│ │
│ FULL DR DRILL (Annually): │
│ Simulate major disaster │
│ Recover all systems │
│ Business continuity validation │
└─────────────────────────────────────────────────────────────┘
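The weekly backup verification is easy to automate, at least the "did it complete, and how old is it" half. A sketch for an RDS instance named prod-db (the instance name is illustrative; the restore test itself still needs a non-production environment):

  # Show the newest automated snapshot, when it was taken, and its status
  aws rds describe-db-snapshots \
    --db-instance-identifier prod-db \
    --snapshot-type automated \
    --query 'max_by(DBSnapshots, &SnapshotCreateTime).[DBSnapshotIdentifier,SnapshotCreateTime,Status]' \
    --output table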
DR Test Task
DR TEST EXECUTION:
┌─────────────────────────────────────────────────────────────┐
│ │
│ QUARTERLY DR DRILL: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ DR-TEST-Q2: Q2 Disaster Recovery Drill ││
│ │ ││
│ │  Date: April 13, 2024 (Saturday, 6 AM)                  ││
│ │ Scenario: Primary region unavailable ││
│ │ Scope: All Tier 1 services ││
│ │ ││
│ │ OBJECTIVES: ││
│ │ ☐ Validate failover to secondary region ││
│ │ ☐ Measure actual RTO vs target ││
│ │ ☐ Verify data integrity post-failover ││
│ │ ☐ Test communication procedures ││
│ │ ││
│ │ SCHEDULE: ││
│ │ 06:00 - Drill start, simulate outage ││
│ │ 06:05 - Detection and incident declaration ││
│ │ 06:15 - Begin failover procedures ││
│ │ 07:30 - Target: All services restored ││
│ │ 08:00 - Verification and wrap-up ││
│ │ 08:30 - Drill end, begin failback ││
│ │ ││
│ │ PARTICIPANTS: ││
│ │ • On-call team ││
│ │ • SRE lead (observer) ││
│ │ • Management (communication practice) ││
│ │ ││
│ │ SUCCESS CRITERIA: ││
│ │ • RTO met for all Tier 1 services ││
│ │ • No data loss beyond RPO ││
│ │ • Communication sent within 10 min ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ POST-DRILL: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ DR-TEST-Q2-RETRO: Drill Retrospective ││
│ │ ││
│ │ RESULTS: ││
│ │ RTO Target: 90 min | Actual: 85 min ✅ ││
│ │ RPO Target: 5 min | Actual: 3 min ✅ ││
│ │ ││
│ │ WHAT WENT WELL: ││
│ │ • Runbooks were accurate ││
│ │ • Team coordination was smooth ││
│ │ • Failover completed faster than expected ││
│ │ ││
│ │ WHAT NEEDS IMPROVEMENT: ││
│ │ • Communication template had wrong Slack channel ││
│ │ • Cache warmup took longer than documented ││
│ │ ││
│ │ ACTION ITEMS: ││
│ │ ☐ Update communication template ││
│ │ ☐ Add cache pre-warming to runbook ││
│ │ ☐ Schedule next drill for Q3 ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
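Measuring the actual RTO is easier if milestones are timestamped as they happen rather than reconstructed afterwards. A minimal sketch the drill coordinator can run from a terminal; the log file name and milestone labels are arbitrary:

  #!/usr/bin/env bash
  # Append a UTC timestamp for each drill milestone; diff the first and last lines for actual RTO
  LOG="dr-drill-$(date +%Y%m%d).log"
  log_milestone() { echo "$(date -u +%FT%TZ)  $1" | tee -a "$LOG"; }

  log_milestone "outage simulated"
  log_milestone "incident declared"
  log_milestone "failover started"
  log_milestone "all Tier 1 services restored"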
Communication
Incident Communication
DISASTER COMMUNICATION:
┌─────────────────────────────────────────────────────────────┐
│ │
│ COMMUNICATION PLAN: │
│ │
│ INTERNAL (Immediate): │
│ • Slack #incident channel │
│ • PagerDuty escalation │
│ • Email to leadership │
│ │
│ EXTERNAL (Within 15 min): │
│ • Status page update │
│ • Twitter/social acknowledgment │
│ • Customer support briefing │
│ │
│ ONGOING (Every 30 min during incident): │
│ • Status page updates │
│ • Internal Slack updates │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ COMMUNICATION TEMPLATES: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ INITIAL NOTIFICATION: ││
│ │ ││
│ │ Subject: [INCIDENT] Service Disruption ││
│ │ ││
│ │ We are currently experiencing service issues. ││
│ │ Impact: [describe user impact] ││
│ │ Status: Investigating ││
│ │ Next update: 30 minutes ││
│ │ ││
│ │ ───────────────────────────────────────────────────── ││
│ │ ││
│ │ UPDATE: ││
│ │ ││
│ │ We have identified the issue and are working on ││
│ │ restoration. ││
│ │ ETA: [time estimate] ││
│ │ Next update: 30 minutes ││
│ │ ││
│ │ ───────────────────────────────────────────────────── ││
│ │ ││
│ │ RESOLUTION: ││
│ │ ││
│ │ Service has been restored. All systems operational. ││
│ │ Duration: [X hours Y minutes] ││
│ │ Root cause: [brief description] ││
│ │ Post-mortem will follow within 48 hours. ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
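The internal notification can be sent from the same script that declares the incident, so hitting the 10-minute communication target does not depend on someone remembering to post. A sketch using a Slack incoming webhook; the webhook URL is a placeholder kept in an environment variable rather than in the repository:

  # Post the initial notification to the #incident channel via an incoming webhook
  curl -X POST -H 'Content-type: application/json' \
    --data '{"text":"[INCIDENT] Service Disruption - Status: Investigating. Next update in 30 minutes."}' \
    "$SLACK_INCIDENT_WEBHOOK_URL"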