Service Level Objectives
Reliability is a feature. GitScrum helps teams track SLOs alongside feature work, ensuring reliability investments are visible and prioritized.
SLO Fundamentals
SLI, SLO, SLA
SERVICE LEVEL TERMINOLOGY:
┌─────────────────────────────────────────────────────────────┐
│ │
│ SLI (Service Level Indicator): │
│ ────────────────────────────── │
│ The metric being measured │
│ │
│ Examples: │
│ • Request latency (p95) │
│ • Availability (successful requests / total requests) │
│ • Error rate │
│ • Throughput │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ SLO (Service Level Objective): │
│ ────────────────────────────── │
│ The target value for the SLI │
│ │
│ Examples: │
│ • Latency p95 < 200ms │
│ • Availability >= 99.9% │
│ • Error rate < 0.1% │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ SLA (Service Level Agreement): │
│ ────────────────────────────── │
│ Contract with consequences │
│ │
│ Examples: │
│ • "99.9% uptime or customer gets credit" │
│ • Legal commitment │
│ • Usually looser than internal SLO │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ RELATIONSHIP: │
│ │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ ││
│ │ SLI ──────→ SLO ──────→ SLA ││
│ │ (measure) (target) (contract) ││
│ │ ││
│ │ Example: ││
│ │ "Request latency" → "p95 < 200ms" → "p95 < 500ms" ││
│ │ ││
│ │ SLO is stricter than SLA (internal buffer) ││
│ │ ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
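The SLI → SLO → SLA chain can be sketched in a few lines of Python. This is a minimal illustration, not a GitScrum feature: the function name, the request counts, and the 99.5% SLA figure are assumptions chosen to show the buffer between the internal target and the contract.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityReport:
    sli_percent: float  # the measured value (SLI)
    meets_slo: bool     # internal target (SLO)
    meets_sla: bool     # contractual commitment (SLA)


def evaluate_availability(successful: int, total: int,
                          slo_percent: float = 99.9,
                          sla_percent: float = 99.5) -> AvailabilityReport:
    """Compute the availability SLI and compare it with the SLO and SLA.

    The 99.5% SLA default is a hypothetical value, looser than the SLO.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    sli = 100.0 * successful / total
    return AvailabilityReport(sli, sli >= slo_percent, sli >= sla_percent)


# Hypothetical traffic: 1,500 failed requests out of one million.
report = evaluate_availability(successful=998_500, total=1_000_000)
print(f"SLI={report.sli_percent:.2f}% "
      f"SLO met={report.meets_slo} SLA met={report.meets_sla}")
```

With these numbers the service misses its SLO while still meeting the SLA, which is the point of keeping the internal target stricter: the team gets a warning before the contract is breached.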
Defining SLOs
Choosing Metrics
CHOOSING SLIs:
┌─────────────────────────────────────────────────────────────┐
│ │
│ USER-CENTRIC SLIs: │
│ ────────────────── │
│ Choose metrics that reflect user experience │
│ │
│ AVAILABILITY: │
│ "Can users use the service?" │
│ SLI: Successful requests / Total requests │
│ SLO: 99.9% (43.2 minutes downtime allowed per 30-day month) │
│ │
│ LATENCY: │
│ "How fast does it respond?" │
│ SLI: Request duration percentiles │
│ SLO: p50 < 100ms, p95 < 200ms, p99 < 500ms │
│ │
│ CORRECTNESS: │
│ "Does it return the right answer?" │
│ SLI: Correct responses / Total responses │
│ SLO: 99.99% correct │
│ │
│ FRESHNESS: │
│ "How recent is the data?" │
│ SLI: Data age │
│ SLO: Data updated within 60 seconds │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ COMMON SLOs: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SERVICE SLI SLO ││
│ │ ─────── ─── ─── ││
│ │ API Availability 99.9% ││
│ │ API Latency (p95) < 200ms ││
│ │ Website Page load (p95) < 3s ││
│ │ Database Query time (p95) < 50ms ││
│ │ Checkout Success rate 99.5% ││
│ │ Search Latency (p95) < 500ms ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ DON'T AIM FOR 100%: │
│ ───────────────────── │
│ 100% availability = No deployments, no changes │
│ Leave room for error budget │
└─────────────────────────────────────────────────────────────┘
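A latency SLI is usually computed from raw request durations. The sketch below uses the nearest-rank percentile method, which is one of several common definitions; monitoring systems often use interpolation or histogram buckets instead. The sample data and function names are hypothetical.

```python
import math

# Targets from the latency SLO above: percentile -> limit in milliseconds.
LATENCY_SLO_MS = {50: 100, 95: 200, 99: 500}


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample that has at least
    p percent of all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def check_latency_slo(samples_ms, targets=LATENCY_SLO_MS):
    """Return {percentile: (observed_ms, meets_target)} for each target."""
    results = {}
    for p, limit_ms in targets.items():
        observed = percentile(samples_ms, p)
        results[p] = (observed, observed < limit_ms)
    return results


# Hypothetical sample: mostly fast requests with a slow tail.
samples = [80] * 90 + [150] * 6 + [600] * 3 + [900]
for p, (observed, ok) in check_latency_slo(samples).items():
    print(f"p{p}: {observed}ms {'OK' if ok else 'VIOLATED'}")
```

In this sample the median and p95 look healthy while p99 breaches its 500ms limit, which is why tracking several percentiles tells more about user experience than an average.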
Error Budgets
Managing Reliability
ERROR BUDGET CONCEPT:
┌─────────────────────────────────────────────────────────────┐
│ │
│ WHAT IS ERROR BUDGET: │
│ ───────────────────── │
│ The allowed amount of unreliability │
│ │
│ SLO: 99.9% availability │
│ Error Budget: 0.1% (100% - 99.9%) │
│ │
│ Per month (30 days): │
│ 0.1% × 43,200 minutes = 43.2 minutes of allowed downtime │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ ERROR BUDGET STATUS: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ CHECKOUT SERVICE - January ││
│ │ ││
│ │ SLO: 99.9% availability ││
│ │ Budget: 43.2 minutes ││
│ │ ││
│ │ USAGE THIS MONTH: ││
│ │ ████████████░░░░░░░░░░░░░░░░░ 35% ││
│ │ ││
│ │ Used: 15.1 minutes ││
│ │ Remaining: 28.1 minutes ││
│ │ ││
│ │ STATUS: 🟢 Healthy ││
│ │ ││
│ │ BREAKDOWN: ││
│ │ • Planned maintenance: 8 min ││
│ │ • Incident Jan 15: 5 min ││
│ │ • Deployment issues: 2.1 min ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ USING ERROR BUDGET: │
│ ─────────────────── │
│ │
│ BUDGET REMAINING: Fast iteration allowed │
│ • Deploy more frequently │
│ • Try riskier experiments │
│ • Innovate faster │
│ │
│ BUDGET LOW: Focus on reliability │
│ • Freeze non-critical changes │
│ • Fix reliability issues │
│ • Add monitoring/testing │
│ │
│ BUDGET DEPLETED: Reliability work only │
│ • Only critical fixes │
│ • Full focus on stability │
│ • Review what went wrong │
└─────────────────────────────────────────────────────────────┘
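The budget arithmetic and the three policy states above can be expressed as a short Python sketch. The 25% cut-off for "low" is an assumption for illustration; the guide does not define where "budget low" begins, so pick a threshold that fits your team.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (100.0 - slo_percent) / 100.0 * window_days * MINUTES_PER_DAY


def budget_status(budget_min: float, used_min: float,
                  low_threshold: float = 0.25) -> str:
    """Map budget consumption to a policy state.

    low_threshold is a hypothetical value: below 25% remaining counts as low.
    """
    remaining = budget_min - used_min
    if remaining <= 0:
        return "depleted"
    if remaining / budget_min < low_threshold:
        return "low"
    return "healthy"


# Checkout service example: maintenance + incident + deployment issues.
budget = error_budget_minutes(99.9)
used = 8 + 5 + 2.1
print(f"budget={budget:.1f} min, used={used:.1f} min ({used / budget:.0%}), "
      f"remaining={budget - used:.1f} min, status={budget_status(budget, used)}")
```

Run against the Checkout figures, this reproduces the 43.2-minute budget with 35% used and a healthy status.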
SLO Tracking
Monitoring SLOs
SLO DASHBOARD:
┌─────────────────────────────────────────────────────────────┐
│ │
│ SERVICE SLO STATUS │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ ││
│ │ SERVICE SLO CURRENT BUDGET LEFT ││
│ │ ─────── ─── ─────── ─────────── ││
│ │ API Gateway 99.9% 99.965% 🟢 65% ││
│ │ Checkout 99.9% 99.925% 🟡 25% ││
│ │ Search 99.5% 99.9% 🟢 80% ││
│ │ Payments 99.99% 99.992% 🟡 20% ││
│ │ Auth 99.95% 99.953% 🔴 5% ││
│ │ ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ DETAILED VIEW: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ AUTH SERVICE ││
│ │ ││
│ │ AVAILABILITY: ││
│ │ SLO: 99.95% Current: 99.953% Status: 🔴 At Risk ││
│ │ ││
│ │ ERROR BUDGET (30 days): ││
│ │ ████████████████████████████████████████░░ 95% ││
│ │ ││
│ │ Budget: 21.6 min | Used: 20.5 min | Left: 1.1 min ││
│ │ ││
│ │ LATENCY (p95): ││
│ │ SLO: < 100ms Current: 85ms Status: 🟢 OK ││
│ │ ││
│ │ RECENT INCIDENTS: ││
│ │ • Jan 20: 12 min outage (capacity issue) ││
│ │ • Jan 15: 5 min degradation (deployment) ││
│ │ • Jan 8: 3.5 min timeout (dependency) ││
│ │ ││
│ │ RECOMMENDED ACTION: ││
│ │ Freeze deployments, focus on reliability work ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
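A dashboard like this typically sums incident minutes over a rolling window. The sketch below shows one way to do that for the Auth service figures. The year 2025, the reporting date, and the December incident are assumptions added to show how older incidents drop out of the window.

```python
from datetime import date, timedelta


def budget_used_in_window(incidents, as_of, window_days=30):
    """Sum downtime minutes for incidents inside the rolling window
    (window_days before as_of, exclusive, up to as_of, inclusive)."""
    start = as_of - timedelta(days=window_days)
    return sum(minutes for day, minutes in incidents if start < day <= as_of)


# Auth service incidents (day, minutes of budget consumed).
incidents = [
    (date(2025, 1, 20), 12.0),  # capacity issue
    (date(2025, 1, 15), 5.0),   # deployment degradation
    (date(2025, 1, 8), 3.5),    # dependency timeout
    (date(2024, 12, 1), 9.0),   # hypothetical older incident, outside the window
]

budget_min = 21.6  # 99.95% SLO over 30 days
used = budget_used_in_window(incidents, as_of=date(2025, 1, 25))
left = budget_min - used
print(f"Budget: {budget_min:.1f} min | Used: {used:.1f} min | "
      f"Left: {left:.1f} min ({used / budget_min:.0%} consumed)")
```

A rolling window means the budget recovers gradually as old incidents age out, rather than resetting all at once on the first day of the month.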
SLO-Driven Decisions
Prioritizing Work
SLO-INFORMED PLANNING:
┌─────────────────────────────────────────────────────────────┐
│ │
│ SPRINT PLANNING WITH SLOs: │
│ ────────────────────────── │
│ │
│ BUDGET HEALTHY: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SPRINT 15 ALLOCATION ││
│ │ ││
│ │ Features: 70% ██████████████████████████████████ ││
│ │ Tech Debt: 20% ████████████ ││
│ │ Reliability: 10% ██████ ││
│ │ ││
│ │ Error budget healthy - Full speed ahead ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ BUDGET AT RISK: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SPRINT 15 ALLOCATION ││
│ │ ││
│ │ Features: 40% ████████████████ ││
│ │ Tech Debt: 20% ████████ ││
│ │ Reliability: 40% ████████████████ ││
│ │ ││
│ │ Increase reliability investment ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ BUDGET DEPLETED: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SPRINT 15 ALLOCATION ││
│ │ ││
│ │ Features: 0% ││
│ │ Critical: 20% ████████ ││
│ │ Reliability: 80% ████████████████████████████████ ││
│ │ ││
│ │ Freeze features, fix reliability ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ RELIABILITY WORK EXAMPLES: │
│ ────────────────────────── │
│ • Add redundancy │
│ • Improve monitoring │
│ • Add circuit breakers │
│ • Performance optimization │
│ • Chaos testing │
│ • Reduce deployment risk │
│ • Address tech debt │
└─────────────────────────────────────────────────────────────┘
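The allocation percentages above can be turned into story points for a sprint. This is a sketch under stated assumptions: the status names, the 40-point capacity, and the rule that rounding leftovers go to reliability work are all illustrative choices, not part of GitScrum.

```python
# Percent of sprint capacity per category, taken from the three scenarios above.
ALLOCATION = {
    "healthy": {"features": 70, "tech_debt": 20, "reliability": 10},
    "at_risk": {"features": 40, "tech_debt": 20, "reliability": 40},
    "depleted": {"features": 0, "critical": 20, "reliability": 80},
}


def allocate_points(capacity: int, status: str) -> dict:
    """Split sprint capacity (in points) according to error budget status."""
    shares = ALLOCATION[status]
    points = {name: capacity * pct // 100 for name, pct in shares.items()}
    # Integer division can leave a few points unassigned; as an assumption,
    # give them to reliability so the total always equals capacity.
    points["reliability"] += capacity - sum(points.values())
    return points


for status in ALLOCATION:
    print(status, allocate_points(40, status))
```

Encoding the policy as data keeps the planning rule explicit and easy to adjust when the team revisits its thresholds.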
Team Accountability
Owning SLOs
SLO OWNERSHIP:
┌─────────────────────────────────────────────────────────────┐
│ │
│ WHO OWNS SLOs: │
│ ────────────── │
│ The team that owns the service │
│ │
│ RESPONSIBILITIES: │
│ • Define appropriate SLOs │
│ • Monitor SLO status │
│ • React when budget at risk │
│ • Propose changes when SLOs too loose/tight │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ SLO REVIEW (Monthly): │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ TEAM: Checkout Team ││
│ │ DATE: January 31, 2025 ││
│ │ ││
│ │ SLO PERFORMANCE: ││
│ │ • Availability: 99.92% (SLO: 99.9%) ✅ ││
│ │ • Latency p95: 185ms (SLO: 200ms) ✅ ││
│ │ • Error rate: 0.08% (SLO: 0.1%) ✅ ││
│ │ ││
│ │ ERROR BUDGET: ││
│ │ • Started with: 43.2 min ││
│ │ • Used: 34.5 min (80%) ││
│ │ • Remaining: 8.7 min ││
│ │ ││
│ │ INCIDENTS: ││
│ │ • 2 incidents consumed 31 min ││
│ │ • Root causes addressed ││
│ │ ││
│ │ NEXT MONTH: ││
│ │ • Add database connection pooling ││
│ │ • Improve timeout handling ││
│ │ • Increase capacity headroom ││
│ │ ││
│ │ SLO CHANGES: ││
│ │ • None proposed (current SLOs appropriate) ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ TRACKING IN GITSCRUM: │
│ ───────────────────── │
│ • Reliability stories tagged [SLO] │
│ • Error budget visible in dashboard │
│ • SLO work in sprint capacity │
└─────────────────────────────────────────────────────────────┘
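During a review it helps to confirm that the reported availability and the budget figures agree. The sketch below converts downtime minutes into an availability percentage, assuming the same 30-day window used for the 43.2-minute budget; the function name is hypothetical.

```python
def availability_percent(downtime_minutes: float, window_days: int = 30) -> float:
    """Availability over the window, given total downtime in minutes."""
    total_minutes = window_days * 24 * 60
    return 100.0 * (1 - downtime_minutes / total_minutes)


# Checkout review: 34.5 minutes used out of a 43.2-minute budget.
used_min, budget_min = 34.5, 43.2
print(f"availability={availability_percent(used_min):.2f}% "
      f"budget used={used_min / budget_min:.0%}")
```

For the Checkout review this gives 99.92% availability and 80% of the budget used, matching the figures in the report.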