Service Level Objectives | SLO & Error Budget Guide
Define SLIs, SLOs, and error budgets for reliable applications. GitScrum tracks SLO status, budget consumption, and reliability work prioritization.
Reliability is a feature. GitScrum helps teams track SLOs alongside feature work, ensuring reliability investments are visible and prioritized.
SLO Fundamentals
SLI, SLO, SLA
SERVICE LEVEL TERMINOLOGY:

SLI (Service Level Indicator):
  The metric being measured

  Examples:
  • Request latency (p95)
  • Availability (successful requests / total requests)
  • Error rate
  • Throughput

SLO (Service Level Objective):
  The target value for the SLI

  Examples:
  • Latency p95 < 200ms
  • Availability >= 99.9%
  • Error rate < 0.1%

SLA (Service Level Agreement):
  Contract with consequences

  Examples:
  • "99.9% uptime or customer gets credit"
  • Legal commitment
  • Usually looser than internal SLO

RELATIONSHIP:

  SLI        →  SLO       →  SLA
  (measure)     (target)     (contract)

  Example:
  "Request latency" → "p95 < 200ms" → "99% < 500ms"

  SLO is stricter than SLA (internal buffer)
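The relationship above can be sketched in a few lines of Python: an SLI is computed from raw counts, then compared against an internal SLO and a looser SLA. This is a minimal illustration; the `Target` type, the request counts, and the 99.5% SLA value are assumptions made for the example, not figures from a real service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A named availability threshold in percent (hypothetical helper type)."""
    name: str
    min_availability_pct: float


def availability_sli(successful: int, total: int) -> float:
    """SLI: successful requests / total requests, expressed as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successful / total


def meets(sli_pct: float, target: Target) -> bool:
    """True when the measured SLI is at or above the target."""
    return sli_pct >= target.min_availability_pct


# Illustrative numbers only.
internal_slo = Target("SLO", 99.9)  # internal target
external_sla = Target("SLA", 99.5)  # looser contractual commitment (assumed value)

sli = availability_sli(successful=99_870, total=100_000)
slo_ok = meets(sli, internal_slo)  # the internal target is missed
sla_ok = meets(sli, external_sla)  # the contract is still honored
```

With these numbers the service misses its SLO while staying inside the SLA, which is the buffer the stricter internal target is meant to provide.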
Defining SLOs
Choosing Metrics
CHOOSING SLIs:

USER-CENTRIC SLIs:
  Choose metrics that reflect user experience

  AVAILABILITY:
  "Can users use the service?"
  SLI: Successful requests / Total requests
  SLO: 99.9% (43.2 minutes of downtime allowed per 30-day month)

  LATENCY:
  "How fast does it respond?"
  SLI: Request duration percentiles
  SLO: p50 < 100ms, p95 < 200ms, p99 < 500ms

  CORRECTNESS:
  "Does it return the right answer?"
  SLI: Correct responses / Total responses
  SLO: 99.99% correct

  FRESHNESS:
  "How recent is the data?"
  SLI: Data age
  SLO: Data updated within 60 seconds

COMMON SLOs:

  SERVICE     SLI                 SLO
  ───────     ───                 ───
  API         Availability        99.9%
  API         Latency (p95)       < 200ms
  Website     Page load (p95)     < 3s
  Database    Query time (p95)    < 50ms
  Checkout    Success rate        99.5%
  Search      Latency (p95)       < 500ms

DON'T AIM FOR 100%:
  100% availability = No deployments, no changes
  Leave room for error budget
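To make the latency SLI concrete, the sketch below computes p50, p95, and p99 from a list of request durations and checks them against the latency SLO listed above. It uses the nearest-rank percentile definition, which is one of several common conventions; monitoring systems may interpolate differently. The sample durations are invented for illustration.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Latency SLO from the text: p50 < 100ms, p95 < 200ms, p99 < 500ms.
LATENCY_SLO_MS = {50: 100, 95: 200, 99: 500}

# Invented request durations in milliseconds.
latencies_ms = [42, 55, 61, 48, 73, 90, 120, 66, 58, 77,
                81, 95, 110, 150, 64, 70, 52, 88, 180, 450]

measured = {pct: percentile(latencies_ms, pct) for pct in LATENCY_SLO_MS}
slo_met = {pct: measured[pct] < limit for pct, limit in LATENCY_SLO_MS.items()}
```

Note how a single 450ms request dominates p99 while leaving p50 almost untouched, which is why the SLO sets a separate target for each percentile.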
Error Budgets
Managing Reliability
ERROR BUDGET CONCEPT:

WHAT IS ERROR BUDGET:
  The allowed amount of unreliability

  SLO: 99.9% availability
  Error Budget: 0.1% (100% - 99.9%)

  Per month (30 days):
  0.1% × 43,200 minutes = 43.2 minutes of allowed downtime

ERROR BUDGET STATUS:

  CHECKOUT SERVICE - January

  SLO: 99.9% availability
  Budget: 43.2 minutes

  USAGE THIS MONTH: 35%

  Used: 15.1 minutes
  Remaining: 28.1 minutes

  STATUS: 🟢 Healthy

  BREAKDOWN:
  • Planned maintenance: 8 min
  • Incident Jan 15: 5 min
  • Deployment issues: 2.1 min

USING ERROR BUDGET:

  BUDGET REMAINING: Fast iteration allowed
  • Deploy more frequently
  • Try riskier experiments
  • Innovate faster

  BUDGET LOW: Focus on reliability
  • Freeze non-critical changes
  • Fix reliability issues
  • Add monitoring/testing

  BUDGET DEPLETED: Reliability work only
  • Only critical fixes
  • Full focus on stability
  • Review what went wrong
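The budget arithmetic and the three operating modes above can be sketched as follows. The 30% cut-off for "low" is an assumption made for this example; the text does not say where a budget stops being healthy, and each team would pick its own threshold.

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return (100.0 - slo_pct) / 100.0 * window_minutes


def budget_policy(budget_min: float, used_min: float,
                  low_threshold: float = 0.30) -> str:
    """Map the remaining budget to one of the three modes described above.

    low_threshold is an assumed cut-off (30% of the budget remaining).
    """
    if budget_min <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    remaining_fraction = (budget_min - used_min) / budget_min
    if remaining_fraction <= 0:
        return "depleted"   # reliability work only
    if remaining_fraction < low_threshold:
        return "low"        # focus on reliability
    return "remaining"      # fast iteration allowed


budget = error_budget_minutes(99.9)              # about 43.2 minutes
checkout_mode = budget_policy(budget, 15.1)      # Checkout figures from the example
```

With the Checkout figures (15.1 of 43.2 minutes used), about 65% of the budget is left, so the function returns "remaining".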
SLO Tracking
Monitoring SLOs
SLO DASHBOARD:

SERVICE SLO STATUS:

  SERVICE        SLO       CURRENT    BUDGET LEFT
  ───────        ───       ───────    ───────────
  API Gateway    99.9%     99.95%     🟢 65%
  Checkout       99.9%     99.85%     🟡 25%
  Search         99.5%     99.7%      🟢 80%
  Payments       99.99%    99.98%     🟡 20%
  Auth           99.95%    99.91%     🔴 5%

DETAILED VIEW:

  AUTH SERVICE

  AVAILABILITY:
  SLO: 99.95%   Current: 99.91%   Status: 🔴 At Risk

  ERROR BUDGET (30 days): 95% used

  Budget: 21.6 min | Used: 20.5 min | Left: 1.1 min

  LATENCY (p95):
  SLO: < 100ms   Current: 85ms   Status: 🟢 OK

  RECENT INCIDENTS:
  • Jan 20: 12 min outage (capacity issue)
  • Jan 15: 5 min degradation (deployment)
  • Jan 8: 3.5 min timeout (dependency)

  RECOMMENDED ACTION:
  Freeze deployments, focus on reliability work
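A dashboard row like the Auth detail view can be derived from an incident log. The sketch below sums incident minutes, compares them with the 30-day budget, and assigns a status color. The color thresholds (at least 50% left is green, at least 10% is yellow, otherwise red) are assumptions chosen to match the sample rows; they are not a documented GitScrum rule.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window, as in the dashboard


def budget_row(slo_pct: float, incident_minutes: list[float]) -> dict:
    """Summarize error budget consumption for one service."""
    budget = (100.0 - slo_pct) / 100.0 * WINDOW_MINUTES
    used = sum(incident_minutes)
    used_pct = 100.0 * used / budget
    left_pct = 100.0 - used_pct

    # Assumed thresholds, picked to agree with the sample dashboard.
    if left_pct >= 50:
        status = "green"
    elif left_pct >= 10:
        status = "yellow"
    else:
        status = "red"

    return {
        "budget_min": round(budget, 1),
        "used_min": round(used, 1),
        "left_min": round(budget - used, 1),
        "used_pct": round(used_pct),
        "status": status,
    }


# Auth service: the three incidents listed in the detail view.
auth = budget_row(99.95, [12, 5, 3.5])
```

For the Auth incidents this reproduces the figures in the detail view: a 21.6 minute budget, 20.5 minutes used (95%), and 1.1 minutes left.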
SLO-Driven Decisions
Prioritizing Work
SLO-INFORMED PLANNING:

SPRINT PLANNING WITH SLOs:

  BUDGET HEALTHY:
    SPRINT 15 ALLOCATION

    Features:     70%
    Tech Debt:    20%
    Reliability:  10%

    Error budget healthy - Full speed ahead

  BUDGET AT RISK:
    SPRINT 15 ALLOCATION

    Features:     40%
    Tech Debt:    20%
    Reliability:  40%

    Increase reliability investment

  BUDGET DEPLETED:
    SPRINT 15 ALLOCATION

    Features:      0%
    Critical:     20%
    Reliability:  80%

    Freeze features, fix reliability

RELIABILITY WORK EXAMPLES:
  • Add redundancy
  • Improve monitoring
  • Add circuit breakers
  • Performance optimization
  • Chaos testing
  • Reduce deployment risk
  • Address tech debt
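The three allocation scenarios translate directly into a lookup table. The sketch below splits a sprint's capacity according to the budget state; the state names and the 50-point capacity are hypothetical, and the percentages are the ones shown above.

```python
# Percent of sprint capacity per work type, keyed by error budget state.
ALLOCATIONS = {
    "healthy":  {"Features": 70, "Tech Debt": 20, "Reliability": 10},
    "at_risk":  {"Features": 40, "Tech Debt": 20, "Reliability": 40},
    "depleted": {"Features": 0, "Critical": 20, "Reliability": 80},
}


def plan_sprint(budget_state: str, capacity_points: float) -> dict[str, float]:
    """Split sprint capacity (for example, story points) by budget state."""
    if budget_state not in ALLOCATIONS:
        raise ValueError(f"unknown budget state: {budget_state!r}")
    shares = ALLOCATIONS[budget_state]
    return {work: capacity_points * pct / 100 for work, pct in shares.items()}


# Hypothetical team with 50 points of capacity and a budget at risk.
plan = plan_sprint("at_risk", 50)
```

Keeping the split in data rather than in conditionals makes it easy for a team to adjust the percentages during an SLO review without touching the planning logic.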
Team Accountability
Owning SLOs
SLO OWNERSHIP:

WHO OWNS SLOs:
  The team that owns the service

  RESPONSIBILITIES:
  • Define appropriate SLOs
  • Monitor SLO status
  • React when budget at risk
  • Propose changes when SLOs too loose/tight

SLO REVIEW (Monthly):

  TEAM: Checkout Team
  DATE: January 31, 2025

  SLO PERFORMANCE:
  • Availability: 99.92% (SLO: 99.9%) ✅
  • Latency p95: 185ms (SLO: 200ms) ✅
  • Error rate: 0.08% (SLO: 0.1%) ✅

  ERROR BUDGET:
  • Started with: 43.2 min
  • Used: 34.5 min (80%)
  • Remaining: 8.7 min

  INCIDENTS:
  • 2 incidents consumed 31 min
  • Root causes addressed

  NEXT MONTH:
  • Add database connection pooling
  • Improve timeout handling
  • Increase capacity headroom

  SLO CHANGES:
  • None proposed (current SLOs appropriate)

TRACKING IN GITSCRUM:
  • Reliability stories tagged [SLO]
  • Error budget visible in dashboard
  • SLO work in sprint capacity
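The pass/fail column of a monthly review can be generated from measured values. The sketch below models each objective with a direction, since availability must stay at or above its target while latency and error rate must stay below theirs. The `Objective` type is a hypothetical helper for illustration and is not part of any GitScrum API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """One SLO line in a review (hypothetical helper type)."""
    name: str
    target: float
    unit: str
    higher_is_better: bool

    def is_met(self, measured: float) -> bool:
        # Availability-style targets use >=, latency and error rate use <.
        if self.higher_is_better:
            return measured >= self.target
        return measured < self.target


# Targets and measurements from the Checkout Team review above.
review = [
    (Objective("Availability", 99.9, "%", True), 99.92),
    (Objective("Latency p95", 200, "ms", False), 185),
    (Objective("Error rate", 0.1, "%", False), 0.08),
]

results = {obj.name: obj.is_met(measured) for obj, measured in review}
report = [
    f"{obj.name}: {measured}{obj.unit} (SLO: {obj.target}{obj.unit}) "
    f"{'met' if obj.is_met(measured) else 'missed'}"
    for obj, measured in review
]
```

Every objective in the January review is met, which matches the check marks in the review and supports its conclusion that no SLO changes are needed.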