Service Level Objectives

Reliability is a feature. GitScrum helps teams track SLOs alongside feature work, ensuring reliability investments are visible and prioritized.

SLO Fundamentals

SLI, SLO, SLA

SERVICE LEVEL TERMINOLOGY:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ SLI (Service Level Indicator):                             │
│ ──────────────────────────────                              │
│ The metric being measured                                 │
│                                                             │
│ Examples:                                                   │
│ • Request latency (p95)                                   │
│ • Availability (successful requests / total requests)     │
│ • Error rate                                               │
│ • Throughput                                               │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ SLO (Service Level Objective):                             │
│ ──────────────────────────────                              │
│ The target value for the SLI                              │
│                                                             │
│ Examples:                                                   │
│ • Latency p95 < 200ms                                     │
│ • Availability >= 99.9%                                   │
│ • Error rate < 0.1%                                       │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ SLA (Service Level Agreement):                             │
│ ──────────────────────────────                              │
│ Contract with consequences                                │
│                                                             │
│ Examples:                                                   │
│ • "99.9% uptime or customer gets credit"                 │
│ • Legal commitment                                        │
│ • Usually looser than internal SLO                        │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ RELATIONSHIP:                                               │
│                                                             │
│ ┌─────────────────────────────────────────────────────────┐│
│ │                                                         ││
│ │         SLI ──────→ SLO ──────→ SLA                    ││
│ │      (measure)   (target)   (contract)                 ││
│ │                                                         ││
│ │  Example:                                               ││
│ │  "Request latency" → "p95 < 200ms" → "99% < 500ms"   ││
│ │                                                         ││
│ │  SLO is stricter than SLA (internal buffer)            ││
│ │                                                         ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
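
To make the SLI → SLO → SLA chain concrete, here is a minimal sketch that measures an availability SLI and compares it with an internal SLO and a looser SLA. The class name `AvailabilityCheck`, the `evaluate` helper, the request counts, and the 99.5% SLA figure are all assumptions for illustration; only the 99.9% SLO comes from the examples above.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityCheck:
    """Hypothetical container for one availability measurement."""
    successful_requests: int
    total_requests: int

    @property
    def sli(self) -> float:
        """Availability SLI: successful requests / total requests."""
        if self.total_requests == 0:
            return 1.0  # assumption: no traffic counts as fully available
        return self.successful_requests / self.total_requests


def evaluate(sli: float, slo_target: float, sla_target: float) -> str:
    """Compare a measured SLI with the internal SLO and the contractual SLA."""
    if sli >= slo_target:
        return "meets SLO"
    if sli >= sla_target:
        return "misses SLO, still within SLA"
    return "breaches SLA"


# Made-up request counts; the SLA target of 99.5% is an assumed value.
check = AvailabilityCheck(successful_requests=999_200, total_requests=1_000_000)
print(f"SLI = {check.sli:.4%}")
print(evaluate(check.sli, slo_target=0.999, sla_target=0.995))
```

The gap between the two targets is the internal buffer: a service can miss its SLO and trigger reliability work well before any contractual consequence applies.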

Defining SLOs

Choosing Metrics

CHOOSING SLIs:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ USER-CENTRIC SLIs:                                          │
│ ──────────────────                                          │
│ Choose metrics that reflect user experience               │
│                                                             │
│ AVAILABILITY:                                               │
│ "Can users use the service?"                              │
│ SLI: Successful requests / Total requests                 │
│ SLO: 99.9% (43.2 min downtime per 30-day month)          │
│                                                             │
│ LATENCY:                                                    │
│ "How fast does it respond?"                               │
│ SLI: Request duration percentiles                         │
│ SLO: p50 < 100ms, p95 < 200ms, p99 < 500ms               │
│                                                             │
│ CORRECTNESS:                                                │
│ "Does it return the right answer?"                        │
│ SLI: Correct responses / Total responses                  │
│ SLO: 99.99% correct                                       │
│                                                             │
│ FRESHNESS:                                                  │
│ "How recent is the data?"                                 │
│ SLI: Data age                                              │
│ SLO: Data updated within 60 seconds                       │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ COMMON SLOs:                                                │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SERVICE     SLI               SLO                      ││
│ │ ───────     ───               ───                      ││
│ │ API         Availability      99.9%                    ││
│ │ API         Latency (p95)     < 200ms                 ││
│ │ Website     Page load (p95)   < 3s                    ││
│ │ Database    Query time (p95)  < 50ms                  ││
│ │ Checkout    Success rate      99.5%                   ││
│ │ Search      Latency (p95)     < 500ms                 ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ DON'T AIM FOR 100%:                                         │
│ ─────────────────────                                       │
│ 100% availability = No deployments, no changes            │
│ Leave room for error budget                               │
└─────────────────────────────────────────────────────────────┘
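
A latency SLI is a percentile over request durations. The sketch below uses the nearest-rank method, which is one of several common percentile definitions and may differ slightly from what a monitoring system reports. The sample values and the function names `percentile` and `latency_slo_report` are hypothetical; the targets are the p50/p95/p99 values listed above.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_slo_report(samples_ms, targets_ms):
    """Return {percentile: (measured_ms, target_ms, ok)} for each target."""
    report = {}
    for pct, target in targets_ms.items():
        measured = percentile(samples_ms, pct)
        report[pct] = (measured, target, measured < target)
    return report


# Made-up latency samples in milliseconds (illustrative only).
samples = [42, 55, 61, 70, 78, 85, 90, 96, 98, 99,
           110, 120, 130, 140, 150, 160, 170, 180, 210, 480]
targets = {50: 100, 95: 200, 99: 500}  # p50 < 100ms, p95 < 200ms, p99 < 500ms

for pct, (measured, target, ok) in latency_slo_report(samples, targets).items():
    print(f"p{pct}: {measured}ms (target < {target}ms) {'OK' if ok else 'MISS'}")
```

Tracking several percentiles matters because a healthy median can hide a slow tail, which is exactly what the p95 check in this sample data catches.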

Error Budgets

Managing Reliability

ERROR BUDGET CONCEPT:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ WHAT IS ERROR BUDGET:                                       │
│ ─────────────────────                                       │
│ The allowed amount of unreliability                       │
│                                                             │
│ SLO: 99.9% availability                                   │
│ Error Budget: 0.1% (100% - 99.9%)                         │
│                                                             │
│ Per month (30 days):                                       │
│ 0.1% × 43,200 minutes = 43.2 minutes of allowed downtime │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ ERROR BUDGET STATUS:                                        │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ CHECKOUT SERVICE - January                             ││
│ │                                                         ││
│ │ SLO: 99.9% availability                                ││
│ │ Budget: 43.2 minutes                                   ││
│ │                                                         ││
│ │ USAGE THIS MONTH:                                        ││
│ │ ██████████░░░░░░░░░░░░░░░░░░░  35%                     ││
│ │                                                         ││
│ │ Used: 15.1 minutes                                     ││
│ │ Remaining: 28.1 minutes                                ││
│ │                                                         ││
│ │ STATUS: 🟢 Healthy                                      ││
│ │                                                         ││
│ │ BREAKDOWN:                                               ││
│ │ • Planned maintenance: 8 min                          ││
│ │ • Incident Jan 15: 5 min                              ││
│ │ • Deployment issues: 2.1 min                          ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ USING ERROR BUDGET:                                         │
│ ───────────────────                                         │
│                                                             │
│ BUDGET REMAINING: Fast iteration allowed                  │
│ • Deploy more frequently                                  │
│ • Try riskier experiments                                 │
│ • Innovate faster                                          │
│                                                             │
│ BUDGET LOW: Focus on reliability                          │
│ • Freeze non-critical changes                             │
│ • Fix reliability issues                                  │
│ • Add monitoring/testing                                  │
│                                                             │
│ BUDGET DEPLETED: Reliability work only                    │
│ • Only critical fixes                                     │
│ • Full focus on stability                                 │
│ • Review what went wrong                                  │
└─────────────────────────────────────────────────────────────┘
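
The error budget arithmetic above can be sketched in a few lines. The function names are hypothetical, and the 25% "low" cut-off is an assumption, since the guide does not state where "budget low" begins; the SLO, window, and usage figures are the checkout example from the box.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


def budget_status(used_min: float, budget_min: float,
                  low_threshold: float = 0.25) -> str:
    """Classify the budget. low_threshold is an assumed cut-off, not from the guide."""
    remaining_fraction = 1 - used_min / budget_min
    if remaining_fraction <= 0:
        return "depleted"
    if remaining_fraction <= low_threshold:
        return "low"
    return "healthy"


budget = error_budget_minutes(0.999)  # 99.9% over 30 days
usage = {
    "planned maintenance": 8.0,
    "incident Jan 15": 5.0,
    "deployment issues": 2.1,
}
used = sum(usage.values())
print(f"budget {budget:.1f} min, used {used:.1f} min ({used / budget:.0%}), "
      f"remaining {budget - used:.1f} min, status {budget_status(used, budget)}")
```

Teams would normally agree on the thresholds in an error budget policy rather than hard-coding them as done here.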

SLO Tracking

Monitoring SLOs

SLO DASHBOARD:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ SERVICE SLO STATUS                                          │
│ ┌─────────────────────────────────────────────────────────┐│
│ │                                                         ││
│ │ SERVICE          SLO          CURRENT    BUDGET LEFT   ││
│ │ ───────          ───          ───────    ───────────   ││
│ │ API Gateway      99.9%        99.965%    🟢 65%        ││
│ │ Checkout         99.9%        99.925%    🟡 25%        ││
│ │ Search           99.5%        99.9%      🟢 80%        ││
│ │ Payments         99.99%       99.992%    🟡 20%        ││
│ │ Auth             99.95%       99.953%    🔴 5%         ││
│ │                                                         ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ DETAILED VIEW:                                              │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ AUTH SERVICE                                            ││
│ │                                                         ││
│ │ AVAILABILITY:                                            ││
│ │ SLO: 99.95%  Current: 99.953%  Status: 🔴 At Risk     ││
│ │                                                         ││
│ │ ERROR BUDGET (30 days):                                  ││
│ │ ████████████████████████████████████████░░  95% used   ││
│ │                                                         ││
│ │ Budget: 21.6 min | Used: 20.5 min | Left: 1.1 min    ││
│ │                                                         ││
│ │ LATENCY (p95):                                          ││
│ │ SLO: < 100ms  Current: 85ms  Status: 🟢 OK            ││
│ │                                                         ││
│ │ RECENT INCIDENTS:                                        ││
│ │ • Jan 20: 12 min outage (capacity issue)              ││
│ │ • Jan 15: 5 min degradation (deployment)              ││
│ │ • Jan 8: 3.5 min timeout (dependency)                 ││
│ │                                                         ││
│ │ RECOMMENDED ACTION:                                      ││
│ │ Freeze deployments, focus on reliability work          ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
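
A dashboard row like the ones above can be derived from just the SLO target and the incident durations in the window. This is a rough sketch, not GitScrum's implementation: `slo_row` and `marker` are hypothetical helpers, and the 10% and 30% colour cut-offs are assumptions chosen to match the statuses shown. The incident minutes are the Auth service figures from the detailed view.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window, as in the guide


def slo_row(name, slo, downtime_minutes):
    """Build one dashboard row from an SLO target and incident durations."""
    used = sum(downtime_minutes)
    budget = (1 - slo) * WINDOW_MINUTES
    return {
        "service": name,
        "slo": slo,
        "current": 1 - used / WINDOW_MINUTES,  # measured availability
        "budget_min": budget,
        "used_min": used,
        "budget_left": 1 - used / budget,      # fraction of budget remaining
    }


def marker(budget_left):
    """Traffic-light label; the 10% and 30% cut-offs are assumptions."""
    if budget_left < 0.10:
        return "red"
    if budget_left < 0.30:
        return "yellow"
    return "green"


# Auth service incidents: 12 min outage, 5 min degradation, 3.5 min timeout.
row = slo_row("Auth", 0.9995, [12, 5, 3.5])
print(f"{row['service']}: current {row['current']:.3%}, "
      f"budget {row['budget_min']:.1f} min, used {row['used_min']:.1f} min, "
      f"left {row['budget_left']:.0%} ({marker(row['budget_left'])})")
```

Deriving both columns from the same downtime figure keeps "current availability" and "budget left" consistent with each other over the same window.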

SLO-Driven Decisions

Prioritizing Work

SLO-INFORMED PLANNING:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ SPRINT PLANNING WITH SLOs:                                 │
│ ──────────────────────────                                  │
│                                                             │
│ BUDGET HEALTHY:                                             │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SPRINT 15 ALLOCATION                                    ││
│ │                                                         ││
│ │ Features:     70%  ████████████████████████████        ││
│ │ Tech Debt:    20%  ████████                            ││
│ │ Reliability:  10%  ████                                ││
│ │                                                         ││
│ │ Error budget healthy - Full speed ahead                ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ BUDGET AT RISK:                                             │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SPRINT 15 ALLOCATION                                    ││
│ │                                                         ││
│ │ Features:     40%  ████████████████                    ││
│ │ Tech Debt:    20%  ████████                            ││
│ │ Reliability:  40%  ████████████████                    ││
│ │                                                         ││
│ │ Increase reliability investment                        ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ BUDGET DEPLETED:                                            │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ SPRINT 15 ALLOCATION                                    ││
│ │                                                         ││
│ │ Features:     0%                                       ││
│ │ Critical:     20%  ████████                            ││
│ │ Reliability:  80%  ████████████████████████████████    ││
│ │                                                         ││
│ │ Freeze features, fix reliability                       ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ RELIABILITY WORK EXAMPLES:                                  │
│ ──────────────────────────                                  │
│ • Add redundancy                                          │
│ • Improve monitoring                                      │
│ • Add circuit breakers                                    │
│ • Performance optimization                                │
│ • Chaos testing                                            │
│ • Reduce deployment risk                                  │
│ • Address tech debt                                       │
└─────────────────────────────────────────────────────────────┘
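
The three allocation scenarios above can be expressed as a lookup keyed on the error budget state. The sketch below is illustrative: `plan_sprint` is a hypothetical helper, the 30% "at risk" cut-off is an assumption, and sprint capacity in story points is made up. The percentage splits are the ones shown in the boxes.

```python
# Percentage splits taken from the three scenarios in the guide.
ALLOCATIONS = {
    "healthy":  {"features": 70, "tech_debt": 20, "reliability": 10},
    "at_risk":  {"features": 40, "tech_debt": 20, "reliability": 40},
    "depleted": {"features": 0, "critical": 20, "reliability": 80},
}


def plan_sprint(budget_left, capacity_points, at_risk_below=0.30):
    """Map remaining error budget (0.0 to 1.0) to a sprint point allocation.

    at_risk_below is an assumed threshold, not a value from the guide.
    """
    if budget_left <= 0:
        state = "depleted"
    elif budget_left < at_risk_below:
        state = "at_risk"
    else:
        state = "healthy"
    shares = ALLOCATIONS[state]
    points = {k: capacity_points * pct // 100 for k, pct in shares.items()}
    # Integer division can leave a few points unassigned; give them to reliability.
    points["reliability"] += capacity_points - sum(points.values())
    return state, points


for left in (0.65, 0.20, 0.0):
    state, points = plan_sprint(left, capacity_points=50)
    print(f"budget left {left:.0%}: {state} -> {points}")
```

Treating the split as data rather than logic makes it easy for a team to adjust the percentages during a retrospective without touching the planning code.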

Team Accountability

Owning SLOs

SLO OWNERSHIP:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│ WHO OWNS SLOs:                                              │
│ ──────────────                                              │
│ The team that owns the service                            │
│                                                             │
│ RESPONSIBILITIES:                                           │
│ • Define appropriate SLOs                                 │
│ • Monitor SLO status                                      │
│ • React when the budget is at risk                        │
│ • Propose changes when SLOs are too loose or too tight    │
│                                                             │
│ ─────────────────────────────────────────────────────────── │
│                                                             │
│ SLO REVIEW (Monthly):                                       │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ TEAM: Checkout Team                                     ││
│ │ DATE: January 31, 2025                                  ││
│ │                                                         ││
│ │ SLO PERFORMANCE:                                         ││
│ │ • Availability: 99.92% (SLO: 99.9%) ✅                 ││
│ │ • Latency p95: 185ms (SLO: 200ms) ✅                   ││
│ │ • Error rate: 0.08% (SLO: 0.1%) ✅                     ││
│ │                                                         ││
│ │ ERROR BUDGET:                                            ││
│ │ • Started with: 43.2 min                               ││
│ │ • Used: 34.5 min (80%)                                 ││
│ │ • Remaining: 8.7 min                                   ││
│ │                                                         ││
│ │ INCIDENTS:                                               ││
│ │ • 2 incidents consumed 31 min                         ││
│ │ • Root causes addressed                                ││
│ │                                                         ││
│ │ NEXT MONTH:                                              ││
│ │ • Add database connection pooling                      ││
│ │ • Improve timeout handling                             ││
│ │ • Increase capacity headroom                           ││
│ │                                                         ││
│ │ SLO CHANGES:                                             ││
│ │ • None proposed (current SLOs appropriate)             ││
│ └─────────────────────────────────────────────────────────┘│
│                                                             │
│ TRACKING IN GITSCRUM:                                       │
│ ─────────────────────                                       │
│ • Reliability stories tagged [SLO]                        │
│ • Error budget visible in dashboard                       │
│ • SLO work in sprint capacity                             │
└─────────────────────────────────────────────────────────────┘