Monitoring and Observability Projects
Good monitoring surfaces problems early, often before users notice them, and shortens the time it takes to debug an incident. GitScrum helps teams plan observability work and track monitoring improvements alongside feature development.
Observability Fundamentals
Three Pillars
OBSERVABILITY PILLARS:
┌─────────────────────────────────────────────────────────────┐
│ │
│ METRICS: │
│ Numerical measurements over time │
│ • Request count, latency, error rate │
│ • CPU, memory, disk usage │
│ • Business metrics (orders, signups) │
│ │
│ Use for: Dashboards, alerting, trends │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ LOGS: │
│ Discrete events with context │
│ • Request details │
│ • Errors with stack traces │
│ • Audit events │
│ │
│ Use for: Debugging, auditing, investigation │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ TRACES: │
│ Request flow across services │
│ • End-to-end latency breakdown │
│ • Service dependencies │
│ • Bottleneck identification │
│ │
│ Use for: Distributed debugging, performance analysis │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ TOGETHER: │
│ Alert triggers (metric) → Dashboard context (metrics) │
│ → Investigate logs → Trace specific request │
└─────────────────────────────────────────────────────────────┘
Feature Observability
Monitoring in Features
OBSERVABILITY IN FEATURE TASKS:
┌─────────────────────────────────────────────────────────────┐
│ │
│ FEATURE WITH OBSERVABILITY: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ PROJ-200: Payment Processing ││
│ │ ││
│ │ FUNCTIONAL REQUIREMENTS: ││
│ │ ☐ Process credit card payments ││
│ │ ☐ Handle failures gracefully ││
│ │ ☐ Send confirmation email ││
│ │ ││
│ │ OBSERVABILITY REQUIREMENTS: ││
│ │ ││
│ │ METRICS: ││
│ │ ☐ payment_attempts_total (counter) ││
│ │ ☐ payment_success_total (counter) ││
│ │ ☐ payment_failure_total (counter, by reason) ││
│ │ ☐ payment_amount_total (counter) ││
│ │ ☐ payment_processing_duration (histogram) ││
│ │ ││
│ │ LOGS: ││
│ │ ☐ Payment initiated (user_id, amount) ││
│ │ ☐ Payment result (success/failure, reason) ││
│ │ ☐ NO sensitive data (card numbers) ││
│ │ ││
│ │ ALERTS: ││
│ │ ☐ Payment failure rate > 5% ││
│ │ ☐ Payment latency p99 > 5s ││
│ │ ☐ Payment gateway errors ││
│ │ ││
│ │ DASHBOARD: ││
│ │ ☐ Payment overview panel ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ DEFINITION OF DONE INCLUDES: │
│ ☐ Metrics exposed │
│ ☐ Logs structured with request context │
│ ☐ Alerts configured │
│ ☐ Dashboard updated │
└─────────────────────────────────────────────────────────────┘
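The metrics checklist for PROJ-200 can be sketched in code. This is a minimal, stdlib-only Python illustration, not a production setup: the `PaymentMetrics` class, the `charge` callable, and the bucket boundaries are hypothetical, and a real service would normally use a metrics client library instead of in-memory counters.

```python
import time
from collections import Counter
from typing import Callable

# Hypothetical histogram bucket upper bounds, in seconds.
BUCKETS = (0.1, 0.5, 1.0, 5.0, float("inf"))


class PaymentMetrics:
    """In-memory stand-in for the counters and histogram listed in PROJ-200."""

    def __init__(self) -> None:
        self.payment_attempts_total = 0
        self.payment_success_total = 0
        self.payment_failure_total = Counter()        # keyed by failure reason
        self.payment_amount_total = 0.0
        self.payment_processing_duration = Counter()  # keyed by bucket bound

    def observe_duration(self, seconds: float) -> None:
        # Simplified histogram: count the observation in the first bucket
        # that can hold it (buckets here are not cumulative).
        for bound in BUCKETS:
            if seconds <= bound:
                self.payment_processing_duration[bound] += 1
                break


def process_payment(metrics: PaymentMetrics, amount: float,
                    charge: Callable[[float], None]) -> bool:
    """Run `charge` and record attempt, outcome, amount, and duration."""
    metrics.payment_attempts_total += 1
    start = time.perf_counter()
    try:
        charge(amount)  # hypothetical gateway call; raises on failure
    except Exception as exc:
        # Use the exception class name as the failure reason label.
        metrics.payment_failure_total[type(exc).__name__] += 1
        return False
    else:
        metrics.payment_success_total += 1
        metrics.payment_amount_total += amount
        return True
    finally:
        # Runs on both success and failure, so every attempt is timed.
        metrics.observe_duration(time.perf_counter() - start)


def declined(_amount: float) -> None:
    raise ValueError("card declined")


metrics = PaymentMetrics()
process_payment(metrics, 99.99, lambda amount: None)  # succeeds
process_payment(metrics, 20.00, declined)             # fails
```

Recording the duration in `finally` means failed payments are timed too, which matters for the p99 latency alert because slow failures are often the interesting ones.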
Monitoring Projects
Dedicated Observability Work
OBSERVABILITY IMPROVEMENT EPIC:
┌─────────────────────────────────────────────────────────────┐
│ │
│ OBS-Q1: Q1 Observability Improvements │
│ │
│ GOAL: Reduce MTTR by 50% │
│ │
│ CURRENT STATE: │
│ • Average time to detect: 15 minutes │
│ • Average time to diagnose: 45 minutes │
│ • Many gaps in monitoring │
│ │
│ TARGET STATE: │
│ • Time to detect: < 5 minutes │
│ • Time to diagnose: < 20 minutes │
│ • Comprehensive coverage │
│ │
│ TASKS: │
│ ├── OBS-01: Audit current monitoring gaps │
│ ├── OBS-02: Add missing service metrics │
│ ├── OBS-03: Implement distributed tracing │
│ ├── OBS-04: Create service dashboards │
│ ├── OBS-05: Tune alert thresholds │
│ ├── OBS-06: Add SLO tracking │
│ └── OBS-07: Create runbooks for alerts │
└─────────────────────────────────────────────────────────────┘
Specific Tasks
MONITORING TASK EXAMPLES:
┌─────────────────────────────────────────────────────────────┐
│ │
│ DASHBOARD TASK: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ OBS-04a: Create API service dashboard ││
│ │ ││
│ │ PANELS: ││
│ │ ☐ Request rate (by endpoint) ││
│ │ ☐ Error rate (by status code) ││
│ │ ☐ Latency percentiles (p50, p95, p99) ││
│ │ ☐ Active connections ││
│ │ ☐ Resource usage (CPU, memory) ││
│ │ ☐ Dependency health (DB, cache, external) ││
│ │ ││
│ │ TIME RANGES: ││
│ │ ☐ Last 1 hour (default) ││
│ │ ☐ Last 24 hours ││
│ │ ☐ Last 7 days ││
│ │ ││
│ │ VARIABLES: ││
│ │ ☐ Environment (prod/staging) ││
│ │ ☐ Instance (for debugging) ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ ALERT TASK: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ OBS-05a: Configure API error rate alert ││
│ │ ││
│ │ ALERT DEFINITION: ││
│ │ Condition: Error rate > 1% for 5 minutes ││
│ │ Severity: Warning ││
│ │ ││
│ │ Condition: Error rate > 5% for 2 minutes ││
│ │ Severity: Critical ││
│ │ ││
│ │ NOTIFICATION: ││
│ │ Warning: Slack #alerts ││
│ │ Critical: Slack + PagerDuty ││
│ │ ││
│ │ RUNBOOK: ││
│ │ Link: docs/runbooks/api-error-rate.md ││
│ │ ││
│ │ TESTING: ││
│ │ ☐ Verify alert fires correctly ││
│ │ ☐ Verify notification reaches on-call ││
│ │ ☐ Test runbook steps ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
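The two-tier alert definition in OBS-05a can be expressed as data plus a small evaluation function. The sketch below is illustrative only: `AlertRule`, `evaluate`, and the one-sample-per-minute input are assumptions, and the notification targets are plain labels rather than real integrations. In practice the monitoring system evaluates these rules for you.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AlertRule:
    severity: str
    threshold: float   # error rate as a fraction, e.g. 0.01 for 1%
    for_minutes: int   # how long the condition must hold before firing


# The two conditions from OBS-05a, most severe first.
RULES = (
    AlertRule("critical", threshold=0.05, for_minutes=2),
    AlertRule("warning", threshold=0.01, for_minutes=5),
)

# Notification routing from the task, as hypothetical channel labels.
NOTIFY = {
    "warning": ("slack:#alerts",),
    "critical": ("slack:#alerts", "pagerduty"),
}


def evaluate(error_rates: Sequence[float],
             rules: Sequence[AlertRule] = RULES) -> Optional[str]:
    """Return the severity of the first rule that fires, or None.

    `error_rates` is assumed to hold one sample per minute, oldest first.
    A rule fires only when every sample in its trailing window exceeds
    the threshold, which mirrors the "for N minutes" wording.
    """
    for rule in rules:
        window = error_rates[-rule.for_minutes:]
        if len(window) == rule.for_minutes and all(
            rate > rule.threshold for rate in window
        ):
            return rule.severity
    return None


severity = evaluate([0.00, 0.02, 0.02, 0.02, 0.02, 0.02])
if severity is not None:
    print(severity, "->", ", ".join(NOTIFY[severity]))
```

Checking the critical rule first ensures that a severe spike pages on-call even when the warning condition is also true.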
Alert Management
Alert Hygiene
ALERT MAINTENANCE:
┌─────────────────────────────────────────────────────────────┐
│ │
│ ALERT PROBLEMS: │
│ │
│ ALERT FATIGUE: │
│ Too many alerts → Team ignores them │
│ │
│ NOISY ALERTS: │
│ False positives → Wasted investigation time │
│ │
│ MISSING ALERTS: │
│ Gaps in coverage → Issues not detected │
│ │
│ ─────────────────────────────────────────────────────────── │
│ │
│ ALERT REVIEW TASK (Monthly): │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ OBS-MAINT: Monthly alert review ││
│ │ ││
│ │ REVIEW EACH ALERT: ││
│ │ ││
│ │ How many times fired? _____ ││
│ │ How many actionable? _____ ││
│ │ How many false positives? _____ ││
│ │ ││
│ │ ACTION OPTIONS: ││
│ │ • Keep as-is ││
│ │ • Adjust threshold ││
│ │ • Improve detection logic ││
│ │ • Delete (if not useful) ││
│ │ • Demote severity ││
│ │ • Promote severity ││
│ │ ││
│ │ GAPS IDENTIFIED: ││
│ │ ☐ List missing alerts ││
│ │ ☐ Create tasks for new alerts ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ GOOD ALERT CRITERIA: │
│ ✅ Actionable (you know what to do) │
│ ✅ Urgent (needs attention now) │
│ ✅ Real (low false positive rate) │
│ ✅ Has runbook │
└─────────────────────────────────────────────────────────────┘
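The three review questions (times fired, actionable, false positives) lend themselves to a simple scoring pass before the team discusses each alert. The following Python sketch is one possible way to do that; the `AlertReview` class and both cutoff values are hypothetical starting points, not recommended thresholds, and the suggestion is only an input to the human review.

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    name: str
    fired: int
    actionable: int
    false_positives: int

    @property
    def actionable_ratio(self) -> float:
        return self.actionable / self.fired if self.fired else 0.0

    @property
    def false_positive_ratio(self) -> float:
        return self.false_positives / self.fired if self.fired else 0.0


def suggest_action(review: AlertReview,
                   max_false_positive: float = 0.2,  # assumed cutoff
                   min_actionable: float = 0.5       # assumed cutoff
                   ) -> str:
    """Map review counts to one of the action options from the review task."""
    if review.fired == 0:
        # No data this month; nothing here justifies a change.
        return "keep as-is"
    if review.false_positive_ratio > max_false_positive:
        return "adjust threshold"
    if review.actionable_ratio < min_actionable:
        return "demote severity"
    return "keep as-is"


reviews = [
    AlertReview("api-error-rate", fired=10, actionable=9, false_positives=1),
    AlertReview("disk-usage", fired=20, actionable=5, false_positives=12),
    AlertReview("cpu-high", fired=10, actionable=2, false_positives=1),
]
for review in reviews:
    print(f"{review.name}: {suggest_action(review)}")
```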
SLOs and SLIs
SLO Implementation
SLO TRACKING TASK:
┌─────────────────────────────────────────────────────────────┐
│ │
│ SLO DEFINITION TASK: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ OBS-SLO-01: Define API availability SLO ││
│ │ ││
│ │ SERVICE: API ││
│ │ ││
│ │ SLI (Service Level Indicator): ││
│ │ Successful requests / Total requests ││
│ │ (excluding 4xx client errors) ││
│ │ ││
│ │ SLO (Service Level Objective): ││
│ │ 99.9% availability over 30-day window ││
│ │ ││
│ │ ERROR BUDGET: ││
│ │ 0.1% = ~43 minutes downtime per month ││
│ │ ││
│ │ IMPLEMENTATION: ││
│ │ ☐ Define SLI query in Prometheus ││
│ │ ☐ Create SLO dashboard panel ││
│ │ ☐ Set up error budget tracking ││
│ │ ☐ Create burn rate alerts ││
│ │ ││
│ │ BURN RATE ALERTS: ││
│ │ • Fast burn: 14x rate for 1h → Page ││
│ │ • Slow burn: 2x rate for 6h → Warning ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ SLO DASHBOARD: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ API SLO Status ││
│ │ ││
│ │ Availability: 99.98% ✅ Within SLO ││
│ │ SLO Target: 99.9% ││
│ │ ││
│ │ Error Budget: ││
│ │ ████████████░░░ 78% remaining (33 min left) ││
│ │ ││
│ │ 30-day trend: ▁▂▁▁▃▁▁▁▂▁▁▁▁▂▁▁▅▁▁▁ ││
│ │ ↑ incident ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
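The error budget and burn rate figures in the SLO task follow from simple arithmetic. The Python sketch below works through them under stated assumptions: request counts stand in for the SLI query, and the function names and the way the two burn rate windows are passed in are hypothetical.

```python
from typing import Optional

WINDOW_DAYS = 30
SLO_TARGET = 0.999  # 99.9% availability


def error_budget_minutes(slo: float = SLO_TARGET,
                         window_days: int = WINDOW_DAYS) -> float:
    """Downtime allowed by the SLO over the window, in minutes."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(good: int, total: int, slo: float = SLO_TARGET) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A burn rate of 1 spends the budget exactly over the full window;
    14 spends it fourteen times faster.
    """
    if total == 0:
        return 0.0
    return (1 - good / total) / (1 - slo)


def budget_remaining(good: int, total: int, slo: float = SLO_TARGET) -> float:
    """Fraction of the error budget left, given counts for the whole window."""
    return 1 - burn_rate(good, total, slo)


def classify(burn_1h: float, burn_6h: float) -> Optional[str]:
    """Apply the two burn rate alerts from OBS-SLO-01."""
    if burn_1h >= 14:
        return "page"     # fast burn: 14x over 1h
    if burn_6h >= 2:
        return "warning"  # slow burn: 2x over 6h
    return None


print(f"Error budget: {error_budget_minutes():.1f} min per {WINDOW_DAYS} days")
print(f"Burn rate at 98.6% success: {burn_rate(986_000, 1_000_000):.1f}x")
```

With a 99.9% target the budget works out to 43.2 minutes per 30 days, which is the "~43 minutes" quoted in the task.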
Logging Standards
Structured Logging
LOG IMPROVEMENT TASK:
┌─────────────────────────────────────────────────────────────┐
│ │
│ OBS-LOG-01: Implement structured logging standard │
│ │
│ CURRENT STATE: │
│ • Inconsistent log formats │
│ • Missing context │
│ • Hard to search and correlate │
│ │
│ TARGET STATE: │
│ • JSON structured logs │
│ • Consistent fields │
│ • Request ID correlation │
│ │
│ STANDARD FORMAT: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ { ││
│ │ "timestamp": "2024-01-15T10:30:00Z", ││
│ │ "level": "info", ││
│ │ "message": "Payment processed", ││
│ │ "service": "api", ││
│ │ "request_id": "abc123", ││
│ │ "user_id": "user_456", ││
│ │ "duration_ms": 234, ││
│ │ "context": { ││
│ │ "payment_id": "pay_789", ││
│ │ "amount": 99.99 ││
│ │ } ││
│ │ } ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ IMPLEMENTATION: │
│ ☐ Update logging library configuration │
│ ☐ Add request ID middleware │
│ ☐ Update existing log statements │
│ ☐ Document logging standard │
│ ☐ Update log queries in dashboards │
└─────────────────────────────────────────────────────────────┘
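The standard format above can be produced with Python's standard `logging` module. This is a minimal sketch under assumptions: `JsonFormatter` and `request_id_var` are hypothetical names, the request ID is set by hand where a request ID middleware would normally set it, and the optional fields are passed through the `extra` argument.

```python
import contextvars
import json
import logging
import uuid
from datetime import datetime, timezone

# Holds the current request ID. A request ID middleware would set this
# once per incoming request so every log line can be correlated.
request_id_var = contextvars.ContextVar("request_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with consistent fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
            "request_id": request_id_var.get(),
        }
        # Optional fields supplied by the caller via `extra=`.
        for field in ("user_id", "duration_ms", "context"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(JsonFormatter(service="api"))
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

request_id_var.set(uuid.uuid4().hex)  # normally done by middleware
logger.info(
    "Payment processed",
    extra={
        "user_id": "user_456",
        "duration_ms": 234,
        "context": {"payment_id": "pay_789", "amount": 99.99},
    },
)
```

Keeping the request ID in a context variable means individual log statements do not have to pass it explicitly, which makes the "update existing log statements" step smaller.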