Observability Projects | Metrics, Logs & Alerts
Plan and implement monitoring with metrics, logs, traces, and SLOs. GitScrum tracks observability work alongside feature development for MTTR reduction.
Good monitoring surfaces problems early and speeds up debugging. GitScrum helps teams plan observability work and track monitoring improvements alongside feature development.
Observability Fundamentals
Three Pillars
OBSERVABILITY PILLARS:

  METRICS:
  Numerical measurements over time
  • Request count, latency, error rate
  • CPU, memory, disk usage
  • Business metrics (orders, signups)

  Use for: Dashboards, alerting, trends

  LOGS:
  Discrete events with context
  • Request details
  • Errors with stack traces
  • Audit events

  Use for: Debugging, auditing, investigation

  TRACES:
  Request flow across services
  • End-to-end latency breakdown
  • Service dependencies
  • Bottleneck identification

  Use for: Distributed debugging, performance analysis

  TOGETHER:
  Alert triggers (metric) → Dashboard context (metrics)
  → Investigate logs → Trace specific request
Feature Observability
Monitoring in Features
OBSERVABILITY IN FEATURE TASKS:

  FEATURE WITH OBSERVABILITY:

    PROJ-200: Payment Processing

    FUNCTIONAL REQUIREMENTS:
    ☐ Process credit card payments
    ☐ Handle failures gracefully
    ☐ Send confirmation email

    OBSERVABILITY REQUIREMENTS:

    METRICS:
    ☐ payment_attempts_total (counter)
    ☐ payment_success_total (counter)
    ☐ payment_failure_total (counter, by reason)
    ☐ payment_amount_total (counter)
    ☐ payment_processing_duration (histogram)

    LOGS:
    ☐ Payment initiated (user_id, amount)
    ☐ Payment result (success/failure, reason)
    ☐ NO sensitive data (card numbers)

    ALERTS:
    ☐ Payment failure rate > 5%
    ☐ Payment latency p99 > 5s
    ☐ Payment gateway errors

    DASHBOARD:
    ☐ Payment overview panel

  DEFINITION OF DONE INCLUDES:
  ☐ Metrics exposed
  ☐ Logs structured with request context
  ☐ Alerts configured
  ☐ Dashboard updated
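The two numeric alert conditions in PROJ-200 can be sketched as a check over a window of payment samples. This is a minimal illustration rather than a monitoring-system configuration: `PaymentSample`, `failure_rate`, `p99`, and `evaluate_alerts` are hypothetical names, and the nearest-rank percentile is an assumption (histogram backends usually estimate percentiles from buckets instead).

```python
from dataclasses import dataclass


@dataclass
class PaymentSample:
    """One payment attempt (hypothetical shape, for illustration only)."""
    success: bool
    duration_s: float


def failure_rate(samples: list[PaymentSample]) -> float:
    """Fraction of attempts that failed; 0.0 for an empty window."""
    if not samples:
        return 0.0
    failures = sum(1 for s in samples if not s.success)
    return failures / len(samples)


def p99(durations: list[float]) -> float:
    """Nearest-rank 99th percentile; 0.0 for an empty window."""
    if not durations:
        return 0.0
    ordered = sorted(durations)
    # Integer ceiling of 0.99 * n, giving a 1-based rank.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def evaluate_alerts(samples: list[PaymentSample]) -> list[str]:
    """Return the names of the alert conditions that currently hold."""
    firing = []
    if failure_rate(samples) > 0.05:  # "Payment failure rate > 5%"
        firing.append("payment_failure_rate")
    if p99([s.duration_s for s in samples]) > 5.0:  # "Payment latency p99 > 5s"
        firing.append("payment_latency_p99")
    return firing
```

A window of 94 successes and 6 failures gives a 6% failure rate, so only the failure-rate condition holds. The third alert (payment gateway errors) has no threshold in the task, so it is left out of the sketch.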
Monitoring Projects
Dedicated Observability Work
OBSERVABILITY IMPROVEMENT EPIC:

  OBS-Q1: Q1 Observability Improvements

  GOAL: Reduce MTTR by 50%

  CURRENT STATE:
  • Average time to detect: 15 minutes
  • Average time to diagnose: 45 minutes
  • Many gaps in monitoring

  TARGET STATE:
  • Time to detect: < 5 minutes
  • Time to diagnose: < 20 minutes
  • Comprehensive coverage

  TASKS:
  ├── OBS-01: Audit current monitoring gaps
  ├── OBS-02: Add missing service metrics
  ├── OBS-03: Implement distributed tracing
  ├── OBS-04: Create service dashboards
  ├── OBS-05: Tune alert thresholds
  ├── OBS-06: Add SLO tracking
  └── OBS-07: Create runbooks for alerts
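The OBS-Q1 targets can be checked against the 50% goal with a little arithmetic. The sketch below assumes that only detection and diagnosis time are compared, since the epic gives no repair-time figures; `reduction_pct` is a hypothetical helper name.

```python
def reduction_pct(before_min: float, after_min: float) -> float:
    """Percentage reduction when going from before_min to after_min."""
    return (before_min - after_min) / before_min * 100


# Figures from the epic, in minutes.
current_detect, current_diagnose = 15, 45
target_detect, target_diagnose = 5, 20  # upper bounds ("< 5", "< 20")

current_total = current_detect + current_diagnose  # 60
target_total = target_detect + target_diagnose     # 25

detect_and_diagnose_reduction = reduction_pct(current_total, target_total)
print(f"{detect_and_diagnose_reduction:.1f}% reduction")  # 58.3% reduction
```

Meeting both targets cuts detect-plus-diagnose time by roughly 58% or more. Whether total MTTR falls by 50% also depends on repair time, which the epic does not quantify.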
Specific Tasks
MONITORING TASK EXAMPLES:

  DASHBOARD TASK:

    OBS-04a: Create API service dashboard

    PANELS:
    ☐ Request rate (by endpoint)
    ☐ Error rate (by status code)
    ☐ Latency percentiles (p50, p95, p99)
    ☐ Active connections
    ☐ Resource usage (CPU, memory)
    ☐ Dependency health (DB, cache, external)

    TIME RANGES:
    ☐ Last 1 hour (default)
    ☐ Last 24 hours
    ☐ Last 7 days

    VARIABLES:
    ☐ Environment (prod/staging)
    ☐ Instance (for debugging)

  ALERT TASK:

    OBS-05a: Configure API error rate alert

    ALERT DEFINITION:
    Condition: Error rate > 1% for 5 minutes
    Severity: Warning

    Condition: Error rate > 5% for 2 minutes
    Severity: Critical

    NOTIFICATION:
    Warning: Slack #alerts
    Critical: Slack + PagerDuty

    RUNBOOK:
    Link: docs/runbooks/api-error-rate.md

    TESTING:
    ☐ Verify alert fires correctly
    ☐ Verify notification reaches on-call
    ☐ Test runbook steps
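The "threshold held for N minutes" conditions in OBS-05a can be sketched as follows. This is an illustration under stated assumptions, not an alerting-system rule file: it assumes one error-rate sample per minute, and `RULES`, `ROUTES`, `severity`, and `notify_targets` are hypothetical names, with placeholder route strings standing in for the Slack and PagerDuty integrations.

```python
from typing import Optional

# (error-rate threshold, minutes it must hold, severity), most severe first.
RULES = [
    (0.05, 2, "critical"),  # Error rate > 5% for 2 minutes
    (0.01, 5, "warning"),   # Error rate > 1% for 5 minutes
]

# Placeholder route names; real integrations are outside this sketch.
ROUTES = {
    "warning": ["slack:#alerts"],
    "critical": ["slack:#alerts", "pagerduty"],
}


def severity(error_rates: list[float]) -> Optional[str]:
    """Return the highest severity whose condition has held long enough.

    error_rates holds one sample per minute, oldest first.
    """
    for threshold, minutes, level in RULES:
        window = error_rates[-minutes:]
        # The condition must hold for every sample in a full window.
        if len(window) == minutes and all(r > threshold for r in window):
            return level
    return None


def notify_targets(level: Optional[str]) -> list[str]:
    """Map a severity to its notification channels (empty if not firing)."""
    return ROUTES.get(level, []) if level else []
```

Requiring the condition to hold across the whole window is what keeps a single bad minute from paging anyone, which is also why the TESTING checklist needs a sustained error burst to verify that the alert fires.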
Alert Management
Alert Hygiene
ALERT MAINTENANCE:

  ALERT PROBLEMS:

  ALERT FATIGUE:
  Too many alerts → Team ignores them

  NOISY ALERTS:
  False positives → Wasted investigation time

  MISSING ALERTS:
  Gaps in coverage → Issues not detected

  ALERT REVIEW TASK (Monthly):

    OBS-MAINT: Monthly alert review

    REVIEW EACH ALERT:

    How many times fired? _____
    How many actionable? _____
    How many false positives? _____

    ACTION OPTIONS:
    • Keep as-is
    • Adjust threshold
    • Improve detection logic
    • Delete (if not useful)
    • Demote severity
    • Promote severity

    GAPS IDENTIFIED:
    ☐ List missing alerts
    ☐ Create tasks for new alerts

  GOOD ALERT CRITERIA:
  ✅ Actionable (you know what to do)
  ✅ Urgent (needs attention now)
  ✅ Real (low false positive rate)
  ✅ Has runbook
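The three review counts can be turned into a first-pass suggestion from the action options above. The sketch below is only a starting point for the monthly discussion: `AlertStats` and `suggest_action` are hypothetical names, and the 20% false-positive and 50% actionable cutoffs are illustrative assumptions, not values from the review task.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    """Answers to the three review questions for one alert."""
    name: str
    fired: int
    actionable: int
    false_positives: int


def suggest_action(
    stats: AlertStats,
    max_false_positive_ratio: float = 0.2,  # assumed cutoff
    min_actionable_ratio: float = 0.5,      # assumed cutoff
) -> str:
    """Suggest one of the review's action options from the counts."""
    if stats.fired == 0:
        # Never fired this period: the counts give nothing to judge.
        return "keep as-is"
    false_positive_ratio = stats.false_positives / stats.fired
    actionable_ratio = stats.actionable / stats.fired
    if actionable_ratio == 0:
        return "delete"
    if false_positive_ratio > max_false_positive_ratio:
        return "adjust threshold"
    if actionable_ratio < min_actionable_ratio:
        return "demote severity"
    return "keep as-is"
```

For example, an alert that fired 10 times with 6 actionable and 4 false positives comes back as "adjust threshold". The options that need human judgment (improve detection logic, promote severity) are deliberately not automated here.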
SLOs and SLIs
SLO Implementation
SLO TRACKING TASK:

  SLO DEFINITION TASK:

    OBS-SLO-01: Define API availability SLO

    SERVICE: API

    SLI (Service Level Indicator):
    Successful requests / Total requests
    (excluding 4xx client errors)

    SLO (Service Level Objective):
    99.9% availability over 30-day window

    ERROR BUDGET:
    0.1% = ~43 minutes downtime per month

    IMPLEMENTATION:
    ☐ Define SLI query in Prometheus
    ☐ Create SLO dashboard panel
    ☐ Set up error budget tracking
    ☐ Create burn rate alerts

    BURN RATE ALERTS:
    • Fast burn: 14x rate for 1h → Page
    • Slow burn: 2x rate for 6h → Warning
  SLO DASHBOARD:

    API SLO Status

    Availability: 99.98% ✅ Within SLO
    SLO Target: 99.9%

    Error Budget:
    ████████████░░░ 78% remaining (33 min left)

    30-day trend: (sparkline, with one dip marked "↑ incident")
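The error budget and burn-rate figures follow from the SLO with simple arithmetic, sketched below. The function names are hypothetical, and converting a request-based SLI into "minutes of downtime" assumes roughly uniform traffic across the window.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full outage allowed by the SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_rate / (1 - slo)


def budget_consumed(rate: float, hours: float, window_days: int) -> float:
    """Fraction of the window's error budget spent at `rate` for `hours`."""
    return rate * hours / (window_days * 24)


budget = error_budget_minutes(0.999, 30)  # about 43.2 minutes
fast = budget_consumed(14, 1, 30)         # 14x for 1h: about 1.9% of budget
slow = budget_consumed(2, 6, 30)          # 2x for 6h: about 1.7% of budget
remaining_minutes = 0.78 * budget         # 78% remaining: about 33.7 minutes
```

At a sustained 14x burn rate the whole 30-day budget would be gone in 720 / 14, or about 51 hours, which is why the fast-burn alert pages rather than warns.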
Logging Standards
Structured Logging
LOG IMPROVEMENT TASK:

  OBS-LOG-01: Implement structured logging standard

  CURRENT STATE:
  • Inconsistent log formats
  • Missing context
  • Hard to search and correlate

  TARGET STATE:
  • JSON structured logs
  • Consistent fields
  • Request ID correlation

  STANDARD FORMAT:

    {
      "timestamp": "2024-01-15T10:30:00Z",
      "level": "info",
      "message": "Payment processed",
      "service": "api",
      "request_id": "abc123",
      "user_id": "user_456",
      "duration_ms": 234,
      "context": {
        "payment_id": "pay_789",
        "amount": 99.99
      }
    }

  IMPLEMENTATION:
  ☐ Update logging library configuration
  ☐ Add request ID middleware
  ☐ Update existing log statements
  ☐ Document logging standard
  ☐ Update log queries in dashboards
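One way to produce the standard format with Python's built-in `logging` module is sketched below. It is a minimal illustration, assuming the per-request fields are passed through the `extra=` argument; `JsonFormatter` and `build_logger` are hypothetical names, and the request ID middleware from the task list is not shown.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    OPTIONAL_FIELDS = ("request_id", "user_id", "duration_ms", "context")

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Fields supplied via `extra=` become attributes on the record.
        for field in self.OPTIONAL_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Create a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger(f"structured.{service}")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as `build_logger("api").info("Payment processed", extra={"request_id": "abc123", "user_id": "user_456", "duration_ms": 234, "context": {"payment_id": "pay_789", "amount": 99.99}})` would emit a line with the same fields as the standard format. Keeping the optional fields in one tuple makes the "consistent fields" goal something the formatter enforces instead of each call site.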