07 — Reliability & Operations MOC

← Back to Software Engineering - Map of Content

Keeping systems running, performant, and recoverable. The difference between a demo and production is reliability.


Observability

The Three Pillars

  • Logs — Discrete events with context, structured (JSON) vs unstructured
  • Metrics — Numeric measurements over time (counters, gauges, histograms)
  • Traces — End-to-end request path through distributed services

Logging

  • Structured Logging — Key-value pairs, JSON format, machine-parseable (sketch below)
  • Log Levels — DEBUG, INFO, WARN, ERROR, FATAL
  • Correlation IDs / Request IDs — Trace a request across services
  • Log Aggregation — ELK Stack (Elasticsearch, Logstash, Kibana), Loki + Grafana, Splunk, Datadog
  • Sampling — Log a percentage to manage volume at scale
  • Best Practices — Don’t log PII, include context, structured over printf
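
A minimal sketch of structured logging with a correlation ID, using only the Python standard library (the JsonFormatter class and the "checkout" logger name are illustrative):

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line: machine-parseable, greppable."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # correlation ID attached per-call via the `extra` kwarg
            "request_id": getattr(record, "request_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# The same request_id on every line lets you follow one request across services.
log.info("payment authorized", extra={"request_id": str(uuid.uuid4())})
```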

Metrics

  • Metric Types — Counter (monotonically increasing), Gauge (current value), Histogram (distribution), Summary
  • RED Method — Rate, Errors, Duration (for request-driven services; instrumentation sketch below)
  • USE Method — Utilization, Saturation, Errors (for resources: CPU, memory, disk, network)
  • Four Golden Signals (Google SRE) — Latency, Traffic, Errors, Saturation
  • Tools — Prometheus, Grafana, Datadog, CloudWatch, StatsD, InfluxDB
  • Percentiles — p50, p95, p99, p99.9 — more informative than averages, which hide tail latency
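
A sketch of RED-method instrumentation using the prometheus_client library (metric and route names are illustrative):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Requests served", ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request duration", ["route"])

def handle(route):
    with LATENCY.labels(route=route).time():        # Duration
        time.sleep(random.uniform(0.01, 0.05))      # stand-in for real work
        status = "500" if random.random() < 0.01 else "200"
    REQUESTS.labels(route=route, status=status).inc()  # Rate and Errors

start_http_server(8000)  # exposes /metrics for Prometheus to scrape
handle("/checkout")
```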

Distributed Tracing

  • Concepts — Spans, traces, parent-child relationships, baggage/context propagation
  • OpenTelemetry — Vendor-neutral standard for traces, metrics, and logs (sketch below)
  • Jaeger — Open-source distributed tracing
  • Zipkin — Another open-source tracing system
  • AWS X-Ray, Datadog APM — Managed tracing solutions
  • Trace Sampling — Head-based vs tail-based sampling
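
A minimal parent/child span sketch with the OpenTelemetry Python SDK, exporting to the console rather than a real backend such as Jaeger (service and attribute names are illustrative):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("checkout") as span:  # root span
    span.set_attribute("user.id", "u-123")
    with tracer.start_as_current_span("charge-card"):   # child via context propagation
        pass  # call the payment service here
```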

Dashboards & Alerting

  • Dashboard Design — Hierarchy of information, actionable views, not vanity metrics
  • Grafana — Open-source dashboarding, data source agnostic
  • Alerting Philosophy — Alert on symptoms not causes, reduce noise, actionable alerts
  • Alert Fatigue — Too many alerts → ignored alerts → missed real incidents
  • PagerDuty / OpsGenie / Incident.io — Alert routing, escalation, on-call management

Incident Management

On-Call

  • On-Call Rotations — Primary/secondary, follow-the-sun, fair scheduling
  • Runbooks — Step-by-step guides for common incidents
  • Escalation Policies — When and how to escalate
  • Toil — Repetitive operational work that should be automated

Incident Response

  • Severity Levels — SEV1 (critical) through SEV4 (minor)
  • Incident Commander — Central coordination role during incidents
  • Communication — Status pages, internal updates, customer communication
  • Mitigation vs Root Cause Fix — Stop the bleeding first, fix properly later
  • War Rooms — Real-time coordination (Slack channels, video calls)

Post-Incident Process

  • Blameless Post-Mortems — Focus on systems and processes, not individuals
  • Post-Mortem Template — Summary, timeline, impact, root cause, contributing factors, action items
  • 5 Whys — Root cause analysis technique
  • Action Items — Concrete, assigned, tracked, prioritized
  • Learning Culture — Share post-mortems widely, celebrate learning from failure

Site Reliability Engineering

SLAs, SLOs, SLIs

  • SLI (Service Level Indicator) — The measurable metric itself (e.g., p99 request latency, error rate)
  • SLO (Service Level Objective) — Target for an SLI (e.g., 99.9% of requests < 200ms)
  • SLA (Service Level Agreement) — Contractual commitment with consequences
  • Choosing SLOs — Based on user expectations, not technical capability
  • The Nines — 99.9% = 8.76h downtime/year, 99.99% = 52.6min/year
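
The arithmetic behind the nines, as a quick worked example:

```python
# Allowed downtime per year for a given availability target.
for slo in (0.999, 0.9999, 0.99999):
    minutes = (1 - slo) * 365 * 24 * 60
    print(f"{slo:.3%} -> {minutes / 60:.2f} h/yr ({minutes:.1f} min/yr)")
# 99.900% -> 8.76 h/yr (525.6 min/yr)
# 99.990% -> 0.88 h/yr (52.6 min/yr)
# 99.999% -> 0.09 h/yr (5.3 min/yr)
```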

Error Budgets

  • Concept — Allowed unreliability = 1 - SLO target
  • Budget Consumption — Track error budget burn rate
  • Budget Policy — When budget is exhausted → focus on reliability over features
  • Burn Rate Alerts — Alert when burning budget too fast
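
A sketch of the burn-rate idea (14.4 is the common fast-burn page threshold from the Google SRE workbook):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'sustainable' the error budget is burning."""
    budget = 1 - slo        # allowed unreliability, e.g. 0.1% for a 99.9% SLO
    return error_rate / budget

# Burning at 14.4x exhausts a 30-day budget in ~2 days -> page immediately.
print(burn_rate(error_rate=0.0144, slo=0.999))  # 14.4
```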

Toil Management

  • Definition — Manual, repetitive, automatable, tactical, no enduring value
  • Toil Budget — SRE teams aim for < 50% toil
  • Automation — Self-healing systems, auto-remediation, infrastructure as code
  • Elimination Strategies — Automate, simplify, eliminate the need

Reliability Patterns

  • Graceful Degradation — Reduced functionality instead of total failure
  • Circuit Breakers — Stop calling failing services (see Distributed Systems)
  • Bulkheads — Isolate failures to prevent cascading
  • Timeouts — Always set timeouts, use adaptive timeouts
  • Retries — Exponential backoff with jitter, idempotency requirement (sketch below)
  • Rate Limiting — Protect services from overload (see API Design)
  • Health Checks — Liveness probes, readiness probes, startup probes
  • Redundancy — No single points of failure, multi-AZ, multi-region
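
As referenced in the Retries bullet, a minimal sketch of exponential backoff with full jitter (safe only for idempotent operations):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base=0.1, cap=5.0):
    """Retry an idempotent call; full jitter avoids synchronized retry storms."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                 # out of attempts: surface the failure
            # sleep a random amount up to the (capped) exponential ceiling
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```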

SRE Team Models

  • Embedded SRE — SREs sit within product teams, deep product context
  • Centralized / Platform SRE — Shared SRE team provides tools and platforms for all teams
  • Consulting SRE — SREs advise teams on reliability, teams own their services
  • SRE vs DevOps — SRE is a specific implementation of DevOps principles with prescriptive practices
  • Service Ownership — “You build it, you run it” — teams own production behavior of their services
  • Production Readiness Reviews (PRR) — Checklist before a new service goes to production (monitoring, alerting, runbooks, load testing, failover)

Disaster Recovery & Business Continuity

Backup & Restore

  • Backup Strategies — Full, incremental, differential backups
  • RPO (Recovery Point Objective) — Maximum acceptable data loss (time-based)
  • RTO (Recovery Time Objective) — Maximum acceptable downtime
  • Backup Testing — Regular restore drills, verify backups actually work
  • Point-in-Time Recovery (PITR) — Database WAL replay, continuous archiving

Failover Strategies

  • Active-Passive — Standby takes over on primary failure, cold/warm/hot standby
  • Active-Active — Multiple active regions, traffic split, conflict resolution needed
  • DNS Failover — Route53 health checks, failover routing policies
  • Database Failover — Automatic promotion of replicas, split-brain prevention
  • Multi-Region — Cross-region replication, regional isolation, global load balancing

Disaster Recovery Tiers

  • Backup & Restore — Cheapest, slowest recovery (hours)
  • Pilot Light — Minimal always-on infrastructure, scale up on failure
  • Warm Standby — Scaled-down copy of production, faster failover (minutes)
  • Multi-Site Active-Active — Full redundancy, near-zero downtime, highest cost

Business Continuity

  • DR Runbooks — Step-by-step recovery procedures per service
  • DR Drills — Regular simulated failovers, tabletop exercises
  • Dependency Mapping — Know your critical path and upstream/downstream dependencies
  • Blast Radius Reduction — Cell-based architecture, shuffle sharding, failure domains

Traffic Management

Traffic Shifting

  • Weighted Routing — Gradually shift traffic between deployments (see CI-CD; sketch below)
  • Traffic Mirroring / Shadowing — Duplicate production traffic to test environment without affecting users
  • Dark Traffic — Send real traffic to the new version, discard its responses, and compare them offline against the current version
  • Geographic Routing — Route users to nearest region, latency-based routing
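
A sketch of weighted routing, splitting traffic 95/5 between stable and canary (backend names are illustrative):

```python
import random

WEIGHTS = {"stable-v1": 95, "canary-v2": 5}  # raise the canary weight gradually

def pick_backend(weights):
    backends = list(weights)
    return random.choices(backends, weights=[weights[b] for b in backends])[0]

print(pick_backend(WEIGHTS))
```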

Load Shedding

  • Priority-Based Shedding — Drop low-priority requests under load to protect critical paths (sketch below)
  • Admission Control — Reject requests early when system is at capacity
  • Throttling — Per-client rate limits, adaptive throttling based on system health
  • Backpressure Propagation — Signal upstream to slow down rather than buffering until OOM
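
A sketch combining admission control with priority-based shedding; the 20% critical-only capacity reserve is an illustrative policy:

```python
import threading

class AdmissionController:
    """Reject low-priority work first as in-flight requests approach capacity."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.inflight = 0
        self.lock = threading.Lock()

    def try_admit(self, priority: int) -> bool:
        # priority 0 = critical, 1 = normal, 2 = batch
        with self.lock:
            # non-critical traffic may only use 80% of capacity,
            # reserving the rest for the critical path
            limit = self.capacity if priority == 0 else int(self.capacity * 0.8)
            if self.inflight >= limit:
                return False      # shed: caller returns 429/503 immediately
            self.inflight += 1
            return True

    def release(self) -> None:
        with self.lock:
            self.inflight -= 1
```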

Configuration Management in Production

  • Feature Flags — Runtime configuration changes without deploys (see CI-CD; rollout sketch below)
  • Dynamic Configuration — Config servers (Consul, etcd), runtime tuning without restarts
  • Configuration Drift — Detect and remediate drift between intended and actual state
  • Secrets Rotation — Automated rotation of credentials, certificates, API keys
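
A sketch of a deterministic percent-rollout flag; in practice rollout_pct would be fetched at runtime from a config server such as Consul or etcd rather than hard-coded:

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_pct: int) -> bool:
    """Deterministic bucketing: the same user always gets the same verdict."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_pct

# Serve the new code path to 10% of users, changeable without a deploy.
if flag_enabled("new-checkout", "user-42", rollout_pct=10):
    ...  # new path
```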

Performance Engineering

Profiling

  • CPU Profiling — Flame graphs, sampling profilers (perf, pprof, py-spy, async-profiler); sketch below
  • Memory Profiling — Heap analysis, memory leaks, allocation tracking
  • I/O Profiling — Disk I/O, network I/O, database query analysis
  • Application Performance Monitoring (APM) — Datadog, New Relic, Dynatrace
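
Sampling profilers like py-spy attach to a live process (`py-spy top --pid <PID>`); for offline analysis, a quick sketch with the stdlib's deterministic profiler (the workload function is illustrative):

```python
import cProfile
import pstats

def hot_function():
    return sum(i * i for i in range(100_000))

cProfile.run("hot_function()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)
```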

Benchmarking

  • Micro-Benchmarks — Test individual functions (JMH for Java, BenchmarkDotNet, criterion for Rust); sketch below
  • Macro-Benchmarks — System-level performance testing
  • Statistical Rigor — Warm-up, iterations, percentiles, variance
  • Comparison Testing — A/B performance comparisons, regression detection
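
A sketch of the statistical-rigor points: warm up first, run many iterations, and report percentiles and variance rather than a single mean:

```python
import statistics
import time

def bench(fn, warmup=100, iters=1_000):
    for _ in range(warmup):                      # let caches warm up
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return {
        "p50": statistics.median(samples),
        "p95": samples[int(iters * 0.95)],
        "p99": samples[int(iters * 0.99)],
        "stdev": statistics.stdev(samples),      # variance check
    }

print(bench(lambda: sorted(range(1_000))))
```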

Load Testing

  • Tools — k6, Locust, Gatling, JMeter, Artillery, wrk (Locust sketch below)
  • Test Types — Load test, stress test, soak test, spike test, breakpoint test
  • Traffic Patterns — Ramp-up, steady state, spike, diurnal patterns
  • Metrics to Watch — Latency percentiles, throughput, error rate, resource utilization
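
A minimal Locust scenario as a sketch (the endpoints are illustrative); run with `locust -f loadtest.py` and ramp users from the web UI:

```python
from locust import HttpUser, between, task

class ShopUser(HttpUser):
    wait_time = between(1, 3)   # think time between requests

    @task(9)                    # ~90% of traffic: browse
    def browse(self):
        self.client.get("/products")

    @task(1)                    # ~10% of traffic: checkout
    def checkout(self):
        self.client.post("/checkout", json={"cart_id": "c-1"})
```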

Capacity Planning

  • Forecasting — Historical trends, growth projections
  • Headroom — Buffer for traffic spikes, typically 2-3x average
  • Auto-Scaling — Horizontal (add instances), vertical (bigger instances)
  • Cost Optimization — Right-sizing, reserved instances, spot instances, serverless economics
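
A back-of-envelope capacity plan tying these bullets together (all numbers are illustrative):

```python
avg_rps = 2_000             # current average request rate
peak_to_avg = 3.0           # observed diurnal peak vs. average
yoy_growth = 1.4            # 40% projected annual growth
headroom = 1.5              # buffer for spikes / loss of one AZ
per_instance_rps = 500      # measured via load testing

instances = avg_rps * peak_to_avg * yoy_growth * headroom / per_instance_rps
print(f"instances needed next year: {instances:.0f}")  # 25
```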

Chaos Engineering

  • Principles — Define steady state, hypothesize, introduce failures, observe
  • Chaos Monkey — Randomly terminate instances (Netflix)
  • Litmus Chaos / Chaos Mesh — Kubernetes-native chaos tools
  • Game Days — Planned chaos experiments with the team
  • Failure Injection — Network partitions, latency injection, disk failure, DNS failure
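
A sketch of application-level failure injection for a game day; run it only against a defined steady-state hypothesis and with a kill switch:

```python
import random
import time

def chaos(fn, failure_rate=0.05, max_extra_latency=0.5):
    """Wrap a dependency call with injected latency and failures."""
    def wrapped(*args, **kwargs):
        time.sleep(random.uniform(0, max_extra_latency))  # latency injection
        if random.random() < failure_rate:
            raise ConnectionError("injected failure")     # simulated outage
        return fn(*args, **kwargs)
    return wrapped

flaky_payment_call = chaos(lambda: "ok")   # stand-in for a real client call
```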

Performance Optimization Patterns

  • Caching — See Caching in Data Management
  • Connection Pooling — Reuse database/HTTP connections (sketch below)
  • Async Processing — Move work off the hot path, use queues
  • Batch Processing — Amortize overhead across multiple items
  • Denormalization — Trade storage for read performance
  • Read Replicas — Scale reads horizontally
  • Compression — gzip, Brotli, zstd for network transfer
  • Lazy Loading — Load data only when needed
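
As referenced in the connection-pooling bullet, a sketch with requests: one long-lived Session reuses TCP/TLS connections via its pool instead of re-handshaking per request (the host is illustrative):

```python
import requests

session = requests.Session()  # holds a connection pool for the process

def fetch(path: str) -> requests.Response:
    # timeout on every call, per the Timeouts reliability pattern
    return session.get(f"https://api.example.com{path}", timeout=2.0)
```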

Cost Engineering / FinOps

Cloud Cost Fundamentals

  • Pay-Per-Use vs Reserved vs Spot — On-demand (flexible, expensive), reserved (1-3yr commitment, 30-60% savings), spot/preemptible (up to 90% savings, can be interrupted)
  • Cost Allocation Tags — Tag resources by team, project, environment; required for cost visibility and showback/chargeback
  • Cost Anomaly Detection — Automated alerts for unexpected spending spikes, AWS Cost Anomaly Detection, budget alerts

Right-Sizing

  • Instance Type Selection — Match instance family to workload (compute-optimized, memory-optimized, general purpose, GPU)
  • CPU / Memory Utilization Analysis — Most instances run at 10-30% utilization; downsizing often has no performance impact
  • Auto-Scaling Efficiency — Scale-to-zero for dev/staging, aggressive scale-down policies, scheduled scaling for predictable patterns

Reserved Capacity

  • Reserved Instances (AWS) — Standard vs convertible, all-upfront/partial/no-upfront, 1yr vs 3yr, scope (regional vs zonal)
  • Savings Plans (AWS) — Compute Savings Plans (flexible across instance types), EC2 Instance Savings Plans (specific family)
  • Committed Use Discounts (GCP) — Spend-based or resource-based commitments, automatic application
  • Commitment Tradeoffs — Lock-in risk, capacity planning accuracy required, mix of reserved + on-demand for flexibility

Spot / Preemptible

  • Fault-Tolerant Workloads — Batch processing, CI/CD runners, stateless web workers, data processing
  • Spot Interruption Handling — 2-minute warning (AWS), graceful shutdown, checkpointing, instance diversification
  • Mixed Instance Strategies — Spot Fleet, mixed instance policies, fallback to on-demand, Karpenter (Kubernetes)

Storage Cost Optimization

  • Lifecycle Policies — Auto-transition objects to cheaper tiers (S3 Standard → Infrequent Access → Glacier); sketch below
  • Storage Class Transitions — Hot/warm/cold/archive tiers, retrieval costs, minimum storage duration charges
  • Data Transfer Costs — Egress fees (often the hidden cost), cross-region transfer, NAT Gateway costs, VPC endpoints to avoid transfer fees
  • Compression — Compress data at rest and in transit, columnar formats (Parquet) for analytics, gzip/zstd for logs
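
A sketch of an S3 lifecycle policy with boto3 that tiers log objects down and then expires them (bucket name, prefix, and day counts are illustrative; IAM permissions assumed):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-archive",
    LifecycleConfiguration={"Rules": [{
        "ID": "tier-then-expire-logs",
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm tier
            {"Days": 90, "StorageClass": "GLACIER"},      # cold tier
        ],
        "Expiration": {"Days": 365},                      # delete after a year
    }]},
)
```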

Unit Economics

  • Cost Per Request — Total infrastructure cost / total requests, track over time (calculation below)
  • Cost Per User — Infrastructure cost / MAU, benchmark against revenue per user
  • Cost Per Transaction — Database + compute + storage cost per business transaction
  • Tracking Cost Efficiency — Cost per unit of value should decrease over time as you scale and optimize
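
The unit-economics calculations are simple division; the discipline is tracking them over time (all numbers are illustrative):

```python
monthly_infra_cost = 42_000        # USD, from cost-allocation tags
monthly_requests = 1.2e9
monthly_active_users = 350_000

print(f"cost per 1M requests: ${monthly_infra_cost / (monthly_requests / 1e6):.2f}")  # $35.00
print(f"cost per MAU:         ${monthly_infra_cost / monthly_active_users:.3f}")      # $0.120
```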

FinOps Practices

  • Cost Visibility — Dashboards showing real-time spend by team/service/environment, trend analysis
  • Showback / Chargeback — Showback: show teams their costs (awareness). Chargeback: bill teams for their usage (accountability).
  • Budgets and Alerts — Per-team budgets, threshold alerts (50%, 80%, 100%), forecasted overspend alerts
  • Tagging Strategy — Mandatory tags (team, environment, project, cost-center), tag enforcement policies, untagged resource reports
  • FinOps Team Models — Centralized FinOps team, federated (embedded in engineering), hybrid; FinOps Foundation maturity model (crawl, walk, run)
  • Tools — CloudHealth (VMware), Kubecost (Kubernetes cost allocation), Infracost (IaC cost estimation in PRs), AWS Cost Explorer, GCP Billing Reports, Vantage, Spot.io

reliability sre operations observability performance disaster-recovery traffic-management finops cost-engineering