07 — Reliability & Operations MOC
← Back to Software Engineering - Map of Content
Keeping systems running, performant, and recoverable. The difference between a demo and production is reliability.
Observability
The Three Pillars
- Logs — Discrete events with context, structured (JSON) vs unstructured
- Metrics — Numeric measurements over time (counters, gauges, histograms)
- Traces — End-to-end request path through distributed services
Logging
- Structured Logging — Key-value pairs, JSON format, machine-parseable
- Log Levels — DEBUG, INFO, WARN, ERROR, FATAL
- Correlation IDs / Request IDs — Trace a request across services
- Log Aggregation — ELK Stack (Elasticsearch, Logstash, Kibana), Loki + Grafana, Splunk, Datadog
- Sampling — Log a percentage to manage volume at scale
- Best Practices — Don’t log PII, include request context, prefer structured fields over printf-style messages (see the sketch after this list)
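A minimal sketch of structured logging with a correlation ID using only Python's standard library; the `JsonFormatter` class, the field names, and the `checkout` logger are illustrative choices, not a prescribed format.

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so aggregators can index fields."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via the `extra=` kwarg land as attributes on the record.
        for field in ("correlation_id", "route", "user_id"):
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Generate (or propagate from the incoming request) a correlation ID so the
# same request can be followed across services.
correlation_id = str(uuid.uuid4())
logger.info("payment authorized",
            extra={"correlation_id": correlation_id, "route": "/checkout"})
```

Any aggregator that ingests JSON (Loki, Elasticsearch, Datadog) can then filter on `correlation_id` to reassemble a single request's journey.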
Metrics
- Metric Types — Counter (monotonically increasing), Gauge (current value), Histogram (distribution), Summary (client-side quantiles)
- RED Method — Rate, Errors, Duration (for request-driven services)
- USE Method — Utilization, Saturation, Errors (for resources: CPU, memory, disk, network)
- Four Golden Signals (Google SRE) — Latency, Traffic, Errors, Saturation
- Tools — Prometheus, Grafana, Datadog, CloudWatch, StatsD, InfluxDB
- Percentiles — p50, p95, p99, p99.9 — more informative than averages
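A sketch of RED-style instrumentation with the `prometheus_client` Python library; the metric names, label sets, and bucket boundaries are illustrative. Rate and error rate come from the counter, and latency percentiles (p95, p99) are computed server-side from the histogram buckets, e.g. with PromQL's `histogram_quantile`.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request duration",
                    ["route"],
                    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0))

def handle_request(route: str) -> None:
    start = time.perf_counter()
    status = "200" if random.random() > 0.01 else "500"   # simulated outcome
    LATENCY.labels(route=route).observe(time.perf_counter() - start)
    REQUESTS.labels(route=route, status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)        # exposes /metrics for Prometheus to scrape
    while True:
        handle_request("/checkout")
        time.sleep(0.1)
```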
Distributed Tracing
- Concepts — Spans, traces, parent-child relationships, baggage/context propagation
- OpenTelemetry — Vendor-neutral standard for traces, metrics, and logs (sketch after this list)
- Jaeger — Open-source distributed tracing
- Zipkin — Another open-source tracing system
- AWS X-Ray, Datadog APM — Managed tracing solutions
- Trace Sampling — Head-based vs tail-based sampling
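A sketch of manual span creation with the OpenTelemetry Python SDK, exporting to the console for illustration; the service, span, and attribute names are assumptions, and a real deployment would export to Jaeger, an OTLP collector, or a vendor backend instead.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def checkout(order_id: str) -> None:
    # Parent span for the whole operation; nested spans become children
    # automatically via context propagation.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge-card"):
            pass  # outbound HTTP instrumentation would propagate the trace context
        with tracer.start_as_current_span("reserve-inventory"):
            pass

checkout("ord-123")
```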
Dashboards & Alerting
- Dashboard Design — Hierarchy of information, actionable views, not vanity metrics
- Grafana — Open-source dashboarding, data source agnostic
- Alerting Philosophy — Alert on symptoms not causes, reduce noise, actionable alerts
- Alert Fatigue — Too many alerts → ignored alerts → incidents
- PagerDuty / OpsGenie / Incident.io — Alert routing, escalation, on-call management
Incident Management
On-Call
- On-Call Rotations — Primary/secondary, follow-the-sun, fair scheduling
- Runbooks — Step-by-step guides for common incidents
- Escalation Policies — When and how to escalate
- Toil — Repetitive operational work that should be automated
Incident Response
- Severity Levels — SEV1 (critical) through SEV4 (minor)
- Incident Commander — Central coordination role during incidents
- Communication — Status pages, internal updates, customer communication
- Mitigation vs Root Cause Fix — Stop the bleeding first, fix properly later
- War Rooms — Real-time coordination (Slack channels, video calls)
Post-Incident Process
- Blameless Post-Mortems — Focus on systems and processes, not individuals
- Post-Mortem Template — Summary, timeline, impact, root cause, contributing factors, action items
- 5 Whys — Root cause analysis technique
- Action Items — Concrete, assigned, tracked, prioritized
- Learning Culture — Share post-mortems widely, celebrate learning from failure
Site Reliability Engineering
SLAs, SLOs, SLIs
- SLI (Service Level Indicator) — The measured metric itself (e.g., p99 request latency, proportion of successful requests)
- SLO (Service Level Objective) — Target for an SLI (e.g., 99.9% of requests served in < 200ms over 30 days)
- SLA (Service Level Agreement) — Contractual commitment with consequences
- Choosing SLOs — Based on user expectations, not technical capability
- The Nines — 99.9% = 8.76h downtime/year, 99.99% = 52.6min/year
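The downtime figures above are simple arithmetic on the availability target; a quick check:

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

for label, availability in [("99%", 0.99), ("99.9%", 0.999),
                            ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label:>8} -> {downtime_min / 60:.2f} h/year ({downtime_min:.1f} min)")
```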
Error Budgets
- Concept — Allowed unreliability = 1 - SLO target
- Budget Consumption — Track error budget burn rate
- Budget Policy — When budget is exhausted → focus on reliability over features
- Burn Rate Alerts — Alert when burning budget too fast
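A minimal sketch of the burn-rate calculation behind burn-rate alerts; the window, request counts, and paging threshold are illustrative assumptions.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is burning: 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    error_budget = 1 - slo_target        # e.g. 0.001 for a 99.9% SLO
    return error_rate / error_budget

# 99.9% SLO; over the last hour, 30 of 10,000 requests failed.
rate = burn_rate(bad_events=30, total_events=10_000, slo_target=0.999)
print(f"burn rate = {rate:.1f}x")  # 3.0x: at this pace a 30-day budget lasts ~10 days

# A common policy pages when a short-window burn rate exceeds ~14x, since that
# would exhaust a 30-day budget in roughly two days if sustained.
if rate > 14:
    print("page the on-call")
```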
Toil Management
- Definition — Manual, repetitive, automatable, tactical, no enduring value
- Toil Budget — SRE teams aim for < 50% toil
- Automation — Self-healing systems, auto-remediation, infrastructure as code
- Elimination Strategies — Automate, simplify, eliminate the need
Reliability Patterns
- Graceful Degradation — Reduced functionality instead of total failure
- Circuit Breakers — Stop calling failing services (see Distributed Systems)
- Bulkheads — Isolate failures to prevent cascading
- Timeouts — Always set timeouts, use adaptive timeouts
- Retries — Exponential backoff with jitter, idempotency requirement (sketch after this list)
- Rate Limiting — Protect services from overload (see API Design)
- Health Checks — Liveness probes, readiness probes, startup probes
- Redundancy — No single points of failure, multi-AZ, multi-region
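A sketch of the retry pattern referenced above: exponential backoff with full jitter and a bounded number of attempts. The helper name, delays, and attempt budget are illustrative, and the wrapped operation must be idempotent for retries to be safe.

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()              # should enforce its own timeout
        except Exception:
            if attempt == max_attempts:
                raise                       # retry budget exhausted, surface the error
            # Full jitter: sleep a random amount up to the exponential cap so
            # synchronized clients don't retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, cap))

# Usage with a hypothetical HTTP call (requests assumed to be installed):
# import requests
# call_with_retries(lambda: requests.get("https://example.com/api", timeout=2))
```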
SRE Team Models
- Embedded SRE — SREs sit within product teams, deep product context
- Centralized / Platform SRE — Shared SRE team provides tools and platforms for all teams
- Consulting SRE — SREs advise teams on reliability, teams own their services
- SRE vs DevOps — SRE is a specific implementation of DevOps principles with prescriptive practices
- Service Ownership — “You build it, you run it” — teams own production behavior of their services
- Production Readiness Reviews (PRR) — Checklist before a new service goes to production (monitoring, alerting, runbooks, load testing, failover)
Disaster Recovery & Business Continuity
Backup & Restore
- Backup Strategies — Full, incremental, differential backups
- RPO (Recovery Point Objective) — Maximum acceptable data loss (time-based)
- RTO (Recovery Time Objective) — Maximum acceptable downtime
- Backup Testing — Regular restore drills, verify backups actually work
- Point-in-Time Recovery (PITR) — Database WAL replay, continuous archiving
Failover Strategies
- Active-Passive — Standby takes over on primary failure, cold/warm/hot standby
- Active-Active — Multiple active regions, traffic split, conflict resolution needed
- DNS Failover — Route53 health checks, failover routing policies
- Database Failover — Automatic promotion of replicas, split-brain prevention
- Multi-Region — Cross-region replication, regional isolation, global load balancing
Disaster Recovery Tiers
- Backup & Restore — Cheapest, slowest recovery (hours)
- Pilot Light — Minimal always-on infrastructure, scale up on failure
- Warm Standby — Scaled-down copy of production, faster failover (minutes)
- Multi-Site Active-Active — Full redundancy, near-zero downtime, highest cost
Business Continuity
- DR Runbooks — Step-by-step recovery procedures per service
- DR Drills — Regular simulated failovers, tabletop exercises
- Dependency Mapping — Know your critical path and upstream/downstream dependencies
- Blast Radius Reduction — Cell-based architecture, shuffle sharding, failure domains
Traffic Management
Traffic Shifting
- Weighted Routing — Gradually shift traffic between deployments (see CI-CD)
- Traffic Mirroring / Shadowing — Duplicate production traffic to test environment without affecting users
- Dark Traffic — Send real traffic to new service version, discard responses, compare
- Geographic Routing — Route users to nearest region, latency-based routing
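A sketch of weighted routing as a gradual traffic shift; in practice the weights live in the load balancer, service mesh, or DNS policy rather than application code, and the 95/5 split is illustrative.

```python
import random

BACKENDS = {"v1-stable": 95, "v2-canary": 5}   # weights sum to 100 for clarity

def pick_backend() -> str:
    names, weights = zip(*BACKENDS.items())
    return random.choices(names, weights=weights, k=1)[0]

counts = {"v1-stable": 0, "v2-canary": 0}
for _ in range(10_000):
    counts[pick_backend()] += 1
print(counts)   # roughly 9500 / 500
```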
Load Shedding
- Priority-Based Shedding — Drop low-priority requests under load to protect critical paths
- Admission Control — Reject requests early when the system is at capacity (sketch after this list)
- Throttling — Per-client rate limits, adaptive throttling based on system health
- Backpressure Propagation — Signal upstream to slow down rather than buffering until OOM
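The admission-control sketch referenced above: reject work beyond a fixed concurrency limit instead of buffering it until memory runs out. The limit, class name, and 503 response are illustrative.

```python
import threading

def do_work(request) -> str:
    return "200 OK"                       # placeholder for the real handler

class AdmissionController:
    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_admit(self) -> bool:
        # Non-blocking acquire: if the service is at capacity, shed the request now.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()

controller = AdmissionController(max_in_flight=100)

def handle(request) -> str:
    if not controller.try_admit():
        # Fail fast instead of queueing: e.g. HTTP 503 with a Retry-After header.
        return "503 Service Unavailable"
    try:
        return do_work(request)
    finally:
        controller.release()
```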
Configuration Management in Production
- Feature Flags — Runtime configuration changes without deploys (see CI-CD)
- Dynamic Configuration — Config servers (Consul, etcd), runtime tuning without restarts
- Configuration Drift — Detect and remediate drift between intended and actual state
- Secrets Rotation — Automated rotation of credentials, certificates, API keys
Performance Engineering
Profiling
- CPU Profiling — Flame graphs, sampling profilers (perf, pprof, py-spy, async-profiler)
- Memory Profiling — Heap analysis, memory leaks, allocation tracking
- I/O Profiling — Disk I/O, network I/O, database query analysis
- Application Performance Monitoring (APM) — Datadog, New Relic, Dynatrace
Benchmarking
- Micro-Benchmarks — Test individual functions (JMH for Java, BenchmarkDotNet for .NET, Criterion for Rust)
- Macro-Benchmarks — System-level performance testing
- Statistical Rigor — Warm-up, iterations, percentiles, variance
- Comparison Testing — A/B performance comparisons, regression detection
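A sketch of a micro-benchmark with warm-up, repeated samples, and percentile reporting rather than a single average; the function under test and the iteration counts are illustrative.

```python
import json
import statistics
import time

def serialize(payload: dict) -> str:
    return json.dumps(payload)            # function under test

payload = {"id": 42, "items": list(range(100))}

# Warm-up so caches, allocators, and (in JIT runtimes) compilers settle first.
for _ in range(1_000):
    serialize(payload)

samples = []
for _ in range(5_000):
    start = time.perf_counter_ns()
    serialize(payload)
    samples.append(time.perf_counter_ns() - start)

q = statistics.quantiles(samples, n=100)  # 99 cut points
print(f"p50={q[49]:.0f}ns  p95={q[94]:.0f}ns  p99={q[98]:.0f}ns  "
      f"stdev={statistics.stdev(samples):.0f}ns")
```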
Load Testing
- Tools — k6, Locust, Gatling, JMeter, Artillery, wrk
- Test Types — Load test, stress test, soak test, spike test, breakpoint test
- Traffic Patterns — Ramp-up, steady state, spike, diurnal patterns
- Metrics to Watch — Latency percentiles, throughput, error rate, resource utilization
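A sketch of a small load test written for Locust; the endpoints, task weights, think times, and CLI parameters are all illustrative.

```python
from locust import HttpUser, between, task

class ShopUser(HttpUser):
    wait_time = between(1, 3)        # think time between requests per simulated user

    @task(5)
    def browse(self):
        self.client.get("/products")

    @task(1)
    def checkout(self):
        self.client.post("/checkout", json={"sku": "abc-123", "qty": 1})

# Example run (ramp up 200 users, hold for 15 minutes):
#   locust -f loadtest.py --host https://staging.example.com \
#          --users 200 --spawn-rate 20 --run-time 15m
# Watch latency percentiles, throughput, and error rate as load ramps.
```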
Capacity Planning
- Forecasting — Historical trends, growth projections
- Headroom — Buffer for traffic spikes, typically 2-3x average
- Auto-Scaling — Horizontal (add instances), vertical (bigger instances)
- Cost Optimization — Right-sizing, reserved instances, spot instances, serverless economics
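A back-of-the-envelope capacity calculation tying forecast, headroom, and per-instance capacity together; every number here is an assumption.

```python
import math

forecast_peak_rps = 12_000     # from historical trends plus growth projection
per_instance_rps = 400         # measured via load testing at acceptable latency
headroom_factor = 2.0          # buffer for spikes and degraded instances

required = math.ceil(forecast_peak_rps * headroom_factor / per_instance_rps)
print(f"provision at least {required} instances")          # 60

# Spread across 3 availability zones so that losing one zone still leaves
# enough capacity: the remaining 2 zones must carry the full requirement.
per_zone = math.ceil(required / 2)
print(f"{per_zone} instances per AZ, {per_zone * 3} total")
```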
Chaos Engineering
- Principles — Define steady state, hypothesize, introduce failures, observe
- Chaos Monkey — Randomly terminate instances (Netflix)
- LitmusChaos / Chaos Mesh — Kubernetes-native chaos tools
- Game Days — Planned chaos experiments with the team
- Failure Injection — Network partitions, latency injection, disk failure, DNS failure
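A sketch of application-level failure injection: a wrapper that randomly adds latency or raises an error for a fraction of calls. The rates and names are illustrative; infrastructure-level tools (Chaos Mesh, LitmusChaos) inject failures into pods, networks, and disks instead.

```python
import functools
import random
import time

def inject_failures(latency_s=0.5, latency_rate=0.05, error_rate=0.01):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < error_rate:
                raise RuntimeError("injected fault")   # simulate a failed dependency
            if roll < error_rate + latency_rate:
                time.sleep(latency_s)                  # simulate a slow dependency
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_failures(latency_s=0.3, latency_rate=0.10, error_rate=0.02)
def get_recommendations(user_id: str) -> list:
    return ["sku-1", "sku-2"]
```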
Performance Optimization Patterns
- Caching — See Caching in Data Management
- Connection Pooling — Reuse database/HTTP connections
- Async Processing — Move work off the hot path, use queues
- Batch Processing — Amortize overhead across multiple items
- Denormalization — Trade storage for read performance
- Read Replicas — Scale reads horizontally
- Compression — gzip, Brotli, zstd for network transfer
- Lazy Loading — Load data only when needed
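A sketch of connection pooling with the requests library: a shared Session reuses TCP/TLS connections across calls instead of handshaking on every request. The pool sizes and API host are illustrative.

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=50)  # per-host pool
session.mount("https://", adapter)
session.mount("http://", adapter)

def fetch_user(user_id: str) -> dict:
    # Reuses a pooled connection; avoids TCP + TLS setup on the hot path.
    response = session.get(f"https://api.example.com/users/{user_id}", timeout=2)
    response.raise_for_status()
    return response.json()
```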
Cost Engineering / FinOps
Cloud Cost Fundamentals
- Pay-Per-Use vs Reserved vs Spot — On-demand (flexible, expensive), reserved (1-3yr commitment, 30-60% savings), spot/preemptible (up to 90% savings, can be interrupted)
- Cost Allocation Tags — Tag resources by team, project, environment; required for cost visibility and showback/chargeback
- Cost Anomaly Detection — Automated alerts for unexpected spending spikes, AWS Cost Anomaly Detection, budget alerts
Right-Sizing
- Instance Type Selection — Match instance family to workload (compute-optimized, memory-optimized, general purpose, GPU)
- CPU / Memory Utilization Analysis — Many instances run well below capacity (often 10-30% utilization); downsizing frequently has no performance impact
- Auto-Scaling Efficiency — Scale-to-zero for dev/staging, aggressive scale-down policies, scheduled scaling for predictable patterns
Reserved Capacity
- Reserved Instances (AWS) — Standard vs convertible, all-upfront/partial/no-upfront, 1yr vs 3yr, scope (regional vs zonal)
- Savings Plans (AWS) — Compute Savings Plans (flexible across instance types), EC2 Instance Savings Plans (specific family)
- Committed Use Discounts (GCP) — Spend-based or resource-based commitments, automatic application
- Commitment Tradeoffs — Lock-in risk, capacity planning accuracy required, mix of reserved + on-demand for flexibility
Spot / Preemptible
- Fault-Tolerant Workloads — Batch processing, CI/CD runners, stateless web workers, data processing
- Spot Interruption Handling — 2-minute warning (AWS), graceful shutdown, checkpointing, instance diversification
- Mixed Instance Strategies — Spot Fleet, mixed instance policies, fallback to on-demand, Karpenter (Kubernetes)
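A sketch of handling the spot interruption warning by polling the EC2 instance metadata service (IMDSv2); the endpoint paths are AWS's documented ones, while the polling interval and the checkpoint_and_drain hook are illustrative assumptions.

```python
import time
import requests

IMDS = "http://169.254.169.254"

def imds_token() -> str:
    # IMDSv2: fetch a short-lived session token first.
    return requests.put(f"{IMDS}/latest/api/token",
                        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
                        timeout=1).text

def interruption_pending() -> bool:
    # This path returns 404 until an interruption is scheduled, then 200.
    resp = requests.get(f"{IMDS}/latest/meta-data/spot/instance-action",
                        headers={"X-aws-ec2-metadata-token": imds_token()},
                        timeout=1)
    return resp.status_code == 200

def checkpoint_and_drain() -> None:
    # Illustrative hook: flush in-progress work, stop accepting new jobs.
    print("interruption notice received: checkpointing and draining")

while True:
    if interruption_pending():
        checkpoint_and_drain()
        break
    time.sleep(5)
```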
Storage Cost Optimization
- Lifecycle Policies — Auto-transition objects to cheaper tiers (S3 Standard → Infrequent Access → Glacier)
- Storage Class Transitions — Hot/warm/cold/archive tiers, retrieval costs, minimum storage duration charges
- Data Transfer Costs — Egress fees (often the hidden cost), cross-region transfer, NAT Gateway costs, VPC endpoints to avoid transfer fees
- Compression — Compress data at rest and in transit, columnar formats (Parquet) for analytics, gzip/zstd for logs
Unit Economics
- Cost Per Request — Total infrastructure cost / total requests, track over time
- Cost Per User — Infrastructure cost / MAU, benchmark against revenue per user
- Cost Per Transaction — Database + compute + storage cost per business transaction
- Tracking Cost Efficiency — Cost per unit of value should decrease over time as you scale and optimize
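A back-of-the-envelope unit-economics calculation; all figures are made up for illustration.

```python
monthly_infra_cost = 42_000        # USD, attributed via cost allocation tags
monthly_requests = 1.8e9
monthly_active_users = 250_000

cost_per_million_requests = monthly_infra_cost / (monthly_requests / 1e6)
cost_per_user = monthly_infra_cost / monthly_active_users

print(f"${cost_per_million_requests:.2f} per million requests")   # ~$23.33
print(f"${cost_per_user:.2f} per MAU")                            # ~$0.17
# Track these over time: cost per unit of value should trend down as you scale.
```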
FinOps Practices
- Cost Visibility — Dashboards showing real-time spend by team/service/environment, trend analysis
- Showback / Chargeback — Showback: show teams their costs (awareness). Chargeback: bill teams for their usage (accountability).
- Budgets and Alerts — Per-team budgets, threshold alerts (50%, 80%, 100%), forecasted overspend alerts
- Tagging Strategy — Mandatory tags (team, environment, project, cost-center), tag enforcement policies, untagged resource reports
- FinOps Team Models — Centralized FinOps team, federated (embedded in engineering), hybrid; FinOps Foundation maturity model (crawl, walk, run)
- Tools — CloudHealth (VMware), Kubecost (Kubernetes cost allocation), Infracost (IaC cost estimation in PRs), AWS Cost Explorer, GCP Billing Reports, Vantage, Spot.io
reliability sre operations observability performance disaster-recovery traffic-management finops cost-engineering