
Reliability, Observability & Security

SLOs and error budgets, the three pillars of observability, chaos engineering, disaster recovery, AuthN/AuthZ, API security — the operational concerns every system must address.


Why this module exists

The previous nine modules covered how to build a system. This one covers how to keep it standing after you do. Reliability, observability, and security are not optional features added at the end — they are properties of the entire architecture, baked in from the first decisions.

The topics here are the ones that distinguish junior architects from senior ones. A junior can draw the boxes; a senior can tell you which box will fail first, how you'll know, what the blast radius will be, and how the system will keep running while it's broken. The patterns and disciplines below are what that knowledge looks like in practice.

This is also the longest module to internalise, because the topics are operational — they only fully make sense after you've watched a production system misbehave. Read it once now for the framing; revisit each section when its concern hits you in real life.

SLI, SLO, SLA, and error budgets

Reliability targets need numbers, and the numbers need definitions. Google's SRE book introduced the now-standard vocabulary, and it's universal enough that every serious team should know it.

   SLI  measurement       "5xx rate over 5 min"
   SLO  internal target   "5xx rate < 0.1%"           ← what we aim for
   SLA  customer promise  "99.9% uptime or 10% credit" ← what we agree to

Error budgets are the practical lever that comes from SLOs. If your SLO is 99.9% success, then 0.1% of requests are allowed to fail in the measurement window (typically 30 days). That 0.1% is your error budget.

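The rules of error-budget-driven engineering: while budget remains, the team is free to ship features and take deployment risk. When the budget is exhausted, feature releases pause and engineering effort shifts to reliability work until the service is back within its SLO.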

The budget converts an abstract debate ("are we reliable enough?") into a concrete decision ("can we deploy this Friday?"). It also resolves the eternal tension between development and operations — both teams share one budget and one set of incentives.
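The budget arithmetic is simple enough to sketch. The example below is a minimal illustration, not a standard API: the `ErrorBudget` class, its field names, and the 10% deploy threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float            # target success ratio, e.g. 0.999
    total_requests: int   # requests served so far in the window
    failed_requests: int  # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1 - self.slo) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 = untouched, 0.0 = fully spent, negative = overspent.
        if self.allowed_failures == 0:
            return 0.0
        return 1 - self.failed_requests / self.allowed_failures

    def can_deploy(self, min_remaining: float = 0.1) -> bool:
        # Hypothetical policy: risky changes need at least 10% budget left.
        return self.remaining_fraction >= min_remaining


budget = ErrorBudget(slo=0.999, total_requests=10_000_000, failed_requests=4_000)
print(round(budget.allowed_failures))       # failures the SLO tolerates
print(round(budget.remaining_fraction, 2))  # share of the budget still unspent
print(budget.can_deploy())
```

With 10 million requests at a 99.9% SLO, roughly 10,000 failures are tolerated; 4,000 failures leave about 60% of the budget, so the Friday deploy is defensible under this policy.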

Choosing SLIs that matter. A bad SLI is "CPU below 80%." Customers don't care about your CPU. A good SLI is the user-visible behaviour: "checkout completed within 500ms," "image loaded within 1s," "search returned results." Tie SLIs to journeys, not to infrastructure.

Choosing SLO numbers. Be honest. 99.99% requires multi-region, automated failover, and chaos testing. 99.9% is achievable for most well-run services. 99% is fine for many internal tools. Picking a higher number than you can hit just guarantees the budget is always exhausted; picking too low means you're under-investing in reliability.
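It helps to translate each candidate SLO into the downtime it permits. A small sketch, assuming a time-based availability SLI and the 30-day window mentioned earlier (the function name is illustrative):

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates within the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


for slo in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo:.2%} allows about {minutes:.1f} minutes per 30 days")
```

The jump from 99.9% to 99.99% shrinks the allowance from about 43 minutes to about 4 minutes a month, which is less time than a human usually needs to notice a page and log in. That is why the higher target implies automated failover.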

The three pillars of observability

Observability is the ability to ask new questions about a running system without re-deploying or instrumenting it further. Three data types form the standard pillars: logs, metrics, traces. Each has different strengths; together they form a complete picture.

1. Logs. Time-stamped, structured records of events. "User 42 logged in." "DB query took 200ms." "OutOfMemoryError in worker-3."

Properties: high cardinality (one event per record), expensive at scale (storage and indexing), great for root-cause investigation ("what exactly happened at 14:23?"). Structured logs (JSON, key-value) are searchable; free-text logs need painful regex.

Tooling: Loki, ELK (Elasticsearch + Logstash + Kibana), Splunk, Datadog Logs, CloudWatch Logs. The pattern: emit JSON to stdout, collected by a sidecar (Vector, Fluentd, Logstash), stored centrally.

Rule: log enough to debug, not so much that you can't afford it. Sample debug-level logs (1 in 1000) in production. Always log errors. Include trace IDs (next pillar) for correlation.
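The rule above can be sketched with the standard library alone. This is one possible shape, not a recommended production setup: `JsonFormatter`, `DebugSampler`, and `build_logger` are names invented for the example, and the `trace_id` field is passed through `extra`.

```python
import json
import logging
import random
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            # trace_id arrives via logger.error(..., extra={"trace_id": ...})
            "trace_id": getattr(record, "trace_id", None),
        })


class DebugSampler(logging.Filter):
    """Pass every record above DEBUG; pass DEBUG records at the given rate."""

    def __init__(self, rate: float):
        super().__init__()
        self.rate = rate
        self.rng = random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True
        return self.rng.random() < self.rate


def build_logger(name: str, debug_sample_rate: float = 0.001, stream=sys.stdout):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(DebugSampler(debug_sample_rate))
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


log = build_logger("checkout")
log.debug("cart contents loaded", extra={"trace_id": "abc123"})  # kept ~1 in 1000
log.error("payment failed", extra={"trace_id": "abc123"})         # always kept
```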

2. Metrics. Aggregated numerical measurements over time. Counts, rates, gauges, histograms.

Properties: low cardinality (a few labels, fixed dimensions), cheap to store and query (a 30-day timeseries is small), great for dashboards and alerting. Cannot tell you about a specific request — only about aggregate behaviour.

Tooling: Prometheus (the de-facto standard for cloud-native), Graphite, Datadog, CloudWatch Metrics. Metrics use a pull or push model; Prometheus pulls on a configurable scrape interval (15s is a common setting) from /metrics endpoints that expose counters, gauges, and histograms in the Prometheus text or OpenMetrics format.

The USE method (Brendan Gregg) for infrastructure: every resource has Utilisation, Saturation, Errors. Track all three per resource (CPU, memory, disk, network). The RED method for services: every service has Rate, Errors, Duration. Track these per endpoint. With USE on hosts and RED on services, you can spot most problems.
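To make RED concrete, the sketch below derives Rate, Errors, and Duration per endpoint from a batch of request records. In practice a metrics library computes these from counters and histograms; the `Request` type, the nearest-rank p95, and the choice to count only 5xx responses as errors are assumptions for illustration.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    endpoint: str
    status: int
    duration_ms: float


def red_summary(requests, window_seconds: float) -> dict:
    """Rate, error ratio, and p95 duration per endpoint for one window."""
    by_endpoint = defaultdict(list)
    for req in requests:
        by_endpoint[req.endpoint].append(req)

    summary = {}
    for endpoint, reqs in by_endpoint.items():
        durations = sorted(r.duration_ms for r in reqs)
        errors = sum(1 for r in reqs if r.status >= 500)
        # Nearest-rank percentile: the value at position ceil(0.95 * n).
        p95_index = max(0, math.ceil(0.95 * len(durations)) - 1)
        summary[endpoint] = {
            "rate_per_s": len(reqs) / window_seconds,
            "error_ratio": errors / len(reqs),
            "p95_ms": durations[p95_index],
        }
    return summary


sample = [Request("/checkout", 200, float(ms)) for ms in range(1, 20)]
sample.append(Request("/checkout", 500, 20.0))
print(red_summary(sample, window_seconds=10))
```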

A warning: high-cardinality labels are the silent killer of metrics systems. Don't tag metrics with user_id (millions of unique values) or request_id (billions). Use logs/traces for that level of detail.
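The reason is multiplicative: in a label-based metrics system, each distinct combination of label values is its own time series. A back-of-the-envelope sketch (the label names and counts are made up for illustration):

```python
from math import prod


def series_count(label_cardinalities: dict) -> int:
    """Worst-case number of time series for one metric name."""
    return prod(label_cardinalities.values())


safe = {"endpoint": 50, "method": 5, "status": 10}
risky = {**safe, "user_id": 1_000_000}

print(series_count(safe))   # a few thousand series: fine
print(series_count(risky))  # billions of series: the metrics store falls over
```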

3. Traces. Distributed request flow records. "This request hit the gateway, then service A, then service B, then the database. Total: 234ms. Service B took 180ms."

Properties: per-request (high cardinality, usually sampled), great for understanding flow and finding latency bottlenecks across service boundaries. OpenTelemetry is the unified, vendor-neutral standard for instrumenting traces, metrics, and logs.

Tooling: Jaeger, Zipkin, Tempo, Honeycomb, Datadog APM, AWS X-Ray. The pattern: each service instruments its incoming and outgoing calls, sends spans to a collector, the collector reconstructs full traces.

A trace is a tree of spans:

   gateway (234ms) ───────────────────────────────────
     ├── service-A (210ms) ─────────────────────────
     │     └── DB query (50ms)
     ├── service-B (180ms) ──────────────────────
     │     ├── cache miss (10ms)
     │     └── service-C (160ms) ────────────
     │           └── external API (150ms)  ←  CULPRIT
     └── service-D (15ms)

Sampling: keep ~1% of traces in production. Sample 100% in dev. Use tail-based sampling (decide after the trace completes) to always keep traces with errors or slow latency.
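A tail-based sampling decision might look like the following sketch. Real collectors implement this as configuration rather than hand-written code; the `CompletedTrace` type, the 1-second slow threshold, and the hash-based baseline are assumptions chosen for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CompletedTrace:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(trace: CompletedTrace,
               slow_threshold_ms: float = 1000.0,
               baseline_rate: float = 0.01) -> bool:
    """Decide after the trace has finished whether to store it."""
    # Interesting traces are always kept.
    if trace.has_error or trace.duration_ms >= slow_threshold_ms:
        return True
    # Everything else is kept at the baseline rate. Hashing the trace ID
    # makes the decision deterministic, so every collector instance that
    # sees the same trace reaches the same verdict.
    digest = hashlib.sha256(trace.trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < baseline_rate


print(keep_trace(CompletedTrace("a1", duration_ms=40, has_error=True)))
print(keep_trace(CompletedTrace("b2", duration_ms=2500, has_error=False)))
```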

Putting them together. A typical investigation: alert fires (metrics) → look at affected endpoint dashboard (metrics) → drill into a slow trace (traces) → grep logs for the trace ID to see what each service was doing (logs). Each pillar earns its keep at a different step.

The correlation glue is a trace ID propagated through every layer — set at the entry point, attached to every log line, included in every downstream request header. Without it, you cannot stitch the three pillars together for a specific request.
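One widely used header for this is `traceparent` from the W3C Trace Context specification, whose value has the form `version-traceid-parentid-flags`. The sketch below shows the propagation idea only; in a real service an OpenTelemetry SDK would normally handle it, and the helper names here are invented for the example.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the entry point."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Keep the trace ID, mint a new span ID for the outgoing call."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Missing or malformed header: this service becomes the entry point.
        return new_traceparent()
    trace_id, _parent_span, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(traceparent: str) -> str:
    """Extract the trace ID to attach to every log line."""
    return traceparent.split("-")[1]


incoming = new_traceparent()
outgoing = child_traceparent(incoming)
print(trace_id_of(incoming) == trace_id_of(outgoing))  # same trace, new span
```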

Chaos engineering

If you don't deliberately break your system in controlled experiments, production will break it for you at the worst possible time. Chaos engineering is the practice of injecting failures into a running system to verify it behaves as designed when components misbehave.

The insight came from Netflix in the early 2010s: their cloud-hosted infrastructure was supposed to be resilient, but the only way to know was to actually kill things and watch. They built Chaos Monkey — a service that randomly terminated EC2 instances during business hours. Engineers learned fast.

The modern discipline has matured into a small set of practices:

Failure types worth injecting:

Tooling: Chaos Monkey and the broader Simian Army, Gremlin, LitmusChaos (Kubernetes-native), AWS Fault Injection Service, Azure Chaos Studio.

Game days are scheduled exercises where on-call engineers respond to injected incidents. The injection is a known fault, but the team practises the response — detection, communication, mitigation, postmortem. Most teams discover during game days that their runbooks are out of date, their dashboards have blind spots, or their on-call rotation has handoff gaps.

The pre-requisites for chaos engineering: solid observability (you can't watch for impact without it), a clear rollback path, and confidence that single-component failures won't cascade. Don't run chaos experiments before you have monitoring and basic resilience. Start with game days; graduate to automated chaos once the team has muscle memory.
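At its smallest, a chaos experiment is a hypothesis plus an injected fault. The sketch below wraps a dependency call with configurable failure and latency, then checks that a fallback keeps the caller working. It is a toy, in-process illustration; the tools listed above inject faults at the infrastructure level, and every name here is hypothetical.

```python
import random
import time


class InjectedFault(Exception):
    """Raised instead of calling the real dependency."""


def with_chaos(func, error_rate=0.0, extra_latency_s=0.0, sleep=time.sleep):
    """Wrap func so that calls fail or slow down at the configured rates."""
    rng = random.Random()

    def wrapper(*args, **kwargs):
        if extra_latency_s > 0:
            sleep(extra_latency_s)          # simulate a slow dependency
        if rng.random() < error_rate:
            raise InjectedFault(func.__name__)  # simulate a dead dependency
        return func(*args, **kwargs)

    return wrapper


def fetch_recommendations():
    return ["live-item-1", "live-item-2"]


def render_page(fetch):
    # Hypothesis under test: the page still renders when the
    # recommendations dependency is completely unavailable.
    try:
        items = fetch()
    except InjectedFault:
        items = ["cached-bestseller"]       # degraded but functional
    return {"status": 200, "items": items}


broken = with_chaos(fetch_recommendations, error_rate=1.0)
print(render_page(broken))
```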

Disaster recovery

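Disaster recovery is the plan for when an entire region of your infrastructure becomes unavailable: datacentre fire, cloud-provider outage, ransomware, accidental deletion of a critical table by a tired engineer at 2 AM. The two three-letter acronyms are RTO and RPO. RTO (Recovery Time Objective) is the maximum acceptable time from disaster to restored service. RPO (Recovery Point Objective) is the maximum acceptable window of data loss, measured as time back from the moment of failure.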

These drive every architectural decision about DR. Tight RTO + tight RPO = expensive multi-region active-active. Loose RTO + loose RPO = nightly backups to a different region, restored on demand.

Four standard DR strategies, in increasing cost and decreasing RTO:

1. Backup and restore. Periodic backups in another region. On disaster, provision new infra and restore. RTO: hours to days. RPO: hours (the gap between backups). Cheapest. Right for non-critical systems, and worth keeping as the bottom-tier safety net beneath every other strategy.

2. Pilot light. Critical infrastructure (DB, key services) replicated continuously to a second region in minimal form. Other services dormant. On disaster, scale up the dormant services. RTO: tens of minutes. RPO: seconds (replication lag).

3. Warm standby. A scaled-down but running copy of the entire system in a second region. Traffic flips on disaster; the standby scales up to absorb full load. RTO: minutes. RPO: seconds.

4. Active-active multi-region. Both regions serving live traffic. On region failure, traffic shifts entirely to the surviving region. RTO: seconds to under a minute (global load balancer or DNS failover, bounded by health-check intervals and DNS TTLs). RPO: near zero. Most expensive: 2× capacity, cross-region data sync complexity, more cognitive load on every architectural choice.

                  RTO         RPO         Cost
   Backup        hours       hours       $
   Pilot light   ~30 min     seconds     $$
   Warm standby  ~5 min      seconds     $$$
   Active-active <1 min      ~zero       $$$$
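Read as a decision procedure, the table says: pick the cheapest strategy whose RTO and RPO both fit the business requirement. A sketch of that lookup follows. The minute values are rough stand-ins for the table's "hours", "seconds", and "~zero" and are assumptions for the example, not measured figures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # illustrative achievable recovery time
    rpo_minutes: float  # illustrative achievable data-loss window
    cost_rank: int      # 1 = cheapest


# Assumed figures: nightly backups, ~8 h restore, ~1 min replication lag.
STRATEGIES = [
    Strategy("backup and restore", rto_minutes=8 * 60, rpo_minutes=24 * 60, cost_rank=1),
    Strategy("pilot light", rto_minutes=30, rpo_minutes=1, cost_rank=2),
    Strategy("warm standby", rto_minutes=5, rpo_minutes=1, cost_rank=3),
    Strategy("active-active", rto_minutes=1, rpo_minutes=0, cost_rank=4),
]


def cheapest_strategy(max_rto_minutes: float, max_rpo_minutes: float) -> Optional[Strategy]:
    """Cheapest strategy that meets both objectives, or None if none does."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= max_rto_minutes and s.rpo_minutes <= max_rpo_minutes
    ]
    return min(candidates, key=lambda s: s.cost_rank, default=None)


print(cheapest_strategy(max_rto_minutes=60, max_rpo_minutes=5).name)
print(cheapest_strategy(max_rto_minutes=10, max_rpo_minutes=5).name)
```

Working from requirements toward a strategy, rather than the other way round, keeps the conversation anchored on what the business can tolerate instead of on what the architecture diagram looks like.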

The replication problem. Active-active across regions means cross-region writes. Cross-region round trips typically cost tens to a couple of hundred milliseconds, which makes synchronous quorums slow; asynchronous replication risks data loss on failover. Many active-active systems pick one region as the primary for writes (asymmetric active-active), or partition data by region (each region owns specific users).

The most important DR practice is testing. A DR plan that hasn't been exercised in a year does not work; it just hasn't been disproven. Run a regional failover quarterly. Time it. Find what's missing — DNS TTLs you forgot to lower, IAM roles that don't exist in the secondary region, deployment pipelines that hard-coded the primary's account.

Backup hygiene. Backups in the same region as the primary DB are useless when the region goes. Backups in the same account are vulnerable to credential compromise. Apply: cross-region, cross-account, point-in-time recovery, periodic restore tests. The restore is the actual backup; the backup that has never been restored is theatre.

Authentication and authorisation

Two related but distinct concepts that get conflated constantly. Authentication (AuthN) answers "who are you?" Authorisation (AuthZ) answers "what are you allowed to do?"

The wrong pattern: "I'm logged in, so I can do anything." The right pattern: authentication establishes identity; every operation checks authorisation against that identity.

Authentication mechanisms.

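Session representation. Two camps: server-side sessions, where the client holds an opaque session ID and the server looks the session up in a store on every request (easy to revoke, but every request costs a lookup); and stateless signed tokens such as JWTs, where the claims travel inside the token and the server only verifies a signature (fast, but hard to revoke before expiry).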

For most apps, short-lived JWTs (15 minutes) plus a long-lived refresh token gives you the best of both: fast verification, bounded blast radius if a token leaks. Logout invalidates the refresh token; access tokens expire naturally.
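The access-plus-refresh arrangement can be sketched with the standard library. This is a teaching example, assuming HS256-signed tokens and an in-memory refresh store; production systems should use a maintained JWT library and a durable store. All function and class names are illustrative.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _sign(secret: bytes, signing_input: str) -> str:
    return _b64url(hmac.new(secret, signing_input.encode(), hashlib.sha256).digest())


def issue_access_token(subject: str, secret: bytes, ttl_s: int = 900, now=None) -> str:
    now = int(time.time()) if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = _b64url(json.dumps({"sub": subject, "iat": now, "exp": now + ttl_s}).encode())
    return f"{header}.{claims}.{_sign(secret, f'{header}.{claims}')}"


def verify_access_token(token: str, secret: bytes, now=None):
    """Return the claims if signature and expiry check out, else None."""
    now = int(time.time()) if now is None else now
    parts = token.split(".")
    if len(parts) != 3:
        return None
    header, claims, signature = parts
    # Always verify as HS256; never trust the algorithm named in the header.
    expected = _sign(secret, f"{header}.{claims}")
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    payload = json.loads(base64.urlsafe_b64decode(claims + "=" * (-len(claims) % 4)))
    return payload if payload["exp"] > now else None


class RefreshStore:
    """Server-side record of refresh tokens, so logout can revoke them."""

    def __init__(self):
        self._subjects = {}

    def issue(self, subject: str) -> str:
        token = secrets.token_urlsafe(32)
        self._subjects[token] = subject
        return token

    def refresh(self, refresh_token: str, secret: bytes):
        subject = self._subjects.get(refresh_token)
        return None if subject is None else issue_access_token(subject, secret)

    def logout(self, refresh_token: str) -> None:
        self._subjects.pop(refresh_token, None)


SECRET = secrets.token_bytes(32)
store = RefreshStore()
refresh_token = store.issue("user-42")
access_token = store.refresh(refresh_token, SECRET)
print(verify_access_token(access_token, SECRET)["sub"])
```

Verification needs only the secret, not a database lookup, which is where the speed comes from. Revocation lives entirely in the refresh store, which is why a leaked access token stays usable until it expires.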

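Authorisation models. RBAC (role-based access control) grants permissions to roles and assigns roles to users. ABAC (attribute-based access control) decides from attributes of the user, the resource, and the request context. ReBAC (relationship-based access control) derives permissions from relationships between entities, such as ownership, group membership, or folder hierarchy.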

The practical rule: start with RBAC. Add attribute checks for the cases that don't fit. Reach for ReBAC only when you have explicit hierarchical/relational permissions (folders, organisations, sharing).
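"RBAC first, attributes where needed" might look like the sketch below. The roles, permission strings, and the ownership rule are invented for the example; the point is that the check runs on every operation against the authenticated identity.

```python
from dataclasses import dataclass, field

# Role -> permissions table: the RBAC core.
ROLE_PERMISSIONS = {
    "viewer": {"doc:read"},
    "editor": {"doc:read", "doc:write"},
    "admin": {"doc:read", "doc:write", "doc:delete"},
}


@dataclass
class User:
    id: str
    roles: set = field(default_factory=set)


@dataclass
class Document:
    id: str
    owner_id: str


def is_allowed(user: User, action: str, doc: Document) -> bool:
    # Step 1 (RBAC): does any of the user's roles grant this action?
    granted = set()
    for role in user.roles:
        granted |= ROLE_PERMISSIONS.get(role, set())
    if action not in granted:
        return False
    # Step 2 (attribute check): non-admins may only write their own documents.
    if action == "doc:write" and "admin" not in user.roles:
        return doc.owner_id == user.id
    return True


alice = User("alice", {"editor"})
doc = Document("d1", owner_id="bob")
print(is_allowed(alice, "doc:read", doc))   # role grants read
print(is_allowed(alice, "doc:write", doc))  # role grants write, ownership does not
```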

Common AuthZ failures.

API security

The web's hostile environment plus the API's machine-callability create a particular set of failure modes. The OWASP API Security Top 10 is the canonical list; the concerns repeat across nearly every API breach in the news.

Defence-in-depth at the edge.

Input validation. Every untrusted input is hostile until proven otherwise.

SQL injection. Still one of the most common high-severity vulnerabilities. The fix is universal and simple: parameterised queries. Virtually every database driver in every mainstream language supports them.

   BAD (concat):    "SELECT * FROM users WHERE id = " + user_input
   GOOD (param):    "SELECT * FROM users WHERE id = ?"  with [user_input]

ORMs do this automatically if used correctly. The exception is raw SQL strings — never build them with concatenation.
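The difference is easy to demonstrate with Python's built-in `sqlite3` module and an in-memory database. The table and function names are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)


def find_user_unsafe(conn, user_input: str):
    # BAD: the input becomes part of the SQL text, so it can change the query.
    query = "SELECT id, name FROM users WHERE id = " + user_input
    return conn.execute(query).fetchall()


def find_user(conn, user_input):
    # GOOD: the input is bound as a value and is never parsed as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE id = ?", (user_input,)
    ).fetchall()


malicious = "1 OR 1=1"
print(find_user_unsafe(conn, malicious))  # leaks every row
print(find_user(conn, malicious))         # matches nothing
print(find_user(conn, 1))                 # normal lookup still works
```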

XSS (Cross-Site Scripting). User-controlled content rendered as HTML without escaping. Attacker submits <script>steal(cookie)</script> as a comment; victim views the comment; their browser runs the script.

Defences: escape on output (mainstream templating engines do this by default if you use them correctly), a Content-Security-Policy header limiting where scripts can come from, and HttpOnly cookies so that an injected script cannot read the session cookie.
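Escaping on output is a one-line operation in most languages. A minimal sketch using Python's standard `html.escape`; the `render_comment` helper and the example policy string are assumptions for illustration, and a real application would rely on its templating engine's auto-escaping instead.

```python
import html


def render_comment(comment: str) -> str:
    # Escape at the point of output, so stored data stays unmodified.
    return f"<p>{html.escape(comment)}</p>"


# An example restrictive policy: scripts may load only from the page's origin.
SECURITY_HEADERS = {
    "Content-Security-Policy": "default-src 'self'; script-src 'self'",
}

payload = "<script>steal(cookie)</script>"
print(render_comment(payload))
```

The browser now displays the payload as text instead of executing it, and the policy header acts as a second layer if an escaping bug slips through elsewhere.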

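CSRF (Cross-Site Request Forgery). Attacker tricks the user's browser into making an authenticated request to your site. Defences: CSRF tokens on state-changing requests, SameSite cookies so that they don't ride along with cross-site requests, and Origin header checks. A tight CORS policy helps but is not sufficient by itself, because CORS restricts which sites can read responses, not which sites can send simple requests.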
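One common way to build CSRF tokens is to derive them from the session with an HMAC, so the server can verify a token without storing it. The sketch below assumes that scheme; the function names and the choice of SHA-256 are illustrative, and web frameworks usually ship their own implementation.

```python
import hashlib
import hmac
import secrets


def issue_csrf_token(session_id: str, secret: bytes) -> str:
    """Token embedded in forms served to this session."""
    return hmac.new(secret, session_id.encode(), hashlib.sha256).hexdigest()


def is_valid_csrf_token(session_id: str, submitted: str, secret: bytes) -> bool:
    """Check the token on every state-changing request."""
    expected = issue_csrf_token(session_id, secret)
    # Constant-time comparison avoids leaking the token via timing.
    return hmac.compare_digest(expected.encode(), submitted.encode())


SERVER_SECRET = secrets.token_bytes(32)
token = issue_csrf_token("session-abc", SERVER_SECRET)

print(is_valid_csrf_token("session-abc", token, SERVER_SECRET))     # genuine form
print(is_valid_csrf_token("session-abc", "forged", SERVER_SECRET))  # attacker guess
```

An attacker's page can make the victim's browser send cookies, but it cannot read the token out of your pages, so the forged request arrives without a valid one.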

Secrets management.

Logging — what NOT to log.

A single all-encompassing rule covers most of this: trust nothing from the network, validate everything, give the smallest privilege that works, log what you need and not what you don't. It is boring. It is what stops you ending up in the news.

This closes the High-Level System Design series. Across ten modules we've covered the vocabulary, the storage choices, the caching layers, the network plumbing, the asynchronous backbone, the architectural patterns, the distributed-systems primitives, the canonical designs, the supplementary topics, and the operational disciplines. The skill from here is practice — read postmortems, design systems on paper, ship things and watch them break, then read this series again and notice which parts you used and which parts you didn't. Architecture is a craft that gets sharper with every system you build.

