Reliability, Observability & Security
SLOs and error budgets, the three pillars of observability, chaos engineering, disaster recovery, AuthN/AuthZ, API security — the operational concerns every system must address.
Why this module exists
The previous nine modules covered how to build a system. This one covers how to keep it standing after you do. Reliability, observability, and security are not optional features added at the end — they are properties of the entire architecture, baked in from the first decisions.
The topics here are the ones that distinguish junior architects from senior ones. A junior can draw the boxes; a senior can tell you which box will fail first, how you'll know, what the blast radius will be, and how the system will keep running while it's broken. The patterns and disciplines below are what that knowledge looks like in practice.
This is also the longest module to internalise, because the topics are operational — they only fully make sense after you've watched a production system misbehave. Read it once now for the framing; revisit each section when its concern hits you in real life.
SLI, SLO, SLA, and error budgets
Reliability targets need numbers, and the numbers need definitions. Google's SRE book introduced the now-standard vocabulary, and it's universal enough that every serious team should know it.
- SLI — Service Level Indicator. A measurement. The thing you watch. "p99 latency of the checkout endpoint." "Successful request rate over 5 minutes." "Time-to-first-byte for the homepage." The defining characteristic is that an SLI is a number derived from real production data, not a wish.
- SLO — Service Level Objective. A target for an SLI. "p99 latency under 300 ms." "Successful request rate at least 99.9%." An SLO is internal: the target the team holds itself to on behalf of its users, whether or not those users ever see the number.
- SLA — Service Level Agreement. A contractual promise built on an SLO, with consequences when violated (credits, refunds, public status). It is usually somewhat looser than your internal SLO so that you keep a safety margin: for example, you commit to 99.9% externally and target 99.95% internally.
SLI   measurement        "5xx rate over 5 min"
SLO   internal target    "5xx rate < 0.1%"               ← what we aim for
SLA   customer promise   "99.9% uptime or 10% credit"    ← what we agree to
Error budgets are the practical lever that comes from SLOs. If your SLO is 99.9% success, then 0.1% of requests are allowed to fail in the measurement window (typically 30 days). That 0.1% is your error budget.
The rules of error-budget-driven engineering:
- Budget remaining → ship features. If you have plenty of budget left, you're being too conservative. Take more risk, deploy more aggressively.
- Budget exhausted → freeze risky changes. No new features until reliability is back. Focus on bugs, monitoring, hardening.
The budget converts an abstract debate ("are we reliable enough?") into a concrete decision ("can we deploy this Friday?"). It also resolves the eternal tension between development and operations — both teams share one budget and one set of incentives.
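The budget arithmetic is small enough to sketch directly. The following is a minimal illustration, not a standard library or tool: the names `ErrorBudget`, `can_ship_risky_change`, and the `reserve` margin are assumptions made up for this example.

```python
from dataclasses import dataclass


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


@dataclass
class ErrorBudget:
    slo: float            # e.g. 0.999 for a 99.9% success target
    total_requests: int   # requests observed so far in the window
    failed_requests: int  # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        return (1 - self.slo) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        """1.0 = untouched budget, 0.0 = exhausted, negative = overspent."""
        if self.allowed_failures == 0:
            return 0.0
        return 1 - self.failed_requests / self.allowed_failures


def can_ship_risky_change(budget: ErrorBudget, reserve: float = 0.1) -> bool:
    # `reserve` is a hypothetical safety margin chosen for this sketch.
    return budget.remaining_fraction > reserve
```

With a 99.9% SLO over 30 days, the first function gives roughly 43 minutes of tolerated full outage, which is often an easier number to reason about than a percentage.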
Choosing SLIs that matter. A bad SLI is "CPU below 80%." Customers don't care about your CPU. A good SLI is the user-visible behaviour: "checkout completed within 500ms," "image loaded within 1s," "search returned results." Tie SLIs to journeys, not to infrastructure.
Choosing SLO numbers. Be honest. 99.99% requires multi-region, automated failover, and chaos testing. 99.9% is achievable for most well-run services. 99% is fine for many internal tools. Picking a higher number than you can hit just guarantees the budget is always exhausted; picking too low means you're under-investing in reliability.
The three pillars of observability
Observability is the ability to ask new questions about a running system without re-deploying or instrumenting it further. Three data types form the standard pillars: logs, metrics, traces. Each has different strengths; together they form a complete picture.
1. Logs. Time-stamped, structured records of events. "User 42 logged in." "DB query took 200ms." "OutOfMemoryError in worker-3."
Properties: high cardinality (one event per record), expensive at scale (storage and indexing), great for root-cause investigation ("what exactly happened at 14:23?"). Structured logs (JSON, key-value) are searchable; free-text logs need painful regex.
Tooling: Loki, ELK (Elasticsearch + Logstash + Kibana), Splunk, Datadog Logs, CloudWatch Logs. The pattern: emit JSON to stdout, collected by a sidecar (Vector, Fluentd, Logstash), stored centrally.
Rule: log enough to debug, not so much that you can't afford it. Sample debug-level logs (1 in 1000) in production. Always log errors. Include trace IDs (next pillar) for correlation.
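As a rough sketch of the logging rule above, using only Python's standard `logging` module: a formatter that emits one JSON object per line with a trace ID, and a filter that keeps one debug record in N. The class names `JsonFormatter` and `DebugSampler` and the field names are assumptions for illustration; the counter-based sampler is not thread-safe.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via logger.info(..., extra={"trace_id": "..."}).
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


class DebugSampler(logging.Filter):
    """Keep every record above DEBUG, and one in `keep_one_in` DEBUG records."""

    def __init__(self, keep_one_in: int = 1000) -> None:
        super().__init__()
        self.keep_one_in = keep_one_in
        self._seen = 0

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True
        self._seen += 1
        return self._seen % self.keep_one_in == 0


def build_logger(name: str = "app") -> logging.Logger:
    handler = logging.StreamHandler(sys.stdout)  # stdout, for a collector to pick up
    handler.setFormatter(JsonFormatter())
    handler.addFilter(DebugSampler())
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logger
```

A call such as `build_logger().error("payment failed", extra={"trace_id": "abc123"})` would then emit one JSON line carrying the trace ID.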
2. Metrics. Aggregated numerical measurements over time. Counts, rates, gauges, histograms.
Properties: low cardinality (a few labels, fixed dimensions), cheap to store and query (a 30-day timeseries is small), great for dashboards and alerting. Cannot tell you about a specific request — only about aggregate behaviour.
Tooling: Prometheus (the de-facto standard for cloud-native), Graphite, Datadog, CloudWatch Metrics. Metrics use a pull or push model; Prometheus pulls (scrapes) on a configurable interval, commonly somewhere between 15s and 60s, from /metrics endpoints that expose counters, gauges, and histograms in the Prometheus text or OpenMetrics format.
The USE method (Brendan Gregg) for infrastructure: every resource has Utilisation, Saturation, Errors. Track all three per resource (CPU, memory, disk, network). The RED method for services: every service has Rate, Errors, Duration. Track these per endpoint. With USE on hosts and RED on services, you can spot most problems.
A warning: high-cardinality labels are the silent killer of metrics systems. Don't tag metrics with user_id (millions of unique values) or request_id (billions). Use logs/traces for that level of detail.
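The reason is simple multiplication: in the worst case, a metric produces one time series per combination of label values. A small sketch of that arithmetic follows; the label names and counts are made-up examples.

```python
from math import prod
from typing import Dict


def series_count(label_cardinalities: Dict[str, int]) -> int:
    """Worst-case number of time series for one metric name."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a request-duration metric.
sane = {"endpoint": 50, "method": 4, "status": 6}
with_user_id = {**sane, "user_id": 1_000_000}

print(series_count(sane))          # 1200
print(series_count(with_user_id))  # 1200000000
```

Adding a single unbounded label turns a metric that fits comfortably in memory into one that very likely does not.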
3. Traces. Distributed request flow records. "This request hit the gateway, then service A, then service B, then the database. Total: 234ms. Service B took 180ms."
Properties: per-request (high cardinality, usually sampled), great for understanding flow and finding latency bottlenecks across service boundaries. OpenTelemetry is the vendor-neutral standard for instrumentation and covers traces, metrics, and logs.
Tooling: Jaeger, Zipkin, Tempo, Honeycomb, Datadog APM, AWS X-Ray. The pattern: each service instruments its incoming and outgoing calls, sends spans to a collector, the collector reconstructs full traces.
A trace is a tree of spans:
gateway (234ms) ───────────────────────────────────
├── service-A (210ms) ─────────────────────────
│   └── DB query (50ms)
├── service-B (180ms) ──────────────────────
│   ├── cache miss (10ms)
│   └── service-C (160ms) ────────────
│       └── external API (150ms) ← CULPRIT
└── service-D (15ms)
Sampling: keep ~1% of traces in production. Sample 100% in dev. Use tail-based sampling (decide after the trace completes) to always keep traces with errors or slow latency.
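The tail-based decision can be sketched as a single function that runs once all spans of a trace have arrived. This is an illustration only: the `Span` type, the 1000 ms "slow" threshold, and the use of the longest span as a stand-in for total trace duration are assumptions, not the behaviour of any particular collector.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Span:
    name: str
    duration_ms: float
    error: bool = False


def keep_trace(
    spans: Sequence[Span],
    slow_ms: float = 1000.0,
    base_rate: float = 0.01,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide after the trace completes whether to store it."""
    if any(s.error for s in spans):
        return True  # always keep traces that contain an error
    # The root span covers the whole request, so the longest span
    # approximates the end-to-end latency.
    if max((s.duration_ms for s in spans), default=0.0) >= slow_ms:
        return True  # always keep slow traces
    return rng() < base_rate  # otherwise keep roughly 1%
```

Passing `rng` as a parameter keeps the random part replaceable, which makes the decision logic easy to test.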
Putting them together. A typical investigation: alert fires (metrics) → look at affected endpoint dashboard (metrics) → drill into a slow trace (traces) → grep logs for the trace ID to see what each service was doing (logs). Each pillar earns its keep at a different step.
The correlation glue is a trace ID propagated through every layer — set at the entry point, attached to every log line, included in every downstream request header. Without it, you cannot stitch the three pillars together for a specific request.
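A minimal sketch of that propagation, using a context variable so the ID is available anywhere in the request's call stack. The header name `X-Trace-Id` and the function names are hypothetical; real deployments often use the W3C `traceparent` header through an OpenTelemetry SDK instead.

```python
import contextvars
import uuid
from typing import Dict, Optional

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for this sketch

_current_trace_id: contextvars.ContextVar = contextvars.ContextVar(
    "trace_id", default=None
)


def start_request(incoming_headers: Dict[str, str]) -> str:
    """Reuse the caller's trace ID, or mint one at the entry point."""
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _current_trace_id.set(trace_id)
    return trace_id


def outgoing_headers(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Headers for a downstream call, with the current trace ID attached."""
    headers = dict(extra or {})
    trace_id = _current_trace_id.get()
    if trace_id:
        headers[TRACE_HEADER] = trace_id
    return headers
```

The same context variable can feed the log formatter, so every log line written while handling the request carries the ID without passing it through every function signature.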
Chaos engineering
If you don't deliberately break your system in controlled experiments, production will break it for you at the worst possible time. Chaos engineering is the practice of injecting failures into a running system to verify it behaves as designed when components misbehave.
The insight came from Netflix in the early 2010s: their cloud-hosted infrastructure was supposed to be resilient, but the only way to know was to actually kill things and watch. They built Chaos Monkey — a service that randomly terminated EC2 instances during business hours. Engineers learned fast.
The modern discipline has matured into a small set of practices:
- Define steady state. What does "normal" look like? Pick measurable SLIs that should stay stable during the experiment.
- Hypothesise. "If we kill one of the three database replicas, error rate stays below 0.1%."
- Inject the failure in a controlled blast radius. Start small — one instance, one service, one region.
- Verify against the hypothesis. Did the SLIs stay green? If not, the system has an undocumented failure mode; fix it.
- Expand. Once small experiments succeed reliably, run bigger ones — entire zones, regions, service outages.
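The first four steps can be sketched as a small harness. Everything here is an assumption for illustration: the three callables stand in for whatever your monitoring and fault-injection tooling provides, and a real experiment would observe the system over a period of time rather than take a single reading.

```python
from typing import Callable, Dict


def run_experiment(
    measure_error_rate: Callable[[], float],
    inject_failure: Callable[[], None],
    rollback: Callable[[], None],
    max_error_rate: float = 0.001,
) -> Dict[str, object]:
    """Check steady state, inject a fault, verify the hypothesis, roll back."""
    baseline = measure_error_rate()
    if baseline >= max_error_rate:
        # Steady state is already violated; do not make things worse.
        return {"ran": False, "baseline": baseline,
                "observed": None, "hypothesis_held": None}

    inject_failure()
    try:
        observed = measure_error_rate()
    finally:
        rollback()  # always undo the fault, even if measurement fails

    return {"ran": True, "baseline": baseline, "observed": observed,
            "hypothesis_held": observed < max_error_rate}
```

The early return matters as much as the injection: an experiment that starts while the system is already unhealthy tells you nothing and adds risk.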
Failure types worth injecting:
- Instance termination. Kill random pods/instances. Tests autoscaling and graceful handling of lost workers.
- Network partition. Block traffic between zones or services. Tests failover and timeout behaviour.
- Latency injection. Add 500ms to specific service calls. Tests timeouts, circuit breakers, and downstream resilience.
- Resource exhaustion. Fill disks; pin CPU; eat memory. Tests autoscaling triggers and degradation behaviour.
- Dependency failure. Make a downstream API return errors or stop responding. Tests retries, fallbacks, and the dependency's actual criticality.
Tooling: Chaos Monkey and the broader Simian Army, Gremlin, LitmusChaos (Kubernetes-native), AWS Fault Injection Service, Azure Chaos Studio.
Game days are scheduled exercises where on-call engineers respond to injected incidents. The injection is a known fault, but the team practises the response — detection, communication, mitigation, postmortem. Most teams discover during game days that their runbooks are out of date, their dashboards have blind spots, or their on-call rotation has handoff gaps.
The pre-requisites for chaos engineering: solid observability (you can't watch for impact without it), a clear rollback path, and confidence that single-component failures won't cascade. Don't run chaos experiments before you have monitoring and basic resilience. Start with game days; graduate to automated chaos once the team has muscle memory.
Disaster recovery
Disaster recovery is the plan for when a large part of your infrastructure or data becomes unavailable: a datacentre fire, a cloud-provider regional outage, ransomware, or the accidental deletion of a critical table by a tired engineer at 2 AM. The two key acronyms are RTO and RPO.
- RTO — Recovery Time Objective. How quickly the service must be back up. "4 hours." "15 minutes."
- RPO — Recovery Point Objective. How much data you can afford to lose. "5 minutes worth." "Zero."
These drive every architectural decision about DR. Tight RTO + tight RPO = expensive multi-region active-active. Loose RTO + loose RPO = nightly backups to a different region, restored on demand.
Four standard DR strategies, in increasing cost and decreasing RTO:
1. Backup and restore. Periodic backups in another region. On disaster, provision new infra and restore. RTO: hours to days. RPO: hours (the gap between backups). Cheapest. Right for non-critical systems, and worth keeping as the bottom-tier safety net underneath every other strategy.
2. Pilot light. Critical infrastructure (DB, key services) replicated continuously to a second region in minimal form. Other services dormant. On disaster, scale up the dormant services. RTO: tens of minutes. RPO: seconds (replication lag).
3. Warm standby. A scaled-down but running copy of the entire system in a second region. Traffic flips on disaster; the standby scales up to absorb full load. RTO: minutes. RPO: seconds.
4. Active-active multi-region. Both regions serving live traffic. On region failure, traffic shifts entirely to the surviving region. RTO: seconds (DNS / global LB failover). RPO: near zero. Most expensive: 2× capacity, cross-region data sync complexity, more cognitive load on every architectural choice.
Strategy        RTO       RPO       Cost
Backup          hours     hours     $
Pilot light     ~30 min   seconds   $$
Warm standby    ~5 min    seconds   $$$
Active-active   <1 min    ~zero     $$$$
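One way to read the table is as a lookup: walk the strategies from cheapest to most expensive and stop at the first one that meets both objectives. The sketch below does that. The numeric RTO/RPO values are illustrative assumptions that put rough numbers on the table's "hours" and "seconds"; they are not measured figures and should be replaced with your own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_s: float  # assumed achievable recovery time, in seconds
    rpo_s: float  # assumed achievable data-loss window, in seconds


# Ordered cheapest first. All numbers are illustrative assumptions.
STRATEGIES = [
    Strategy("backup and restore", 24 * 3600, 24 * 3600),
    Strategy("pilot light", 30 * 60, 60),
    Strategy("warm standby", 5 * 60, 60),
    Strategy("active-active", 60, 1),
]


def cheapest_strategy(required_rto_s: float,
                      required_rpo_s: float) -> Optional[Strategy]:
    """Return the cheapest strategy meeting both objectives, if any."""
    for strategy in STRATEGIES:
        if strategy.rto_s <= required_rto_s and strategy.rpo_s <= required_rpo_s:
            return strategy
    return None  # requirements tighter than anything listed here
```

A `None` result is a useful signal in its own right: it usually means the stated objectives need renegotiating rather than that a more exotic architecture is required.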
The replication problem. Active-active across regions means cross-region writes. Cross-region round trips (often tens to a couple of hundred milliseconds, depending on distance) make synchronous quorums slow; asynchronous replication risks data loss on failover. Many active-active systems therefore pick one region as the primary for writes (asymmetric active-active), or partition data by region (each region owns specific users).
The most important DR practice is testing. A DR plan that hasn't been exercised in a year does not work; it just hasn't been disproven. Run a regional failover quarterly. Time it. Find what's missing — DNS TTLs you forgot to lower, IAM roles that don't exist in the secondary region, deployment pipelines that hard-coded the primary's account.
Backup hygiene. Backups in the same region as the primary DB are useless when the region goes. Backups in the same account are vulnerable to credential compromise. Apply: cross-region, cross-account, point-in-time recovery, periodic restore tests. The restore is the actual backup; the backup that has never been restored is theatre.
Authentication and authorisation
Two related but distinct concepts that get conflated constantly. Authentication (AuthN) answers "who are you?" Authorisation (AuthZ) answers "what are you allowed to do?"
The wrong pattern: "I'm logged in, so I can do anything." The right pattern: authentication establishes identity; every operation checks authorisation against that identity.
Authentication mechanisms.
- Password + something. Username and password establishes initial identity; a session cookie or token represents the authenticated session. "Something" should be a second factor — TOTP code, hardware key, push notification. SMS as second factor is widely deployed and widely broken (SIM swap attacks).
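- OAuth 2.0. Delegated authorisation. The user grants your app scoped access to their resources at another provider, and the provider issues an access token; your app never sees the user's password. On its own, OAuth 2.0 standardises nothing about who the user is, which is why "Sign in with Google" relies on the OIDC layer described next.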
- OIDC (OpenID Connect). OAuth 2.0 plus standardised identity claims (email, name, profile). What "Sign in with X" actually uses today.
- SAML. Enterprise SSO standard. XML-based, ugly, ubiquitous in corporate IT. Talk to your identity provider once; downstream apps trust the assertion.
- API keys. Long-lived secret strings for machine-to-machine auth. Simple. Rotate them; scope them; never embed in client code.
- mTLS. Mutual TLS — both sides present certificates. Service-to-service auth in security-conscious environments. Service meshes (Istio, Linkerd) do this automatically.
Session representation. Two camps:
- Stateful sessions. Server stores session state; cookie or header carries an opaque session ID. Pro: revoke instantly (delete the session row). Con: requires a session store accessible to all backend instances.
- JWT (JSON Web Tokens). Self-contained signed tokens. Server verifies signature on each request without a database lookup. Pro: stateless backend. Con: cannot revoke a JWT — you can only refuse it via a blocklist (which is stateful again) or wait for expiry.
For most apps, short-lived JWTs (15 minutes) plus a long-lived refresh token gives you the best of both: fast verification, bounded blast radius if a token leaks. Logout invalidates the refresh token; access tokens expire naturally.
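To make the access-token half concrete, here is a rough standard-library sketch of issuing and verifying an HS256-signed JWT with an expiry. It is for illustration only: the function names are made up, refresh tokens are not shown, and production code should normally use a maintained JWT library rather than hand-rolled verification.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> str:
    digest = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return _b64url(digest)


def issue_token(secret: bytes, subject: str, ttl_s: int = 900,
                now: Optional[int] = None) -> str:
    """Issue a token that expires ttl_s seconds from now (default 15 min)."""
    now = int(time.time()) if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({"sub": subject, "exp": now + ttl_s}).encode())
    return f"{header}.{payload}.{_sign(secret, header + '.' + payload)}"


def verify_token(secret: bytes, token: str,
                 now: Optional[int] = None) -> Optional[dict]:
    """Return the claims if the signature is valid and unexpired, else None."""
    now = int(time.time()) if now is None else now
    parts = token.split(".")
    if len(parts) != 3:
        return None
    header_b64, payload_b64, sig_b64 = parts
    # Always verify with HS256; never trust the algorithm named in the header.
    expected = _sign(secret, header_b64 + "." + payload_b64)
    if not hmac.compare_digest(expected.encode(), sig_b64.encode("utf-8")):
        return None
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp", 0) <= now:
        return None
    return claims
```

Note that verification needs only the secret and the clock, with no database lookup, which is exactly the property that makes revocation before expiry impossible without extra state.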
Authorisation models.
- RBAC (Role-Based Access Control). Users have roles; roles have permissions. Simple, widely understood, easy to manage. Works well until you need fine-grained per-resource permissions.
- ABAC (Attribute-Based). Decisions based on attributes of user, resource, action, environment. "User can edit a document if they own it OR they're in the same team AND it's during work hours." Flexible, can encode complex policies.
- ReBAC (Relationship-Based). Permissions derived from relationships in a graph. Google's Zanzibar paper formalised this; SpiceDB and OpenFGA implement it. Right model for systems with deep object hierarchies (Drive, GitHub).
The practical rule: start with RBAC. Add attribute checks for the cases that don't fit. Reach for ReBAC only when you have explicit hierarchical/relational permissions (folders, organisations, sharing).
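A small sketch of "RBAC first, attribute checks where needed". The role names, permission strings, and the `User` and `Document` types are hypothetical; the attribute rule mirrors the "owns it or same team" example from the ABAC bullet.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Role -> permissions. Names are made up for this sketch.
ROLE_PERMISSIONS = {
    "viewer": {"doc:read"},
    "editor": {"doc:read", "doc:edit"},
    "admin": {"doc:read", "doc:edit", "doc:delete"},
}


@dataclass(frozen=True)
class User:
    id: str
    roles: FrozenSet[str]
    team: str


@dataclass(frozen=True)
class Document:
    owner_id: str
    team: str


def has_permission(user: User, permission: str) -> bool:
    """Plain RBAC: does any of the user's roles grant the permission?"""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in user.roles)


def can_edit(user: User, doc: Document) -> bool:
    """RBAC gate first, then an attribute check on this specific resource."""
    if not has_permission(user, "doc:edit"):
        return False
    return doc.owner_id == user.id or doc.team == user.team
```

The role check answers "may this kind of user ever edit documents?", and the attribute check answers "may this user edit this document?". Keeping them as separate steps makes it easier to swap the second one for a relationship lookup later.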
Common AuthZ failures.
- IDOR (Insecure Direct Object Reference). /orders/123 returns order 123, no matter who's asking. Always check that the requesting user owns/can-access the resource.
- Missing function-level checks. The admin endpoint is hidden in the UI but still callable via direct API. Check authorisation on every endpoint, not just the UI's protected routes.
- Privilege escalation via mass assignment. User PATCH endpoint accepts {role: "admin"} and writes it to the DB. Whitelist updatable fields.
- Token leakage via referer / logs / URLs. Tokens in URL query params end up in browser history, server logs, and HTTP referers. Always in headers or cookies.
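Two of these failures, IDOR and mass assignment, have fixes short enough to sketch. The in-memory `ORDERS` store, the field names, and both function names are hypothetical stand-ins for a real handler and database.

```python
from typing import Dict, Optional

# Hypothetical in-memory store standing in for a database table.
ORDERS: Dict[int, dict] = {
    123: {"id": 123, "owner_id": "alice", "total": 40},
}

# Only these fields may be changed through the profile PATCH endpoint.
UPDATABLE_USER_FIELDS = {"display_name", "email"}


def get_order(current_user_id: str, order_id: int) -> Optional[dict]:
    """Return the order only if the caller owns it.

    Missing and not-owned both return None, so the response does not
    reveal whether someone else's order exists.
    """
    order = ORDERS.get(order_id)
    if order is None or order["owner_id"] != current_user_id:
        return None
    return order


def apply_profile_patch(user: dict, patch: dict) -> dict:
    """Apply only allowlisted fields; anything else (e.g. role) is ignored."""
    safe = {k: v for k, v in patch.items() if k in UPDATABLE_USER_FIELDS}
    return {**user, **safe}
```

Whether to silently ignore or explicitly reject unknown fields is a design choice; rejecting with a 400 surfaces client bugs sooner, while ignoring is more forgiving of older clients.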
API security
The web's hostile environment plus the API's machine-callability create a particular set of failure modes. The OWASP API Security Top 10 is the canonical list; the concerns repeat across nearly every API breach in the news.
Defence-in-depth at the edge.
- TLS everywhere. No exceptions. HSTS header on all responses.
- WAF (Web Application Firewall). Block known attack patterns (SQL injection signatures, traversal, common bot UAs). Cloudflare, AWS WAF, GCP Cloud Armor.
- DDoS protection. Edge-level rate limiting absorbs floods. Same providers as WAF.
- Geo / IP blocking for sensitive operations from regions you don't operate in.
Input validation. Every untrusted input is hostile until proven otherwise.
- Schema validation. JSON schema, OpenAPI validation at the gateway. Reject malformed requests before they reach handlers.
- Sanitisation for context. Output going into HTML gets HTML-escaped. Output going into SQL uses parameterised queries (never string concatenation). Output going into shell commands gets argument-array form.
- Limits on everything. Max request body size, max array length, max string length, max file size, max query complexity (for GraphQL). Without these, a single bad request can DoS the server.
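A hand-written sketch of what such checks look like for one hypothetical endpoint. The request shape (`name`, `items`) and every limit value are assumptions; in practice a JSON Schema or OpenAPI validator at the gateway would usually express the same rules declaratively.

```python
import json

# Illustrative limits; pick values that fit your own endpoint.
MAX_BODY_BYTES = 64 * 1024
MAX_ITEMS = 100
MAX_NAME_LEN = 200


def parse_order_request(raw: bytes) -> dict:
    """Validate size, shape, and ranges; raise ValueError on anything off."""
    if len(raw) > MAX_BODY_BYTES:
        raise ValueError("body too large")
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("body must be a JSON object")

    name = data.get("name")
    if not isinstance(name, str) or not 1 <= len(name) <= MAX_NAME_LEN:
        raise ValueError("name must be a non-empty string within the limit")

    items = data.get("items")
    if not isinstance(items, list) or len(items) > MAX_ITEMS:
        raise ValueError("items must be a list within the limit")
    for item in items:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(item, bool) or not isinstance(item, int) or item <= 0:
            raise ValueError("items must be positive integers")

    # Return only the validated fields, dropping anything unexpected.
    return {"name": name, "items": items}


def is_acceptable(raw: bytes) -> bool:
    """What a gateway would turn into 200 vs 400."""
    try:
        parse_order_request(raw)
        return True
    except ValueError:
        return False
```

The size check comes before parsing on purpose: the cheapest rejection is the one that never hands an oversized body to the JSON parser.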
SQL injection. Still one of the most common high-severity vulnerabilities. The fix is universal and simple: parameterised queries. Every mainstream database driver supports them.
BAD (concat): "SELECT * FROM users WHERE id = " + user_input
GOOD (param): "SELECT * FROM users WHERE id = ?" with [user_input]
ORMs do this automatically if used correctly. The exception is raw SQL strings — never build them with concatenation.
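The difference is easy to demonstrate with Python's built-in `sqlite3` module and an in-memory database. The table, rows, and function names below are made up for the demonstration; the unsafe function exists only to show what goes wrong.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [("u1", "alice"), ("u2", "bob")])
    return conn


def find_user_unsafe(conn: sqlite3.Connection, user_input: str) -> list:
    # DO NOT DO THIS: the input becomes part of the SQL text.
    query = "SELECT id, name FROM users WHERE id = " + user_input
    return conn.execute(query).fetchall()


def find_user(conn: sqlite3.Connection, user_input: str) -> list:
    # The driver sends the value separately; it is never parsed as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE id = ?", (user_input,)
    ).fetchall()


conn = make_db()
payload = "'' OR 1=1"
print(len(find_user_unsafe(conn, payload)))  # 2: every row leaks
print(find_user(conn, payload))              # []: treated as a literal id
print(find_user(conn, "u1"))                 # [('u1', 'alice')]
```

With the parameterised version, the payload is compared against the `id` column as an ordinary string, so it matches nothing.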
XSS (Cross-Site Scripting). User-controlled content rendered as HTML without escaping. Attacker submits <script>steal(cookie)</script> as a comment; victim views the comment; their browser runs the script.
Defences: escape on output (every templating engine does this if you use it correctly), a Content-Security-Policy header limiting where scripts can come from, and HttpOnly cookies so an injected script cannot read the session cookie.
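Escaping on output is a one-liner with Python's standard `html` module. The `render_comment` function and its markup are hypothetical; a real application would normally rely on its templating engine's auto-escaping rather than build HTML by hand.

```python
import html


def render_comment(comment: str) -> str:
    """Embed untrusted text in HTML with markup characters escaped."""
    return '<p class="comment">' + html.escape(comment) + "</p>"


print(render_comment("<script>steal(cookie)</script>"))
# <p class="comment">&lt;script&gt;steal(cookie)&lt;/script&gt;</p>
```

The browser now displays the attacker's text as text instead of executing it.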
CSRF (Cross-Site Request Forgery). Attacker tricks the user's browser into making an authenticated request to your site. Defences: CSRF tokens on state-changing requests, SameSite cookies, CORS configured tightly.
Secrets management.
- Never in source code. Even in a private repo. Mistakes happen.
- Never in client code. Anything in JavaScript or a mobile app binary is public.
- Vault, AWS Secrets Manager, GCP Secret Manager. Centralised, audited, rotateable.
- Per-environment secrets. Production secrets never appear in dev/staging.
- Rotate regularly. Especially after personnel changes.
Logging — what NOT to log.
- Passwords (even hashed — they shouldn't traverse your code path at all after authentication).
- Card numbers, CVV.
- Personal data unless absolutely necessary (regulatory exposure).
- Full request/response bodies for sensitive endpoints (login, payments).
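One common way to enforce this list is to redact known-sensitive keys before anything reaches the logger. The sketch below is an assumption-laden illustration: the key names in `SENSITIVE_KEYS` are examples, and key-based redaction will not catch secrets embedded in free-text values.

```python
from typing import Any

# Example key names; extend to match your own payloads.
SENSITIVE_KEYS = {"password", "card_number", "cvv", "authorization"}


def redact(value: Any) -> Any:
    """Return a copy with sensitive fields masked, recursing into containers."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS
            else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

Because the function returns a copy, the original request data stays intact for the handler while only the masked version is passed to the logging call.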
A single all-encompassing rule covers most of this: trust nothing from the network, validate everything, give the smallest privilege that works, log what you need and not what you don't. It is boring. It is what stops you ending up in the news.
This closes the High-Level System Design series. Across ten modules we've covered the vocabulary, the storage choices, the caching layers, the network plumbing, the asynchronous backbone, the architectural patterns, the distributed-systems primitives, the canonical designs, the supplementary topics, and the operational disciplines. The skill from here is practice — read postmortems, design systems on paper, ship things and watch them break, then read this series again and notice which parts you used and which parts you didn't. Architecture is a craft that gets sharper with every system you build.