Architecture Patterns
Microservices vs monolith, circuit breaker, rate limiting, retry and backoff, saga, API patterns, bulkhead and timeout — the named patterns that recur in every real system.
Microservices vs monolith
The most-debated architectural question of the last decade also has the most-misunderstood answer. The honest version is: neither one is right by default, and the choice depends on team size, deploy independence, and the cost of being wrong.
A monolith is a single deployable unit. One codebase, one build, one process (replicated for scale). All the business logic lives in one place; modules call each other in-process. PHP'd version of the early Facebook, Basecamp's Rails monolith, Shopify's monolith — these handle enormous load. "Monolith" is not synonymous with "badly architected."
Microservices is one deployable unit per bounded context — a coherent slice of business logic with its own data and its own team. Each service has its own database, its own deploy pipeline, its own runtime. Services talk over the network (HTTP, gRPC, queues).
Monolith Microservices
──────── ─────────────
┌─────────────┐ ┌──────┐┌──────┐┌──────┐
│ Orders │ │Orders││Users ││Cart │
│ Users │ └──────┘└──────┘└──────┘
│ Cart │ ┌──────┐┌──────┐┌──────┐
│ Payments │ │Pay ││Ship ││Inv │
│ Shipping │ └──────┘└──────┘└──────┘
│ Inventory │ (each its own DB, deploy, team)
└─────────────┘
one DB, one deploy
The microservices pitch is independence: teams ship without coordinating, each service scales separately, a fault in one is contained. The hidden costs are also real:
- Network calls replace function calls. What was a 100-ns method call becomes a 1-ms HTTP call. Aggregating across a chain of 5 services, even local, adds up. Latency budgets matter.
- Distributed-systems failures appear everywhere. Partial failures, retries, timeouts, eventual consistency — every cross-service call brings them in.
- Operational overhead multiplies. Ten services means ten CI pipelines, ten dashboards, ten on-call rotations (or one rotation that must understand all ten).
- Data consistency across services is hard. A transaction that used to be
BEGIN ... COMMITis now a saga (covered later in this module).
A pragmatic decision framework:
- <10 engineers, single product: monolith. Modularise inside the codebase. Resist microservices on "future-proofing" grounds.
- 10-50 engineers, multiple product areas: modular monolith. Strict internal module boundaries; one deploy; pull out a service if a module truly needs independent scale.
- 50+ engineers: microservices probably make sense, but along team boundaries (Conway's Law) and bounded contexts (DDD). Not along arbitrary lines like "every entity gets a service."
Watch for the failure mode of premature decomposition: a small team with five microservices and four engineers spends most of their time on cross-service plumbing instead of product. Modular monolith first, microservices when you have evidence of organisational pain is a defensible default.
Circuit breaker
When a downstream service is dying, calling it more does not help. Each call you make is more load on the dying service AND more pending requests piled up in your service, eventually exhausting your own thread pool or memory. The cascading failure pattern is among the most common ways a distributed system goes from "one slow service" to "site down."
A circuit breaker sits between your code and the downstream call. It tracks recent failures. When failures pass a threshold, the breaker opens — and all calls fail fast for some cooldown period, without ever touching the downstream service.
States:
CLOSED → calls pass through normally; failures counted
│
│ (failure threshold exceeded)
▼
OPEN → calls fail immediately for cooldown period
│
│ (cooldown elapsed)
▼
HALF-OPEN → allow a few test calls
│
├── success → back to CLOSED
└── failure → back to OPEN
The payoff:
- The failing service gets time to recover instead of being hammered.
- Your service degrades gracefully — failures are fast and predictable rather than slow and resource-exhausting.
- The blast radius is bounded. One bad downstream doesn't take down your whole pod.
Real-world implementations: Hystrix (Netflix, now in maintenance), resilience4j (the modern Java standard), Polly (.NET), Envoy's outlier detection (mesh-level), opossum (Node.js). Most service meshes implement circuit breaking at the proxy level — no library required.
Tuning the breaker matters and is not obvious:
- Failure threshold. "5 failures in 10 seconds" or "50% error rate over 20 calls" — be percentage-based, not count-based, so light traffic doesn't trip the breaker on a single failure.
- Cooldown duration. Long enough that the service can actually recover (30-60 seconds typical), short enough that you're not failing forever after a transient blip.
- Half-open behaviour. Allow 1-3 trial calls — not the full traffic flow — to avoid stampeding a barely-recovered service.
The pattern pairs naturally with fallbacks. When the breaker is open, return a sensible default instead of an error if possible — cached data, an empty list, a "degraded mode" flag. The user sees the site working, possibly with reduced functionality, instead of broken.
Rate limiting
Rate limiting bounds the number of requests a client can make in a time window. It is a hygiene tool — it protects your infrastructure from misbehaving clients, paying customers from each other, and your bill from runaway processes.
The four canonical algorithms:
1. Fixed window. Counter resets at the start of each window. "100 requests per minute, reset on the minute." Simple to implement (an INCR with a TTL in Redis). Suffers from edge bursts: a client can make 100 requests at 12:00:59 and another 100 at 12:01:00, for 200 requests in 2 seconds.
2. Sliding window log. Store the timestamp of every request. To check the limit, count requests within the last minute. Accurate but expensive — memory is proportional to request rate.
3. Sliding window counter. Hybrid: store a counter per fixed window AND interpolate based on how much of the current window has elapsed. Approximate but cheap (two counters per client). Most production rate limiters use this.
4. Token bucket. A bucket holds N tokens; tokens refill at rate R per second; each request consumes one. If the bucket is empty, the request is rejected (or queued). Smooth, allows bursts up to the bucket size, the canonical algorithm in network gear and API gateways.
Token bucket: bucket size 10, refill 2 tokens/sec
t=0s: bucket=10 → burst of 10 requests allowed instantly
t=1s: bucket=2 → request 11 rejected (bucket empty after burst)
t=2s: bucket=4 → 4 requests allowed
t=10s: bucket=10 → refilled to max
Where to enforce the limit matters:
- At the edge / API gateway. First line of defence. Reject before reaching backend. Cheapest per request.
- Per service. Defends against internal callers (one buggy service spinning) and against gateway bypass.
- Per database/connection pool. The hardest layer to overload; sometimes you bound concurrent connections (semaphore) rather than RPS.
What to key on: API key (for paid plans), user ID (for per-user fairness), IP address (for unauthenticated traffic — but careful with NAT and corporate gateways), client device ID, or some combination.
Response when limited. Return 429 Too Many Requests with a Retry-After header indicating how many seconds to wait. Many client libraries will respect this automatically.
A distributed setup needs a shared counter — usually Redis. The pattern INCR rate:user:42:minute plus EXPIRE rate:user:42:minute 60 is the entire rate limiter for fixed window. Token bucket needs a few more operations but is still a small Lua script in Redis.
Don't overlook adaptive rate limits: limits that tighten under load. When your latency p99 spikes, drop everyone's limits by 50% temporarily. This is what protects you from a traffic spike you didn't capacity-plan for.
Retry with backoff
Network calls fail transiently — a packet drops, a process restarts, a database briefly pauses for a checkpoint. Most of the time, retrying succeeds. But the way you retry decides whether you fix a problem or make it worse.
Three retry rules that always apply:
- Only retry idempotent operations — or operations made idempotent via an idempotency key. Retrying a non-idempotent "charge $50" without a key can charge twice.
- Use exponential backoff. Wait longer after each failure: 100 ms, 200 ms, 400 ms, 800 ms... Without backoff, you flood the downstream service when it's already struggling.
- Add jitter. A pure exponential backoff causes synchronised retries — every failed client retries in lock-step. Add randomness:
sleep_ms = base * 2^attempt * random(0.5, 1.5). This is the "thundering herd" defence from the cache module, applied to retries.
Without jitter (every client lined up)
──────────────────────────────────────
t=0 ALL fail
t=100ms ALL retry (storm)
t=200ms ALL retry (storm)
t=400ms ALL retry (storm)
With jitter (smeared)
─────────────────────
t=0 ALL fail
t=50-150ms retries trickle in
t=120-280ms trickle
t=300-500ms trickle
When NOT to retry:
- 4xx errors (except 429). A 400 "Bad Request" is not going to become valid on retry; you'll just fail forever.
- Past a maximum. Cap retries at 3-5 attempts. Beyond that, you are spending more on retries than on real work.
- When circuit breaker is open. The breaker exists to prevent retries; don't fight it.
- Inside a request handler with a tight deadline. Retrying takes time. If the user's request has a 1-second budget and the first call took 800 ms, a 500-ms retry won't help.
Retry budgets: Google's SRE book popularised the idea of a global retry-rate cap. If retries are more than 10% of total calls, something is wrong; instead of retrying more, you back off everywhere. This prevents the failure pattern where the system spends most of its capacity retrying.
Idempotency keys make retries safe even for state-changing operations. The client generates a unique key per logical operation and sends it on every retry. The server records the key + result; on a retry of the same key, return the stored result instead of executing again. Stripe's API is the canonical reference for this design — every payment request takes an Idempotency-Key header.
Saga pattern
A traditional database transaction gives you all-or-nothing semantics: every write commits or none do. In a microservices architecture, the writes span multiple services and multiple databases — a single ACID transaction is impossible. You need a different shape.
A saga is a long-running business transaction implemented as a sequence of local transactions, each in one service. If any step fails, compensating transactions run for the previously-completed steps to roll back what they did.
The canonical example is placing an order:
Forward path:
1. Reserve inventory (Inventory service)
2. Charge payment (Payment service)
3. Create shipment (Shipping service)
4. Send confirmation (Notification service)
If step 3 fails:
↓
Compensate:
1. Refund payment (Payment service)
2. Release inventory (Inventory service)
Key properties:
- Each step is a local ACID transaction. Within a service, you still have full database guarantees.
- Compensations are not the inverse of the original action. A bank doesn't "un-charge" — it issues a refund. An order isn't "un-shipped" — it triggers a return flow. Design the compensation as a real business operation.
- Compensations can also fail. Refunds can fail. Inventory release can fail. You need durable retry of compensations and, in rare cases, human escalation.
Two styles of saga orchestration:
1. Choreography. Each service listens to events and decides whether to act. No central coordinator. Inventory service consumes OrderPlaced, reserves stock, emits InventoryReserved. Payment service consumes that, charges, emits PaymentCharged. And so on. Distributed; no single point of control; emergent.
Pros: services stay decoupled. Cons: hard to reason about the full flow; visualising what's happening for a stuck order requires hunting through logs in multiple services.
2. Orchestration. A dedicated orchestrator service drives the saga step by step. It calls inventory, calls payment, calls shipping, deciding what's next based on each response. Centralised; explicit; auditable.
Pros: the flow is in one place; failures are obvious. Cons: the orchestrator can become a god service; it's a coupling point.
Real systems mix both — orchestration for the critical happy-path workflows (orders, payments) and choreography for fan-out side effects (analytics, recommendations).
Tooling: Temporal, AWS Step Functions, Camunda, Cadence are workflow engines designed for orchestrated sagas. They handle persistence, retries, timeouts, and human-step integrations. For simpler cases, a saga can be implemented with a state machine in a database + a worker that polls and advances it.
The critical mindset shift: sagas give you eventual consistency, not atomicity. There are moments when inventory is reserved but payment is not yet charged. Your business needs to be okay with these intermediate states — or you need to design the steps so the intermediate state is at least not user-visible.
API design patterns
The way services expose themselves to each other shapes everything that touches them. Three patterns dominate today, and each one has a place.
REST. Resources identified by URL, operations identified by HTTP verb (GET, POST, PUT, PATCH, DELETE), data exchanged as JSON. The default for public web APIs. Strengths: discoverable, cacheable via HTTP semantics, every client and tool understands it, easy to debug with curl. Weaknesses: over- or under-fetching is common (the client wants 3 fields from a /users endpoint that returns 30), N+1 problems when joining related resources.
Good REST follows a few rules: nouns for resources, verbs as HTTP methods, status codes that mean what they say (200 success, 201 created, 400 bad request, 404 not found, 409 conflict, 500 server error). Versioning typically via the URL path (/v1/users) or a header (Accept: application/vnd.api.v2+json). Pagination via cursor (better) or page/offset (simpler but breaks under writes).
GraphQL. A query language where the client specifies exactly the fields it wants, possibly across multiple resources, in one request. The server resolves the query against multiple data sources. Strengths: no over-fetching, no N+1 at the network layer (the server does the joining), evolves without versioning (just add fields). Weaknesses: hard to cache at the HTTP layer (every query is unique); query complexity can be a DoS vector if unbounded; a single bad resolver makes a query slow.
GraphQL shines when: many client types need different shapes of the same data, mobile apps where bandwidth matters, dashboards with deeply nested queries. It is overkill for: simple CRUD APIs with one client.
gRPC. A binary, HTTP/2-based RPC protocol with strongly typed schemas (Protocol Buffers). Strengths: small payloads, streaming in both directions, generated client libraries in every major language, contract enforcement at compile time. Weaknesses: not directly callable from browsers (you need a proxy like grpc-web); harder to debug; JSON-vs-binary mental shift.
gRPC is the default for service-to-service inside a cluster. The performance and type-safety benefits compound when you have many services calling each other.
A few cross-cutting design rules that matter regardless of protocol:
- Pagination, always. Any list endpoint that could return many results must paginate. Cursor-based is more correct under writes than offset-based.
- Idempotency. State-changing endpoints should accept an idempotency key.
- Filtering and sorting via query parameters, not via custom endpoints.
/orders?status=shipped&sort=-created_atis better than/orders/shipped-by-newest. - Errors are structured. Don't just return a 400 with an HTML page. Return a JSON body with an error code, a message for humans, and a machine-readable field path for validation failures.
- Document with OpenAPI / Protobuf schemas. A contract that humans and machines both read keeps clients and servers honest. Auto-generate clients from it where possible.
Bulkhead and timeout patterns
Two smaller patterns that pair with everything else in this module and remove a class of failure that otherwise haunts every distributed system.
Bulkhead comes from shipbuilding. A ship's hull is divided into watertight compartments; if one floods, the others stay dry, and the ship floats. Applied to software: partition resources (thread pools, connection pools, memory) so a failure in one part can't drain shared resources.
The classic failure without bulkheads: your service has a 200-thread Tomcat pool. Downstream service B becomes slow. Requests to B pile up, each holding a thread. Within minutes all 200 threads are blocked on B. Now requests to other downstreams — completely healthy ones — also queue up because there are no free threads. You have a full outage caused by one slow dependency.
With bulkheads: a separate thread pool of, say, 20 threads for calls to B. If B is slow, those 20 threads pile up. The other 180 threads are still serving other work. The blast radius is contained.
No bulkhead With bulkhead
─────────── ─────────────
┌─────────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ shared pool 200 │ │ B: 20 │ │ C: 50 │ │ D: 30 │
│ all calls │ └──────────┘ └──────────┘ └──────────┘
└─────────────────┘ (B slow → only B's pool exhausts)
(one slow → all blocked)
The principle generalises beyond thread pools — separate connection pools per downstream, separate Redis clients, separate Kubernetes namespaces for blast-radius isolation.
Timeout is the most boring and most important pattern in distributed systems. Every network call must have a timeout. No exceptions.
The default HTTP client behaviour in many languages is "wait forever." If the downstream never responds, your call thread is gone until the TCP keepalive eventually decides the connection is dead — which can take minutes.
A few non-obvious points:
- Timeouts must be shorter than the caller's timeout. If your handler must respond in 1 second and you call downstream B with a 5-second timeout, the user has already given up by the time B's timeout fires.
- Set both a connect timeout and a read timeout. Connect = how long to establish the TCP connection. Read = how long to wait for data once connected. Most clients have separate config for each.
- Timeout values come from latency measurements, not guesses. If B's p99 is 80 ms, set the timeout to 200-300 ms — long enough that healthy calls succeed, short enough that unhealthy calls fail fast.
- Propagate deadlines, not just timeouts. gRPC's context carries a deadline ("this request must complete by absolute time T"). Every downstream call uses the remaining time, not the original budget. This prevents the cascade of "each hop allowed its full timeout" eating into the user's budget.
Timeouts + retries + circuit breaker + bulkhead are the four patterns that together turn a brittle distributed system into a resilient one. Each one solves a different failure mode. Most outages I have ever seen could have been prevented or contained by at least one of them being in place. The next module digs into the distributed-systems primitives — consensus, locks, quorums, clocks — that some of these patterns rely on under the hood.
⁂ Back to all modules