Distributed Systems & CAP
The 8 fallacies, CAP theorem, eventual consistency — what changes when you add a second computer.
What Makes Something 'Distributed'
A distributed system is any system that runs across more than one computer. The moment you add a second server, a load balancer, a separate database — even just a cache — you've crossed into distributed territory.
This sounds harmless. It isn't. The instant your system spans more than one machine, you inherit a long list of problems that don't exist on a single computer:
- A network call can succeed but the response can be lost
- Two nodes can each think they're the leader at the same time
- Time on different machines drifts (your "Tuesday at 3 PM" is not the same as another server's)
- A machine can be alive but unresponsive — neither failed nor working
- Data written to one node may not be visible on another for some milliseconds, seconds, or longer
The single most useful skill in modern backend engineering is recognizing that you're crossing into distributed territory and adjusting your assumptions accordingly. Most production bugs aren't bugs in your code — they're bugs in your mental model of the system.
The 8 Fallacies of Distributed Computing
In the 1990s, engineers at Sun Microsystems — the list is usually attributed to L Peter Deutsch, with the eighth fallacy added later by James Gosling — wrote down the assumptions developers reflexively make about networks, and how every single one is wrong.
1. The network is reliable ✗ Packets drop. Connections die.
2. Latency is zero ✗ Every call takes time.
3. Bandwidth is infinite ✗ Networks have finite capacity.
4. The network is secure ✗ Assume eavesdroppers.
5. Topology doesn't change ✗ Servers come and go.
6. There is one administrator ✗ Multiple teams, multiple clouds.
7. Transport cost is zero ✗ Bytes cost money.
8. The network is homogeneous ✗ Different versions, different protocols.
Every distributed system bug you'll ever debug is some flavor of "I assumed one of these but actually it wasn't true."
A simple example: in a single process, calling createUser(alice) either succeeds or throws — those are the only outcomes. Across a network, there's a third possibility: the request succeeds on the server, but you never hear back. Did the user get created? You don't know. This third state — "succeeded but you don't know it" — is the source of an enormous fraction of distributed bugs.
The cure isn't to memorize these fallacies. It's to develop the reflex of asking "what happens if this network call disappears halfway?" before writing any code that crosses a network boundary.
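The standard answer to that question is idempotency: make the operation safe to retry, so "I don't know if it worked" can be resolved by simply sending it again. Here's a minimal Python sketch (the `Server`, `create_user`, and idempotency-key names are illustrative, not from any real library) where the client reuses one key across retries and the server deduplicates on it:

```python
import uuid

class Server:
    """Toy server that deduplicates writes by idempotency key."""
    def __init__(self):
        self.users = {}
        self.seen_keys = set()

    def create_user(self, idempotency_key, name):
        # Apply the write at most once, even if the request is retried.
        if idempotency_key not in self.seen_keys:
            self.seen_keys.add(idempotency_key)
            self.users[len(self.users) + 1] = name
        return "ok"

def create_user_with_retry(server, name, drop_first_response=True):
    """Client side: retry on a 'lost response', reusing ONE idempotency key."""
    key = str(uuid.uuid4())  # one key for the whole logical operation
    for attempt in range(3):
        result = server.create_user(key, name)  # server-side effect happens here
        if drop_first_response and attempt == 0:
            continue  # simulate: the response was lost in the network
        return result

server = Server()
create_user_with_retry(server, "alice")
print(len(server.users))  # → 1: the client retried, but the write applied exactly once
```

Without the key, the retry would create a second user — the classic double-charge / double-post bug born from that third state.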
CAP Theorem — Pick Two of Three
The CAP theorem is the one piece of distributed-systems theory every backend engineer should know by heart. Eric Brewer proposed it as a conjecture in 2000; Seth Gilbert and Nancy Lynch proved it in 2002.
A distributed system can guarantee at most two of these three properties at the same time:
- Consistency (C) — every read sees the most recent write, no matter which node you ask
- Availability (A) — every request gets a response (even if the data might be stale)
- Partition tolerance (P) — the system keeps working even when the network is broken between nodes
The crucial insight: in any real distributed system, network partitions WILL happen. Cables get cut, switches fail, cloud regions hiccup. So P isn't really optional — you must tolerate partitions. Which means the actual choice is between C and A *during a partition*.
┌────────────┐
│ Network │
│ Partition │
└─────┬──────┘
│
┌───────────┴────────────┐
│ │
▼ ▼
Pick Consistency Pick Availability
(CP system) (AP system)
│ │
▼ ▼
Refuse to answer Return possibly stale
if not 100% sure data, reconcile later
│ │
▼ ▼
PostgreSQL with Cassandra, DynamoDB,
single primary, Couchbase, most caches
most banks
Real-world examples of the choice:
- A banking system disagreeing with itself about your balance is unacceptable. Pick C, refuse to serve reads on a partitioned node.
- A social media feed showing slightly old likes for 5 seconds is acceptable. Pick A, serve what you have.
- Amazon's shopping cart famously picks A — they'd rather let two carts diverge and reconcile later than block you from adding an item.
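The C-vs-A choice during a partition can be made concrete with a toy primary/replica pair (all class and method names here are illustrative): the CP read refuses to answer when it can't confirm freshness, while the AP read always answers, possibly with a stale value.

```python
class Replica:
    def __init__(self):
        self.data = {}

class Store:
    """Toy primary/replica pair with a simulated network partition."""
    def __init__(self):
        self.primary = Replica()
        self.replica = Replica()
        self.partitioned = False

    def write(self, key, value):
        self.primary.data[key] = value
        if not self.partitioned:  # replication only happens while connected
            self.replica.data[key] = value

    def read_cp(self, key):
        # CP choice: refuse to answer unless we can confirm the latest value.
        if self.partitioned:
            raise RuntimeError("unavailable: cannot confirm latest value")
        return self.replica.data.get(key)

    def read_ap(self, key):
        # AP choice: always answer, even if the value may be stale.
        return self.replica.data.get(key)

store = Store()
store.write("balance", 100)
store.partitioned = True    # cable cut between primary and replica
store.write("balance", 80)  # only the primary sees this write

print(store.read_ap("balance"))  # → 100 (stale, but available)
# store.read_cp("balance")       # raises: consistent, but unavailable
```

Same data, same partition — the only difference is which error mode you've decided your callers should see.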
PACELC — The Real Trade-off
CAP only describes what happens during a network partition. But partitions are rare. What about the 99.9% of the time when the network is fine?
PACELC extends CAP with this insight:
• If a Partition happens (P), choose Availability (A) or Consistency (C).
• Else (E), even when everything's healthy, choose Latency (L) or Consistency (C).
Why is there even a tradeoff when the network is healthy? Because keeping multiple replicas perfectly in sync requires waiting for them to acknowledge each write — synchronous replication. That waiting adds latency. Asynchronous replication is much faster but can lose recent writes if a node dies before its update propagates.
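A crude way to see the L-vs-C half of PACELC is to model the two write paths directly. In this sketch (names and delays invented for illustration), the synchronous write blocks on a per-replica "round trip" before acknowledging, while the asynchronous write acknowledges immediately and leaves the replicas behind:

```python
import time

class Replicated:
    """Toy write path: synchronous waits for replicas, asynchronous doesn't."""
    def __init__(self, replica_delay=0.01, n_replicas=2):
        self.replica_delay = replica_delay
        self.primary = {}
        self.replicas = [dict() for _ in range(n_replicas)]

    def write_sync(self, key, value):
        self.primary[key] = value
        for r in self.replicas:              # block until every replica acks
            time.sleep(self.replica_delay)   # one network round trip per replica
            r[key] = value                   # durable everywhere before returning

    def write_async(self, key, value):
        self.primary[key] = value  # ack immediately; replicate in the background
        # (a crash right now would lose this write on the replicas)

db = Replicated()
t0 = time.perf_counter(); db.write_sync("a", 1); sync_s = time.perf_counter() - t0
t0 = time.perf_counter(); db.write_async("b", 2); async_s = time.perf_counter() - t0
print(sync_s > async_s)  # → True: consistency is bought with latency
print(db.replicas[0])    # → {'a': 1}: the async write hasn't propagated
```

The EL/EC column in the table below is exactly this choice, made as a default by each database vendor.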
PostgreSQL (default): PC + EC — strong consistency, accepts higher latency
DynamoDB: PA + EL — fast and available, eventual consistency
Cassandra: PA + EL — same, tunable per-query
MongoDB (default): PA + EC — tunable per operation via read/write concerns
What this means for you as an engineer: every database has a default position on this matrix, and most have knobs to tune it per query. When you write a query, you're implicitly choosing where on the latency/consistency axis to sit. Senior engineers make that choice consciously.
Eventual Consistency in Practice
Eventual consistency means: "if you stop writing, eventually all replicas will agree." It does NOT mean your reads always see the latest data.
Practical implications:
You write a comment. The user immediately refreshes. The comment isn't there. They refresh again — now it's there. This is eventual consistency leaking through. It isn't a bug to be fixed — it's the visible cost of a deliberate design choice, and handling it well is also a design choice.
Common eventual-consistency situations:
• Database read replicas — primary writes; replicas lag by milliseconds
• CDN caches — origin updated; edge nodes haven't refreshed
• Search indexes — Elasticsearch reindex takes a second or two
• Cross-region databases — write in US, read in EU; replication adds latency
• Pub/Sub messaging — sender done; subscribers haven't received yet
How to design for it:
1. Read-your-own-writes — after a write, route THAT user's reads to the primary briefly. Other users can still hit replicas.
2. Optimistic UI — show the new comment immediately on the writer's screen, even before the server confirms. They never see the lag.
3. Versioning — include a version or timestamp; clients can detect "I have a newer view than this".
4. Acceptance — for some data (search results, recommendations, social feeds), eventual consistency is just fine. Don't pay for strong consistency where it doesn't matter.
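Pattern 1 above can be sketched in a few lines of Python (the `Router` class and its names are hypothetical, not a real library): after a user writes, their reads are pinned to the primary for a short window, while everyone else keeps reading a possibly lagging replica.

```python
import time

class Router:
    """Read-your-own-writes: pin a user to the primary briefly after they write."""
    def __init__(self, primary, replica, pin_seconds=5.0):
        self.primary, self.replica = primary, replica
        self.pin_seconds = pin_seconds
        self.last_write = {}  # user_id -> monotonic time of that user's last write

    def write(self, user_id, key, value):
        self.primary[key] = value
        self.last_write[user_id] = time.monotonic()
        # (replication to the replica happens later, asynchronously)

    def read(self, user_id, key):
        last = self.last_write.get(user_id)
        if last is not None and time.monotonic() - last < self.pin_seconds:
            return self.primary.get(key)  # this user just wrote: read the primary
        return self.replica.get(key)      # everyone else can hit a lagging replica

primary, replica = {}, {}  # replica lags: nothing has been copied yet
router = Router(primary, replica)
router.write("u1", "comment", "hello")
print(router.read("u1", "comment"))  # → hello (writer sees their own write)
print(router.read("u2", "comment"))  # → None (other users may still see the lag)
```

Real systems implement the same idea with sticky sessions, session tokens, or replication-position (LSN) checks, but the routing decision is the same shape.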
The hardest skill: knowing for each piece of data whether stale-by-a-second is OK or catastrophic. This is what you're really designing when you make consistency choices.