
Distributed Systems & CAP

The 8 fallacies, CAP theorem, eventual consistency — what changes when you add a second computer.


What Makes Something 'Distributed'

A distributed system is any system that runs across more than one computer. The moment you add a second server, a load balancer, a separate database — even just a cache — you've crossed into distributed territory.

This sounds harmless. It isn't. The instant your system spans more than one machine, you inherit a long list of problems that don't exist on a single computer:

• Partial failure — one machine can crash while the others keep running, leaving work half-done
• Unreliable messaging — requests and responses can be lost, delayed, or duplicated in transit
• No shared clock — machines disagree about what time it is, so "which write came first?" gets hard
• No shared memory — every machine has its own, possibly stale, view of the system's state

The single most useful skill in modern backend engineering is recognizing that you're crossing into distributed territory and adjusting your assumptions accordingly. Most production bugs aren't bugs in your code — they're bugs in your mental model of the system.


The 8 Fallacies of Distributed Computing

In the 1990s, Peter Deutsch and fellow engineers at Sun Microsystems wrote down the assumptions developers reflexively make about networks, and why every single one of them is wrong.

Text
1. The network is reliable                  ✗ Packets drop. Connections die.
2. Latency is zero                          ✗ Every call takes time.
3. Bandwidth is infinite                    ✗ Networks have finite capacity.
4. The network is secure                    ✗ Assume eavesdroppers.
5. Topology doesn't change                  ✗ Servers come and go.
6. There is one administrator               ✗ Multiple teams, multiple clouds.
7. Transport cost is zero                   ✗ Bytes cost money.
8. The network is homogeneous               ✗ Different versions, different protocols.

Every distributed system bug you'll ever debug is some flavor of "I assumed one of these but actually it wasn't true."

A simple example: in a single process, calling createUser("ada") either returns the new user or throws; those are the only outcomes. Across a network, there's a third possibility: the call succeeds on the server, but the response never reaches you. Did the user get created? You don't know. This third state, "succeeded but you don't know it", is the source of an enormous fraction of distributed bugs.

The cure isn't to memorize these fallacies. It's to develop the reflex of asking "what happens if this network call disappears halfway?" before writing any code that crosses a network boundary.
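One standard defense is to make the write idempotent and retry it. Below is a minimal sketch in TypeScript (Node 18+, where fetch is global); the endpoint URL and the Idempotency-Key header behavior are illustrative assumptions, not any specific product's API:

TypeScript
// Retry a write safely. The idempotency key lets the server deduplicate,
// so "succeeded but the response was lost" plus a retry is harmless.
import { randomUUID } from "node:crypto";

async function createUserWithRetry(name: string, maxAttempts = 3): Promise<Response> {
  // One key for ALL attempts of this logical operation -- that's the whole trick.
  const idempotencyKey = randomUUID();
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const res = await fetch("https://api.example.com/users", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Idempotency-Key": idempotencyKey, // server replays the stored result on duplicates
        },
        body: JSON.stringify({ name }),
      });
      if (res.ok) return res;
      lastError = new Error(`HTTP ${res.status}`); // server-side error: retry
    } catch (err) {
      // Network failure: we can't tell "never arrived" from "arrived, reply lost".
      // Retrying with the same key is safe either way.
      lastError = err;
    }
  }
  throw lastError;
}

Because every attempt carries the same key, the server can detect a duplicate and replay the stored result instead of creating a second user, which makes the "succeeded but you don't know it" state survivable.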


CAP Theorem — Pick Two of Three

The CAP theorem is the one piece of distributed-systems theory every backend engineer should know by heart. Eric Brewer formulated it in 2000, and Gilbert and Lynch proved it formally in 2002.

A distributed system can guarantee at most two of these three properties at the same time:

• Consistency (C) — every read sees the most recent write, or gets an error
• Availability (A) — every request gets a non-error response, though not necessarily the latest data
• Partition tolerance (P) — the system keeps working even when network messages between nodes are lost or delayed

The crucial insight: in any real distributed system, network partitions WILL happen. Cables get cut, switches fail, cloud regions hiccup. So P isn't really optional — you must tolerate partitions. Which means the actual choice is between C and A *during a partition*.

Text
                    ┌────────────┐
                    │ Network    │
                    │ Partition  │
                    └─────┬──────┘
                          │
              ┌───────────┴────────────┐
              │                        │
              ▼                        ▼
    Pick Consistency          Pick Availability
    (CP system)               (AP system)
              │                        │
              ▼                        ▼
   Refuse to answer          Return possibly stale
   if not 100% sure          data, reconcile later
              │                        │
              ▼                        ▼
   PostgreSQL with           Cassandra, DynamoDB,
   single primary,           Couchbase, most caches
   most banks

Real-world examples of the choice:

• CP — a bank refuses an ATM withdrawal during a partition rather than risk letting the same money be withdrawn twice
• AP — a shopping cart or social feed keeps serving slightly stale data during a partition and reconciles once it heals
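To make the fork in the diagram concrete, here is a hedged sketch of one read served both ways; readQuorum and the local cache are hypothetical stand-ins, and the partition is simulated by always throwing:

TypeScript
type Balance = { value: number; asOf: Date };

// Hypothetical quorum read; here it always throws to simulate a partition.
async function readQuorum(_accountId: string): Promise<Balance> {
  throw new Error("partition: majority of replicas unreachable");
}

// Last value this node happened to see before the partition.
const localCache = new Map<string, Balance>([
  ["acct-1", { value: 100, asOf: new Date("2024-01-01") }],
]);

// CP behavior: refuse to answer unless a majority of replicas agree.
async function getBalanceCP(accountId: string): Promise<Balance> {
  return readQuorum(accountId); // throws during a partition; caller returns a 503
}

// AP behavior: always answer, even if the answer may be stale.
async function getBalanceAP(accountId: string): Promise<Balance> {
  try {
    return await readQuorum(accountId);
  } catch {
    const stale = localCache.get(accountId);
    if (stale) return stale; // possibly out of date; reconcile after the partition heals
    throw new Error("no local copy to fall back on");
  }
}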


PACELC — The Real Trade-off

CAP only describes what happens during a network partition. But partitions are rare. What about the 99.9% of the time when the network is fine?

PACELC extends CAP with this insight:
• If a Partition happens (P), choose Availability (A) or Consistency (C).
• Else (E), even when everything's healthy, choose Latency (L) or Consistency (C).

Why is there even a tradeoff when the network is healthy? Because keeping multiple replicas perfectly in sync requires waiting for them to acknowledge each write — synchronous replication. That waiting adds latency. Asynchronous replication is much faster but can lose recent writes if a node dies before its update propagates.
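A small simulation of that trade-off (the replica delays are made-up numbers, and the "replication" is just timers):

TypeScript
// Three replicas with different simulated round-trip times.
const replicaDelaysMs = [5, 12, 40];

const sendToReplica = (delayMs: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, delayMs));

// Synchronous replication: acknowledge the client only after ALL replicas
// confirm, so write latency equals the SLOWEST replica (~40 ms here).
async function writeSync(): Promise<void> {
  await Promise.all(replicaDelaysMs.map(sendToReplica));
}

// Asynchronous replication: acknowledge immediately and ship the write in
// the background. Near-zero latency, but if this node dies in the next few
// milliseconds, the write exists nowhere else.
async function writeAsync(): Promise<void> {
  replicaDelaysMs.forEach(sendToReplica); // fire and forget
}

async function main() {
  let t = Date.now();
  await writeSync();
  console.log(`sync write:  ${Date.now() - t} ms`); // ~40
  t = Date.now();
  await writeAsync();
  console.log(`async write: ${Date.now() - t} ms`); // ~0
}
main();

The synchronous write is exactly as slow as the slowest replica; the asynchronous one returns immediately, at the cost of briefly holding the only copy.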

Text
PostgreSQL (default):    PC + EC   — strong consistency, accepts higher latency
DynamoDB:                PA + EL   — fast and available, eventual consistency
Cassandra:               PA + EL   — same, tunable per-query
MongoDB (default):       PA + EC   — varies; depends on read/write concern

What this means for you as an engineer: every database has a default position on this matrix, and most have knobs to tune it per query. When you write a query, you're implicitly choosing where on the latency/consistency axis to sit. Senior engineers make that choice consciously.
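For instance, Cassandra's Node.js driver (the cassandra-driver package) lets you set the consistency level per statement; the keyspace, tables, and contact point below are made up:

TypeScript
import cassandra from "cassandra-driver";

const client = new cassandra.Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "shop", // hypothetical keyspace
});
const { consistencies } = cassandra.types;

async function demo() {
  // Fast, possibly stale: a single replica may answer (the EL side of PACELC).
  await client.execute(
    "SELECT * FROM products WHERE id = ?", ["p-42"],
    { prepare: true, consistency: consistencies.one },
  );
  // Slower, stronger: a majority of replicas must agree (leaning toward C).
  await client.execute(
    "UPDATE accounts SET balance = ? WHERE id = ?", [100, "a-7"],
    { prepare: true, consistency: consistencies.quorum },
  );
}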


Eventual Consistency in Practice

Eventual consistency means: "if you stop writing, eventually all replicas will agree." It does NOT mean your reads always see the latest data.

Practical implications:

You write a comment. The user immediately refreshes. The comment isn't there. They refresh again, and now it's there. That's eventual consistency leaking through. There is no bug to fix: the lag is the consequence of a deliberate design choice, and you handle it with the patterns below.

Common eventual-consistency situations:
• Database read replicas — primary writes; replicas lag by milliseconds
• CDN caches — origin updated; edge nodes haven't refreshed
• Search indexes — new Elasticsearch documents become searchable only after the refresh interval (about a second by default)
• Cross-region databases — write in US, read in EU; replication adds latency
• Pub/Sub messaging — sender done; subscribers haven't received yet

How to design for it:

1. Read-your-own-writes — after a write, route THAT user's reads to the primary briefly. Other users can still hit replicas (see the sketch after this list).

2. Optimistic UI — show the new comment immediately on the writer's screen, even before the server confirms. They never see the lag.

3. Versioning — include a version or timestamp; clients can detect "I have a newer view than this".

4. Acceptance — for some data (search results, recommendations, social feeds), eventual consistency is just fine. Don't pay for strong consistency where it doesn't matter.
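A minimal sketch of pattern 1, read-your-own-writes; the primary/replica connections are stubs, and the 2-second window is an assumed upper bound on replication lag:

TypeScript
// Route a recent writer's reads to the primary until replicas catch up.
const REPLICATION_LAG_MS = 2_000; // assumed bound on replica lag
const lastWriteAt = new Map<string, number>(); // userId -> ms timestamp

// Stub connections standing in for real primary/replica pools.
const primary = { query: async (_sql: string, _params: unknown[]) => [] as unknown[] };
const replica = { query: async (_sql: string, _params: unknown[]) => [] as unknown[] };

async function writeComment(userId: string, postId: string, body: string) {
  await primary.query(
    "INSERT INTO comments (user_id, post_id, body) VALUES ($1, $2, $3)",
    [userId, postId, body],
  );
  lastWriteAt.set(userId, Date.now()); // remember that this user just wrote
}

async function readComments(userId: string, postId: string) {
  const wrote = lastWriteAt.get(userId) ?? 0;
  // The writer reads from the primary for a short window; everyone else hits replicas.
  const db = Date.now() - wrote < REPLICATION_LAG_MS ? primary : replica;
  return db.query("SELECT * FROM comments WHERE post_id = $1", [postId]);
}

The writer never observes the lag, while the vast majority of read traffic still lands on replicas.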

The hardest skill: knowing for each piece of data whether stale-by-a-second is OK or catastrophic. This is what you're really designing when you make consistency choices.

