Home
High-Level System Design / Module 7 — Distributed Systems

Distributed Systems

Consensus (Raft and Paxos), distributed locks, quorum reads and writes, vector clocks and CRDTs, gossip, clock synchronisation — the primitives that hold distributed systems together.


The problem distributed systems solve

A single machine is a wonderful thing. Memory is shared, the clock advances monotonically, calls are reliable, and "now" means the same thing to every part of the program. None of that survives the move to multiple machines.

In a distributed system, every assumption that worked on one box has to be re-established. Machines can fail independently. The network drops packets and reorders them. Clocks drift. "What time is it?" has no single answer. "What is the current value of X?" can be answered five different ways by five replicas, all simultaneously correct from their own perspective.

The primitives in this module — consensus, locks, quorums, vector clocks, gossip, clock sync — are what we build to recover some of the guarantees we lost. None of them gets us back to single-machine simplicity. Each of them buys back one specific property, at a specific cost. Knowing which primitive solves which problem is the difference between an architecture that quietly works and one that quietly corrupts data.

A framing that helps: in a distributed system, agreement is expensive. Every protocol in this module is about getting a useful form of agreement — that all nodes see the same leader, that a write is durable, that two events were truly concurrent — and paying as little as possible for it.

Consensus — Raft and Paxos

Consensus is the problem of getting a cluster of nodes to agree on a single value, despite some of them being slow, crashed, or unreachable. Every distributed database, every replicated state machine, every coordination service (etcd, ZooKeeper, Consul) has a consensus algorithm at its core.

The problem sounds simple. Three nodes need to agree on "who is leader?" Pick a node and announce it. The issues:

The FLP impossibility result (Fischer-Lynch-Paterson, 1985) proved that consensus is impossible in a fully asynchronous system with even one faulty node. Real systems get around this by using timeouts — they assume that if you haven't heard from a node in T seconds, it's dead for practical purposes. This trades guaranteed termination for the ability to make progress.

Paxos (Leslie Lamport, 1989) was the first practical consensus algorithm. It is mathematically clean and notoriously hard to understand and implement correctly. It uses a two-phase protocol — prepare and accept — with proposers, acceptors, and learners. Multi-Paxos extends it to a sequence of values (a log).

Raft (Ongaro and Ousterhout, 2014) was designed explicitly to be easier to understand and implement. It is the consensus algorithm of choice for new systems: etcd, Consul, CockroachDB, TiKV, Kafka (KRaft mode), and many more.

Raft's structure:

text
   Roles:
     Leader     — exactly one at a time, handles all writes
     Followers  — replicate the leader's log
     Candidates — temporary, during election

   Election: when followers stop hearing from the leader (timeout):
     1. A follower becomes a candidate, increments term
     2. Asks all others for their vote
     3. Wins if it gets a majority
     4. Becomes leader; resumes normal operation

   Replication: leader appends to its log, then:
     1. Sends AppendEntries to all followers
     2. Waits for majority (quorum) to acknowledge
     3. Marks entry committed; tells followers next time
     4. Followers apply committed entries to their state machine

Three properties that matter:

When you should care about Raft:

When you shouldn't: don't implement Raft yourself unless you are building infrastructure software. Use etcd, ZooKeeper, or a managed equivalent. The implementation details are subtle and the failure modes are vicious.

Distributed locks

A lock in a single process is trivial — a mutex, a few atomic instructions. A lock across a distributed system is a small consensus problem in disguise. Getting it wrong corrupts data.

The naive distributed lock — "SET lock:resource value NX EX 30 in Redis" — works most of the time and fails in a way that is fascinating and instructive.

text
   t=0:   Client A acquires lock with 30s TTL
   t=1:   Client A starts work
   t=5:   Client A pauses (GC, network blip, container migration)
   t=30:  Lock expires; Client B acquires it; starts work
   t=35:  Client A wakes up, thinks it still holds the lock,
          performs the protected action concurrently with B
   ─── disaster.

The sequence is short but the consequences are real — two payments processed instead of one, two emails sent, two slots booked. The bug is the assumption that "I have the lock" is a fact about the world, when it is only a fact about your recent past.

Fixes, in increasing rigour:

1. Fencing tokens. Every lock acquisition gets a monotonically-increasing token. Pass the token to the protected resource on every write. The resource rejects writes with a stale token.

text
   Client A acquires lock → token 100
   Client A pauses
   Lock expires; Client B acquires → token 101
   Client B writes with token 101 → accepted (server records last = 101)
   Client A wakes up, writes with token 100 → REJECTED (stale)

This is what databases like Postgres and disks like S3 conditional puts give you. The fencing token shifts the correctness boundary from "the lock service" to "the resource itself." It is the only correct distributed-lock pattern for state-changing operations.

2. Redlock. Redis's recommended algorithm for distributed locks — acquire the lock on a majority of independent Redis nodes within a bounded clock skew. Better than single-Redis locks; still doesn't survive certain network and clock-drift scenarios. Martin Kleppmann's critique of Redlock and Antirez's response are required reading if you're going to use it.

3. Consensus-backed locks. Locks built on etcd or ZooKeeper get their correctness from Raft / ZAB. They are slower (consensus round-trip per acquire) but their guarantees are strong. Use these when correctness matters more than throughput.

The pragmatic guidance:

The single best rule about distributed locks: prefer to not need one. Design for idempotency and optimistic concurrency wherever you can. A distributed lock is almost always a sign that the design could be reshaped.

Quorum reads and writes

In a system with N replicas, a quorum is a configurable number of replicas that must respond before a read or a write is considered complete. By tuning the read quorum (R) and write quorum (W) against the replica count (N), you trade availability for consistency.

The magic equation: R + W > N guarantees that any read sees the latest write.

text
   N = 5 replicas
   W = 3, R = 3  → R + W = 6 > 5  → strong consistency
     A write hits 3 replicas. A read hits 3 replicas.
     At least one replica is in both sets, so the read sees the write.

   W = 1, R = 1  → R + W = 2 ≤ 5  → eventual consistency
     Writes are fast; reads are fast; replicas may disagree.

   W = 5, R = 1  → R + W = 6 > 5  → strong consistency
     Slow writes (must wait for all replicas). Fast reads.
     Loses availability if any replica is down.

The pattern is the core of Dynamo-style databases (Cassandra, Riak, DynamoDB) and is configurable per operation. A single application can:

When replicas disagree, read repair kicks in: the coordinator notices the divergence on a quorum read, picks the latest version (by timestamp or vector clock), and writes it back to the stale replicas. The cluster heals itself one read at a time.

Hinted handoff is the related technique for writes. If a node that should receive a write is temporarily unreachable, another node holds a "hint" — "deliver this write to node X when it comes back." When X recovers, the hints replay.

Two subtle points that bite people:

The practical takeaway: quorums are a per-operation knob. Use it deliberately. A blanket "set everything to QUORUM" is fine for most apps; the cost shows up only when you optimise specific high-volume paths.

Vector clocks and CRDTs

When two replicas accept conflicting writes for the same key, someone has to decide what to do with both versions. The naive approach is "last write wins" by timestamp — but in a distributed system, you don't have a trustworthy global timestamp. You need a different way to express "which version came first."

Vector clocks are the classical answer. Each node maintains a counter, and every version of every key carries a vector of counters — one per node.

text
   Node A increments its counter on each write of a key:
      v1: { A:1, B:0 }   (A wrote)
      v2: { A:2, B:0 }   (A wrote again)

   Node B then writes:
      v3: { A:2, B:1 }   (B wrote, having seen v2)

   But what if B writes without seeing v2?
      v3': { A:0, B:1 }  (B wrote, hadn't synced with A)

   Compare v2 { A:2, B:0 } with v3' { A:0, B:1 }:
     Neither is a subset of the other → CONCURRENT versions
     Application must resolve the conflict (merge, prefer one, ask user)

With vector clocks, you can distinguish between:

Cassandra famously moved away from vector clocks to plain timestamps because the application-level conflict resolution turned out to be hard to use correctly. Riak still uses them; DynamoDB exposes a related concept (version numbers per item).

CRDTs (Conflict-free Replicated Data Types) take a different approach: design data structures whose merge operation is deterministically convergent, regardless of what order the operations arrived in. Two replicas that have seen the same set of operations always end up at the same state, even if they saw them in different orders.

Simple CRDTs:

More complex CRDTs underpin real systems: Yjs and Automerge for collaborative editors (every keystroke is a CRDT op, replicas converge), Riak's data types, Redis's CRDT module (Active-Active).

When to reach for these:

When not to: anything that needs strong ordering (financial ledgers, inventory). CRDTs give you convergence, not strong consistency.

Gossip protocols

When a cluster has hundreds or thousands of nodes, having every node know about every other node — and react to every membership change — gets expensive. The naive approaches (broadcast every change, query a central registry) either scale badly or introduce a SPOF.

Gossip is the elegant alternative. Each node periodically picks a small number of random peers and exchanges state with them. Over time — surprisingly quickly — information propagates through the entire cluster.

text
   Round 1: Node A talks to Node B
     A and B now both know each other's view

   Round 2: A talks to C, B talks to D
     Now A, B, C, D have partial overlap

   Round 3: A→E, B→F, C→G, D→H
     Spreading exponentially — each round roughly doubles informed nodes

   After log₂(N) rounds, information reaches the whole cluster

The math is the same as epidemic spread: an exponential growth in informed nodes, leading to whole-cluster awareness in O(log N) rounds. A 1000-node cluster converges in about 10 rounds. If each round is 1 second, full propagation is 10 seconds — for any size cluster.

What gossip is used for:

Gossip is eventually consistent by design. It is great for converging facts that don't need instant agreement (membership, soft state) and bad for facts that do ("who is the leader?"). Use consensus for the latter, gossip for the former — often in the same system.

The SWIM protocol (and its descendant Lifeguard, used by HashiCorp's memberlist) is the modern reference for gossip-based failure detection. Worth reading if you're going to run a cluster that uses it under the hood — the failure detector's behaviour during partitions is the source of many surprising operational moments.

Clock synchronisation

"What time is it?" is the question that ruins distributed systems. Different nodes have different clocks; they drift apart; they sometimes jump forward or backward. A naive now() in a distributed log gives you events that arrive after they happened, conflicting timestamps for ordering, and bugs that only show up when one node's NTP daemon is unhappy.

The baseline mechanisms:

NTP (Network Time Protocol). Public NTP servers (pool.ntp.org) sync your clock to within ~5-50 ms of real time over the internet, and within 1 ms on a LAN. Good enough for logging, scheduling, and most non-critical timestamping. Susceptible to leap seconds, network delays, and occasional bad upstream sources.

PTP (Precision Time Protocol). Hardware-assisted clock sync over a LAN, accurate to microseconds. Used in finance (Nasdaq) and telecom (5G base stations). Requires PTP-aware switches.

Google's TrueTime. Used by Spanner. Combines GPS receivers and atomic clocks in each datacentre to produce a time interval — "the current time is between T and T + ε" — with ε typically 1-7 ms. Spanner uses this for globally-consistent transaction ordering: a transaction waits out the uncertainty interval before committing, guaranteeing that any later transaction will see a strictly-greater timestamp.

AWS Time Sync Service / Amazon Time Sync (PTP-accurate). AWS now exposes microsecond-accurate time via PTP on EC2 instances in certain configurations.

What to do with these tools:

A hard-won lesson: never order events across machines using wall-clock timestamps alone. A user's updated_at of "yesterday" might genuinely be more recent than another's updated_at of "tomorrow" if the clocks disagree. If ordering matters, attach a logical clock or a sequence number from a single source of truth.

The distributed-systems primitives in this module are not application code most of the time — they live inside databases, coordination services, and infrastructure libraries. But they leak through. The retry budget in Module 6, the quorum tuning in Module 2, the saga's eventual consistency in Module 5, all rest on the realities of consensus, quorums, and clocks covered here. When a production weirdness shows up, the answer is almost always somewhere in this module. The next module puts all of it together — designing real systems end to end.


⁂ Back to all modules