Distributed Systems
Consensus (Raft and Paxos), distributed locks, quorum reads and writes, vector clocks and CRDTs, gossip, clock synchronisation — the primitives that hold distributed systems together.
The problem distributed systems solve
A single machine is a wonderful thing. Memory is shared, the clock advances monotonically, calls are reliable, and "now" means the same thing to every part of the program. None of that survives the move to multiple machines.
In a distributed system, every assumption that worked on one box has to be re-established. Machines can fail independently. The network drops packets and reorders them. Clocks drift. "What time is it?" has no single answer. "What is the current value of X?" can be answered five different ways by five replicas, all simultaneously correct from their own perspective.
The primitives in this module — consensus, locks, quorums, vector clocks, gossip, clock sync — are what we build to recover some of the guarantees we lost. None of them gets us back to single-machine simplicity. Each of them buys back one specific property, at a specific cost. Knowing which primitive solves which problem is the difference between an architecture that quietly works and one that quietly corrupts data.
A framing that helps: in a distributed system, agreement is expensive. Every protocol in this module is about getting a useful form of agreement — that all nodes see the same leader, that a write is durable, that two events were truly concurrent — and paying as little as possible for it.
Consensus — Raft and Paxos
Consensus is the problem of getting a cluster of nodes to agree on a single value, despite some of them being slow, crashed, or unreachable. Every distributed database, every replicated state machine, every coordination service (etcd, ZooKeeper, Consul) has a consensus algorithm at its core.
The problem sounds simple. Three nodes need to agree on "who is leader?" Pick a node and announce it. The issues:
- Two nodes simultaneously think they're leader (split brain — both accept writes, both diverge).
- A node was leader, crashed, came back; what now?
- Network partition: two halves of the cluster can't talk; what happens?
The FLP impossibility result (Fischer-Lynch-Paterson, 1985) proved that consensus is impossible in a fully asynchronous system with even one faulty node. Real systems get around this by using timeouts — they assume that if you haven't heard from a node in T seconds, it's dead for practical purposes. This trades guaranteed termination for the ability to make progress.
Paxos (Leslie Lamport, 1989) was the first practical consensus algorithm. It is mathematically clean and notoriously hard to understand and implement correctly. It uses a two-phase protocol — prepare and accept — with proposers, acceptors, and learners. Multi-Paxos extends it to a sequence of values (a log).
Raft (Ongaro and Ousterhout, 2014) was designed explicitly to be easier to understand and implement. It is the consensus algorithm of choice for new systems: etcd, Consul, CockroachDB, TiKV, Kafka (KRaft mode), and many more.
Raft's structure:
Roles:
Leader — exactly one at a time, handles all writes
Followers — replicate the leader's log
Candidates — temporary, during election
Election: when followers stop hearing from the leader (timeout):
1. A follower becomes a candidate, increments term
2. Asks all others for their vote
3. Wins if it gets a majority
4. Becomes leader; resumes normal operation
Replication: leader appends to its log, then:
1. Sends AppendEntries to all followers
2. Waits for majority (quorum) to acknowledge
3. Marks entry committed; tells followers next time
4. Followers apply committed entries to their state machine
Three properties that matter:
- Strong leader. All writes go through the leader. This simplifies reasoning at the cost of throughput — the leader is the bottleneck.
- Term-based voting. Each new election bumps the "term" number. A node with an older term defers to a node with a newer term. This prevents split-brain from past leaders coming back online.
- Majority quorum. A cluster of N tolerates
(N-1)/2failures. Three-node cluster tolerates 1 failure; five-node tolerates 2. Even numbers are wasted — a 4-node cluster also tolerates only 1 failure (you still need 3 to make a majority) but costs more.
When you should care about Raft:
- You're choosing a coordination service or distributed database — knowing it uses Raft tells you its failure modes.
- You're building a system that needs strong consistency on a small set of metadata (leader info, cluster config, distributed locks).
- You're trying to understand why a 3-AZ deployment with one AZ down still works, but a 2-AZ deployment can lose quorum.
When you shouldn't: don't implement Raft yourself unless you are building infrastructure software. Use etcd, ZooKeeper, or a managed equivalent. The implementation details are subtle and the failure modes are vicious.
Distributed locks
A lock in a single process is trivial — a mutex, a few atomic instructions. A lock across a distributed system is a small consensus problem in disguise. Getting it wrong corrupts data.
The naive distributed lock — "SET lock:resource value NX EX 30 in Redis" — works most of the time and fails in a way that is fascinating and instructive.
t=0: Client A acquires lock with 30s TTL
t=1: Client A starts work
t=5: Client A pauses (GC, network blip, container migration)
t=30: Lock expires; Client B acquires it; starts work
t=35: Client A wakes up, thinks it still holds the lock,
performs the protected action concurrently with B
─── disaster.
The sequence is short but the consequences are real — two payments processed instead of one, two emails sent, two slots booked. The bug is the assumption that "I have the lock" is a fact about the world, when it is only a fact about your recent past.
Fixes, in increasing rigour:
1. Fencing tokens. Every lock acquisition gets a monotonically-increasing token. Pass the token to the protected resource on every write. The resource rejects writes with a stale token.
Client A acquires lock → token 100
Client A pauses
Lock expires; Client B acquires → token 101
Client B writes with token 101 → accepted (server records last = 101)
Client A wakes up, writes with token 100 → REJECTED (stale)
This is what databases like Postgres and disks like S3 conditional puts give you. The fencing token shifts the correctness boundary from "the lock service" to "the resource itself." It is the only correct distributed-lock pattern for state-changing operations.
2. Redlock. Redis's recommended algorithm for distributed locks — acquire the lock on a majority of independent Redis nodes within a bounded clock skew. Better than single-Redis locks; still doesn't survive certain network and clock-drift scenarios. Martin Kleppmann's critique of Redlock and Antirez's response are required reading if you're going to use it.
3. Consensus-backed locks. Locks built on etcd or ZooKeeper get their correctness from Raft / ZAB. They are slower (consensus round-trip per acquire) but their guarantees are strong. Use these when correctness matters more than throughput.
The pragmatic guidance:
- For coordination ("only one worker should run this cron job"): Redis SET NX is fine. The worst case is the job runs twice and you've made it idempotent anyway.
- For mutual exclusion on state changes ("only one process can decrement this counter"): use fencing tokens against an authoritative resource. Do not trust the lock alone.
- For leader election: use etcd or ZooKeeper. They handle this with a primitive ("ephemeral key with TTL and watch") that does the right thing.
The single best rule about distributed locks: prefer to not need one. Design for idempotency and optimistic concurrency wherever you can. A distributed lock is almost always a sign that the design could be reshaped.
Quorum reads and writes
In a system with N replicas, a quorum is a configurable number of replicas that must respond before a read or a write is considered complete. By tuning the read quorum (R) and write quorum (W) against the replica count (N), you trade availability for consistency.
The magic equation: R + W > N guarantees that any read sees the latest write.
N = 5 replicas
W = 3, R = 3 → R + W = 6 > 5 → strong consistency
A write hits 3 replicas. A read hits 3 replicas.
At least one replica is in both sets, so the read sees the write.
W = 1, R = 1 → R + W = 2 ≤ 5 → eventual consistency
Writes are fast; reads are fast; replicas may disagree.
W = 5, R = 1 → R + W = 6 > 5 → strong consistency
Slow writes (must wait for all replicas). Fast reads.
Loses availability if any replica is down.
The pattern is the core of Dynamo-style databases (Cassandra, Riak, DynamoDB) and is configurable per operation. A single application can:
- Use W=3, R=1 for product views (writes must be durable; reads must be fast and don't need the very latest).
- Use W=3, R=3 for inventory counts (must see the latest write).
- Use W=ALL, R=ONE for configuration (writes are rare; reads are everywhere).
When replicas disagree, read repair kicks in: the coordinator notices the divergence on a quorum read, picks the latest version (by timestamp or vector clock), and writes it back to the stale replicas. The cluster heals itself one read at a time.
Hinted handoff is the related technique for writes. If a node that should receive a write is temporarily unreachable, another node holds a "hint" — "deliver this write to node X when it comes back." When X recovers, the hints replay.
Two subtle points that bite people:
- Sloppy quorum. Some systems (Cassandra's default) accept writes on any N nodes if the canonical N replicas are down, instead of failing. This sacrifices the R+W>N guarantee — a subsequent read might not see the write because the read goes to the canonical replicas. Don't enable sloppy quorum without understanding the consequences.
- Eventual consistency is not optional. Even with R+W>N, there is a window between when the write returns success and when the late replicas catch up. A read directed straight at a stale replica during that window can return old data. The guarantee is about overlap, not about every replica being current.
The practical takeaway: quorums are a per-operation knob. Use it deliberately. A blanket "set everything to QUORUM" is fine for most apps; the cost shows up only when you optimise specific high-volume paths.
Vector clocks and CRDTs
When two replicas accept conflicting writes for the same key, someone has to decide what to do with both versions. The naive approach is "last write wins" by timestamp — but in a distributed system, you don't have a trustworthy global timestamp. You need a different way to express "which version came first."
Vector clocks are the classical answer. Each node maintains a counter, and every version of every key carries a vector of counters — one per node.
Node A increments its counter on each write of a key:
v1: { A:1, B:0 } (A wrote)
v2: { A:2, B:0 } (A wrote again)
Node B then writes:
v3: { A:2, B:1 } (B wrote, having seen v2)
But what if B writes without seeing v2?
v3': { A:0, B:1 } (B wrote, hadn't synced with A)
Compare v2 { A:2, B:0 } with v3' { A:0, B:1 }:
Neither is a subset of the other → CONCURRENT versions
Application must resolve the conflict (merge, prefer one, ask user)
With vector clocks, you can distinguish between:
- Causal versions (one version causally followed another — pick the later).
- Concurrent versions (both replicas wrote without seeing the other — true conflict).
Cassandra famously moved away from vector clocks to plain timestamps because the application-level conflict resolution turned out to be hard to use correctly. Riak still uses them; DynamoDB exposes a related concept (version numbers per item).
CRDTs (Conflict-free Replicated Data Types) take a different approach: design data structures whose merge operation is deterministically convergent, regardless of what order the operations arrived in. Two replicas that have seen the same set of operations always end up at the same state, even if they saw them in different orders.
Simple CRDTs:
- G-counter (grow-only counter). Each node has its own slot. To increment, increment your own slot. To read, sum all slots. To merge two replicas, take the max per slot. Trivially convergent.
- OR-set (observed-remove set). Add and remove elements, tagged with unique IDs. Add
(x, tag1)and(x, tag2). Remove only removes by tag. Merge is union for adds and union for removes, with adds minus removes giving the final set. - LWW-element-set. Each element has a timestamp; merge keeps the latest timestamp per element. Simple, but suffers from clock issues.
More complex CRDTs underpin real systems: Yjs and Automerge for collaborative editors (every keystroke is a CRDT op, replicas converge), Riak's data types, Redis's CRDT module (Active-Active).
When to reach for these:
- Multi-region active-active writes where you can't afford a consensus round-trip per write.
- Collaborative offline-first apps (Figma, Linear) — every client is a replica, sometimes disconnected, must converge when they reconnect.
- High-volume counters (likes, view counts) where exact ordering doesn't matter and you want each region to write locally.
When not to: anything that needs strong ordering (financial ledgers, inventory). CRDTs give you convergence, not strong consistency.
Gossip protocols
When a cluster has hundreds or thousands of nodes, having every node know about every other node — and react to every membership change — gets expensive. The naive approaches (broadcast every change, query a central registry) either scale badly or introduce a SPOF.
Gossip is the elegant alternative. Each node periodically picks a small number of random peers and exchanges state with them. Over time — surprisingly quickly — information propagates through the entire cluster.
Round 1: Node A talks to Node B
A and B now both know each other's view
Round 2: A talks to C, B talks to D
Now A, B, C, D have partial overlap
Round 3: A→E, B→F, C→G, D→H
Spreading exponentially — each round roughly doubles informed nodes
After log₂(N) rounds, information reaches the whole cluster
The math is the same as epidemic spread: an exponential growth in informed nodes, leading to whole-cluster awareness in O(log N) rounds. A 1000-node cluster converges in about 10 rounds. If each round is 1 second, full propagation is 10 seconds — for any size cluster.
What gossip is used for:
- Membership and failure detection. "Which nodes are alive?" Cassandra, Consul, Serf, Hashicorp's memberlist all use gossip-based failure detection. Each node tracks peers' heartbeats and propagates suspected failures.
- Configuration spread. Gossip a config change to all nodes without a central distribution server.
- CRDT state propagation. The increments and tombstones of a CRDT can be gossiped — eventually all replicas see all updates.
Gossip is eventually consistent by design. It is great for converging facts that don't need instant agreement (membership, soft state) and bad for facts that do ("who is the leader?"). Use consensus for the latter, gossip for the former — often in the same system.
The SWIM protocol (and its descendant Lifeguard, used by HashiCorp's memberlist) is the modern reference for gossip-based failure detection. Worth reading if you're going to run a cluster that uses it under the hood — the failure detector's behaviour during partitions is the source of many surprising operational moments.
Clock synchronisation
"What time is it?" is the question that ruins distributed systems. Different nodes have different clocks; they drift apart; they sometimes jump forward or backward. A naive now() in a distributed log gives you events that arrive after they happened, conflicting timestamps for ordering, and bugs that only show up when one node's NTP daemon is unhappy.
The baseline mechanisms:
NTP (Network Time Protocol). Public NTP servers (pool.ntp.org) sync your clock to within ~5-50 ms of real time over the internet, and within 1 ms on a LAN. Good enough for logging, scheduling, and most non-critical timestamping. Susceptible to leap seconds, network delays, and occasional bad upstream sources.
PTP (Precision Time Protocol). Hardware-assisted clock sync over a LAN, accurate to microseconds. Used in finance (Nasdaq) and telecom (5G base stations). Requires PTP-aware switches.
Google's TrueTime. Used by Spanner. Combines GPS receivers and atomic clocks in each datacentre to produce a time interval — "the current time is between T and T + ε" — with ε typically 1-7 ms. Spanner uses this for globally-consistent transaction ordering: a transaction waits out the uncertainty interval before committing, guaranteeing that any later transaction will see a strictly-greater timestamp.
AWS Time Sync Service / Amazon Time Sync (PTP-accurate). AWS now exposes microsecond-accurate time via PTP on EC2 instances in certain configurations.
What to do with these tools:
- Don't trust monotonic ordering across machines. Two events on two machines with timestamps 100 ms apart could have happened in either order if NTP drift is bigger than 100 ms. Use logical clocks (vector clocks, Lamport timestamps) when you need cross-machine ordering.
- Use the OS's monotonic clock for measuring durations ("how long did this RPC take?"). Wall-clock time can jump backward when NTP corrects; monotonic time is guaranteed not to.
- Be aware of clock skew in your protocols. A token expiring at "2026-05-21 12:00:00" can be rejected by a server whose clock is one second ahead. Allow a small skew tolerance (5-30 seconds) on time-based validations.
- Logical clocks for causality. Lamport timestamps are a single counter per node; vector clocks track per-node counters. Either gives you ordering that respects causality without needing physical time agreement.
A hard-won lesson: never order events across machines using wall-clock timestamps alone. A user's updated_at of "yesterday" might genuinely be more recent than another's updated_at of "tomorrow" if the clocks disagree. If ordering matters, attach a logical clock or a sequence number from a single source of truth.
The distributed-systems primitives in this module are not application code most of the time — they live inside databases, coordination services, and infrastructure libraries. But they leak through. The retry budget in Module 6, the quorum tuning in Module 2, the saga's eventual consistency in Module 5, all rest on the realities of consensus, quorums, and clocks covered here. When a production weirdness shows up, the answer is almost always somewhere in this module. The next module puts all of it together — designing real systems end to end.
⁂ Back to all modules