Idempotency & Distributed Patterns
Idempotency keys, sagas, the outbox pattern — making distributed work safe in the face of retries and partial failures.
Why Idempotency is the Most Underrated Concept
In a distributed system, you cannot tell the difference between "the request failed" and "the request succeeded but I never got the response." A timeout, a dropped connection, a server that died after acting but before replying — these all look identical to the caller.
The caller's only safe option is to retry. Which means: every operation that crosses a network must be safe to call multiple times. Otherwise you'll double-charge customers, send duplicate emails, create duplicate orders, and end up with corrupted state.
An operation is idempotent if calling it N times produces the same result as calling it once. The first call does the work; subsequent calls are no-ops.
GET, PUT, DELETE in HTTP are idempotent by design. POST is not. That's why POST is the dangerous verb — and why protecting POST endpoints with idempotency is one of the most important production techniques you'll learn.
Idempotency Keys — The Pattern Stripe Made Famous
For non-idempotent operations (especially payments), the standard pattern is the idempotency key.
The client generates a unique key for each operation it intends to perform exactly once. It includes that key with every retry of the same logical operation.
Client:                          Server:

Generate key: abc123             ┌──────────────────────────┐
                                 │    Idempotency table     │
POST /charge                     │  ─────────────────────   │
Idempotency-Key: abc123     ──►  │  abc123 → result {ok}    │
{ amount: 5000 }                 └──────────────────────────┘

(timeout, no response)

POST /charge (retry)             Server sees abc123 already done
Idempotency-Key: abc123     ──►  Returns the cached result
{ amount: 5000 }                 (does NOT charge again)
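The client side of the flow above can be sketched as follows. The key is generated once per logical operation and reused on every retry; `send_request` is a hypothetical transport stub standing in for a real HTTP client:

```python
import uuid

def charge_with_retries(amount_cents, send_request, max_attempts=3):
    # Generate the key ONCE, before the first attempt -- never per attempt.
    # Every retry of this logical operation carries the same key.
    key = str(uuid.uuid4())
    last_error = None
    for _ in range(max_attempts):
        try:
            return send_request(
                "POST", "/charge",
                headers={"Idempotency-Key": key},
                body={"amount": amount_cents},
            )
        except TimeoutError as exc:
            last_error = exc  # safe to retry: same key, same operation
    raise last_error
```

The common mistake is generating the key inside the retry loop, which gives every retry a fresh key and defeats the whole mechanism.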
The server's logic:
1. Receive request with idempotency key.
2. Check if this key exists in storage.
- If yes → return the cached response. Do nothing else.
- If no → execute the operation, save the result keyed by the idempotency key, return it.
3. The check + save must be atomic (database transaction or unique constraint).
-- Atomic insert: succeeds for first request, fails for retries
INSERT INTO idempotency_keys (key, status, created_at)
VALUES ('abc123', 'in_progress', NOW())
ON CONFLICT (key) DO NOTHING
RETURNING *;
-- If RETURNING returned no rows, this is a retry. Look up the cached result.
-- If it returned a row, this is the first request. Execute the operation.
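The same three steps can be sketched end to end in application code. This is a minimal illustration using SQLite (where `INSERT OR IGNORE` plays the role of the Postgres `ON CONFLICT` clause above); `do_charge` is a hypothetical stand-in for the real operation:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, response TEXT)"
)

def handle_charge(key, amount, do_charge):
    # Atomic claim: exactly one request per key sees rowcount == 1.
    cur = conn.execute(
        "INSERT OR IGNORE INTO idempotency_keys (key, response) VALUES (?, NULL)",
        (key,),
    )
    if cur.rowcount == 0:
        # Retry: return the cached response, do nothing else.
        row = conn.execute(
            "SELECT response FROM idempotency_keys WHERE key = ?", (key,)
        ).fetchone()
        if row[0] is None:
            # First request claimed the key but hasn't finished yet.
            raise RuntimeError("original request still in progress; retry later")
        return json.loads(row[0])
    # First request: execute the operation, then cache the result.
    result = do_charge(amount)
    conn.execute(
        "UPDATE idempotency_keys SET response = ? WHERE key = ?",
        (json.dumps(result), key),
    )
    conn.commit()
    return result
```

A production version also needs a policy for keys claimed by a request that crashed mid-operation (e.g. expire `in_progress` rows after a timeout), which this sketch omits.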
Stripe's API accepts an Idempotency-Key header on its POST endpoints and is the best-known implementation of this pattern. Adopt it for your own payment, ordering, and notification endpoints.
The Outbox Pattern — Reliable Event Publishing
Common problem: you want to update your database AND publish an event to a message queue. Both must happen, or neither.
The naive approach:
1. Save order to database
2. Publish OrderCreated event to Kafka
What goes wrong:
• Step 1 succeeds, step 2 fails (Kafka unreachable). Order exists, no event. Inventory and email never get notified. Inconsistent state.
• Step 2 succeeds, step 1 fails. Event published for an order that doesn't exist. Subscribers process a phantom.
You can't put both in a database transaction because Kafka isn't a database. So how do you make them atomic?
The outbox pattern: write the event into the SAME database, then a separate process reads it out and publishes it.
┌─────────────────────────────────────────────────┐
│           Single database transaction           │
│  ┌────────────────┐    ┌────────────────────┐   │
│  │  orders table  │    │    outbox table    │   │
│  │ INSERT new row │    │  INSERT event row  │   │
│  └────────────────┘    └────────────────────┘   │
│      Either both commit or both rollback        │
└────────────┬────────────────────────────────────┘
             │
             ▼
  ┌────────────────────┐
  │  Outbox publisher  │  ← polls or uses CDC
  │ (separate process) │
  └────────┬───────────┘
           │
           ▼  publishes event, then marks the outbox row done
  ┌────────────────────┐
  │    Kafka / SNS     │
  └────────────────────┘
The outbox table holds events waiting to be published. A separate process (or Change Data Capture from the database log) reads from it and publishes to Kafka, marking each event as published once acknowledged.
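Both halves of the pattern can be sketched in a few lines. SQLite stands in for the real database, and `publish` is a hypothetical broker client (e.g. a thin wrapper around a Kafka producer):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")

def create_order(total):
    # One transaction: the order row and its event commit or roll back together.
    with conn:
        cur = conn.execute("INSERT INTO orders (total) VALUES (?)", (total,))
        event = {"type": "OrderCreated", "order_id": cur.lastrowid, "total": total}
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),)
        )
    return cur.lastrowid

def drain_outbox(publish):
    # The separate publisher process: publish first, THEN mark done.
    # A crash between the two steps means the event is re-published on the
    # next run -- at-least-once delivery, so consumers must be idempotent.
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
```

Note the ordering in `drain_outbox`: marking the row done before publishing would risk losing the event instead of duplicating it.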
Why it works:
• Database transaction is atomic — order and outbox row commit together
• If the publisher crashes before publishing, the next run picks up where it left off
• Events get published at-least-once (consumers must be idempotent — see above!)
This is the standard pattern for any "save data and emit event" workflow, and mature tooling exists for it: Debezium for log-based CDC, and transactional-outbox libraries in most modern frameworks (Eventuous is one example in the .NET ecosystem).
Sagas — Long-Running Distributed Transactions
Some workflows span multiple services and can't fit in one database transaction. Booking a trip might involve flight, hotel, and car services, each with its own database. If the hotel booking fails, you need to refund the flight and release the car.
A saga is a sequence of local transactions, each of which has a compensating action that can undo it.
Forward path:
1. Book flight → success
2. Book hotel → success
3. Book car → FAILS
Compensation path (run in reverse):
3. (skipped — never succeeded)
2. Cancel hotel booking
1. Cancel flight booking
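The forward and compensation paths above can be captured by a small orchestrator loop. Each step pairs an action with its compensating action; on failure, every step that completed is undone in reverse order. The service calls here are hypothetical stubs:

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs of zero-arg callables.

    Runs actions in order. If one raises, runs the compensations of all
    previously completed steps in reverse and returns False.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):  # compensation path
                undo()
            return False
        completed.append(compensation)
    return True
```

A real orchestrator would also persist saga state after every step so it can resume compensation after a crash, and compensations themselves must be idempotent and retried until they succeed.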
Two flavors:
Choreography saga: each service knows which event triggers it forward and which event triggers compensation. No central coordinator.
• Pros: no single point of failure, services stay loosely coupled
• Cons: hard to see the whole flow; cyclic event dependencies are easy to introduce
Orchestrated saga: a saga orchestrator service runs the workflow, calling each step and handling failures by triggering compensations.
• Pros: clear flow, easy to debug
• Cons: the orchestrator becomes a central component that must know about every service, and can drift toward a god object
When to use sagas: workflows that span service boundaries AND need a coherent success/failure outcome. Order processing, multi-step provisioning, refund/return flows.
What sagas don't give you: actual ACID atomicity. There's a window where flight is booked but hotel hasn't been booked yet — and an outsider could see the inconsistency. Saga = "eventual consistency with rollback semantics," not "distributed transaction." For most business processes, that's fine.
Two-Phase Commit — And Why You Shouldn't Use It
Two-Phase Commit (2PC) is the classical solution for distributed transactions. It's worth knowing — mostly so you can recognize when not to use it.
How it works:
Phase 1 (Prepare): A coordinator asks every participant "can you commit this?" Each participant locks the relevant data and responds yes or no.
Phase 2 (Commit/Rollback): If everyone said yes, the coordinator tells them all to commit. If anyone said no, the coordinator tells them all to roll back.
                Coordinator
                     │
     ┌───────────────┼───────────────┐
     │               │               │
     ▼               ▼               ▼
Participant A   Participant B   Participant C
 "prepare?"      "prepare?"      "prepare?"
     │               │               │
     ▼               ▼               ▼
   "yes"           "yes"           "yes"
     │               │               │
     └───────────────┼───────────────┘
                     │
                "all yes →
                 commit!"
                     │
     ┌───────────────┼───────────────┐
     ▼               ▼               ▼
  commit          commit          commit
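The two phases reduce to a short coordinator loop. This toy sketch shows only the happy-path message flow; real 2PC also requires durable logging at every step and a recovery protocol for a crashed coordinator (which is exactly where the blocking problem below comes from):

```python
class Participant:
    """Toy participant: votes in phase 1, then commits or rolls back."""
    def __init__(self, vote):
        self.vote = vote
        self.state = "idle"

    def prepare(self):
        self.state = "prepared"  # in a real system: lock data, log the vote
        return self.vote

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"

def two_phase_commit(participants):
    # Phase 1 (Prepare): ask everyone "can you commit this?"
    votes = [p.prepare() for p in participants]
    # Phase 2 (Commit/Rollback): unanimous yes -> commit; any no -> roll back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

The danger is between the phases: if the coordinator dies after collecting "yes" votes but before sending the phase-2 decision, every participant is stuck in the `prepared` state, holding locks, unable to safely commit or roll back on its own.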
Why 2PC sounds great: real distributed atomicity. Either everyone commits or nobody does.
Why 2PC has fallen out of favor:
• Blocking on partial failure — if the coordinator dies after Phase 1 but before Phase 2, participants stay locked indefinitely waiting
• Requires all participants to support the protocol — not all do
• Performance — every transaction has 2x the round trips, plus locks held longer
• The blocking property scales poorly across regions
In practice: most modern distributed systems prefer sagas + idempotency + outbox over 2PC across services. Databases that do offer distributed transactions (like Spanner, CockroachDB) layer consensus protocols (Paxos, Raft) beneath their commit protocol, so no single unreplicated coordinator can block the system the way classical 2PC can.
The takeaway: when you read about a distributed system supporting transactions, find out HOW. If it's 2PC across services, be cautious. If it's a saga-based eventual-consistency model, that's the modern norm.