Backend from First Principles / Module 31 — Idempotency & Distributed Patterns

Idempotency & Distributed Patterns

Idempotency keys, sagas, the outbox pattern — making distributed work safe in the face of retries and partial failures.


Why Idempotency Is the Most Underrated Concept

In a distributed system, you cannot tell the difference between "the request failed" and "the request succeeded but I never got the response." A timeout, a dropped connection, a server that died after acting but before replying — these all look identical to the caller.

The caller's only safe option is to retry. Which means: every operation that crosses a network must be safe to call multiple times. Otherwise you'll double-charge customers, send duplicate emails, create duplicate orders, and end up with corrupted state.

An operation is idempotent if calling it N times produces the same result as calling it once. The first call does the work; subsequent calls are no-ops.

GET, PUT, DELETE in HTTP are idempotent by design. POST is not. That's why POST is the dangerous verb — and why protecting POST endpoints with idempotency is one of the most important production techniques you'll learn.


Idempotency Keys — The Pattern Stripe Made Famous

For non-idempotent operations (especially payments), the standard pattern is the idempotency key.

The client generates a unique key for each operation it intends to perform exactly once. It includes that key with every retry of the same logical operation.

Text
Client:                          Server:
  Generate key: abc123             ┌──────────────────────────┐
                                   │  Idempotency table       │
  POST /charge                     │  ─────────────────────   │
    Idempotency-Key: abc123  ──►   │  abc123 → result {ok}    │
    { amount: 5000 }               └──────────────────────────┘

  (timeout, no response)

  POST /charge  (retry)            Server sees abc123 already done
    Idempotency-Key: abc123  ──►   Returns the cached result
    { amount: 5000 }               (does NOT charge again)
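The client side of this flow fits in a few lines. The essential point: the key is generated once per logical operation and reused on every retry. The `post` callable and `/charge` endpoint below are illustrative stand-ins, not a real client library:

```python
import uuid

def charge_with_retries(post, amount_cents, max_attempts=3):
    """Send one logical charge; reuse the same key across retries."""
    key = str(uuid.uuid4())  # generated ONCE per logical operation
    for attempt in range(max_attempts):
        try:
            # Every retry carries the SAME Idempotency-Key header,
            # so the server can deduplicate.
            return post("/charge",
                        headers={"Idempotency-Key": key},
                        json={"amount": amount_cents})
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error
```

The one bug to avoid: generating a fresh key inside the retry loop, which silently turns every retry into a brand-new charge.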

The server's logic:

1. Receive request with idempotency key.
2. Check if this key exists in storage.
- If yes → return the cached response. Do nothing else.
- If no → execute the operation, save the result keyed by the idempotency key, return it.
3. The check + save must be atomic (database transaction or unique constraint).

SQL
-- Atomic insert: succeeds for first request, fails for retries
INSERT INTO idempotency_keys (key, status, created_at)
VALUES ('abc123', 'in_progress', NOW())
ON CONFLICT (key) DO NOTHING
RETURNING *;

-- If RETURNING returned no rows, this is a retry. Look up the cached result.
-- If it returned a row, this is the first request. Execute the operation.

Stripe's API accepts an Idempotency-Key header on state-changing calls and recommends it for all of them. It's the gold standard. Adopt it for your own payment, ordering, and notification endpoints.


The Outbox Pattern — Reliable Event Publishing

Common problem: you want to update your database AND publish an event to a message queue. Both must happen, or neither.

The naive approach:

Text
1. Save order to database
2. Publish OrderCreated event to Kafka

What goes wrong:
• Step 1 succeeds, step 2 fails (Kafka unreachable). Order exists, no event. Inventory and email never get notified. Inconsistent state.
• Step 2 succeeds, step 1 fails. Event published for an order that doesn't exist. Subscribers process a phantom.

You can't put both in a database transaction because Kafka isn't a database. So how do you make them atomic?

The outbox pattern: write the event into the SAME database, then a separate process reads it out and publishes it.

Text
   ┌─────────────────────────────────────────────────┐
   │  Single database transaction                    │
   │  ┌────────────────┐  ┌────────────────────┐     │
   │  │ orders table   │  │ outbox table       │     │
   │  │ INSERT new row │  │ INSERT event row   │     │
   │  └────────────────┘  └────────────────────┘     │
   │   Either both commit or both rollback           │
   └────────────┬────────────────────────────────────┘
                │
                ▼
        ┌────────────────────┐
        │ Outbox publisher   │  ← polls or uses CDC
        │ (separate process) │
        └────────┬───────────┘
                 │
                 ▼ publishes event, then marks the outbox row done
        ┌────────────────────┐
        │   Kafka / SNS      │
        └────────────────────┘

The outbox table holds events waiting to be published. A separate process (or Change Data Capture from the database log) reads from it and publishes to Kafka, marking each event as published once acknowledged.

Why it works:
• Database transaction is atomic — order and outbox row commit together
• If the publisher crashes before publishing, the next run picks up where it left off
• Events get published at-least-once (consumers must be idempotent — see above!)

This is the standard pattern for any "save data and emit event" workflow, and mature tooling exists for it — Debezium for CDC-based relaying, Eventuous, and transactional-outbox libraries in most ecosystems.


Sagas — Long-Running Distributed Transactions

Some workflows span multiple services and can't fit in one database transaction. Booking a trip might involve flight, hotel, and car services, each with its own database. If the hotel booking fails, you need to refund the flight and release the car.

A saga is a sequence of local transactions, each of which has a compensating action that can undo it.

Text
Forward path:
  1. Book flight    → success
  2. Book hotel     → success
  3. Book car       → FAILS

Compensation path (run in reverse):
  3. (skipped — never succeeded)
  2. Cancel hotel booking
  1. Cancel flight booking

Two flavors:

Choreography saga: each service knows which event triggers it forward and which event triggers compensation. No central coordinator.
  • Pros: no single point of failure; services stay loosely coupled
  • Cons: hard to see the whole flow; cyclic event dependencies are easy to introduce

Orchestrated saga: a saga orchestrator service runs the workflow, calling each step and handling failures by triggering compensations.
  • Pros: clear flow, easy to debug
  • Cons: the orchestrator is a god object that knows about all services
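The orchestrated flavor is easy to sketch: run the steps in order, remember each completed step's compensation, and unwind in reverse on failure. The step and compensation callables here are stand-ins for real service calls:

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; on failure,
    run compensations for the completed steps in reverse."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):   # undo in reverse order
                comp()
            return False
    return True
```

A real orchestrator also persists its position after every step, so a crash mid-saga can resume (or compensate) rather than leave the trip half-booked — and compensations themselves must be idempotent, since they may be retried.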

When to use sagas: workflows that span service boundaries AND need a coherent success/failure outcome. Order processing, multi-step provisioning, refund/return flows.

What sagas don't give you: actual ACID atomicity. There's a window where flight is booked but hotel hasn't been booked yet — and an outsider could see the inconsistency. Saga = "eventual consistency with rollback semantics," not "distributed transaction." For most business processes, that's fine.


Two-Phase Commit — And Why You Shouldn't Use It

Two-Phase Commit (2PC) is the classical solution for distributed transactions. It's worth knowing — mostly so you can recognize when not to use it.

How it works:

Phase 1 (Prepare): A coordinator asks every participant "can you commit this?" Each participant locks the relevant data and responds yes or no.

Phase 2 (Commit/Rollback): If everyone said yes, the coordinator tells them all to commit. If anyone said no, the coordinator tells them all to roll back.

Text
                Coordinator
                    │
    ┌───────────────┼───────────────┐
    │               │               │
    ▼               ▼               ▼
Participant A   Participant B   Participant C
"prepare?"      "prepare?"      "prepare?"
    │               │               │
    ▼               ▼               ▼
   "yes"           "yes"           "yes"
    │               │               │
    └───────────────┼───────────────┘
                    │
                "all yes →
                 commit!"
                    │
    ┌───────────────┼───────────────┐
    ▼               ▼               ▼
  commit          commit          commit
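The coordinator's decision logic itself is small — the hard parts are everything this sketch omits: persisting the decision, timeouts, and coordinator recovery. A minimal sketch, assuming participants expose hypothetical prepare/commit/rollback methods:

```python
def two_phase_commit(participants):
    """Classical 2PC: prepare everyone, then commit or roll back.
    Participants are assumed to expose prepare()/commit()/rollback()."""
    # Phase 1: ask every participant to prepare (lock data, vote yes/no).
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2a: unanimous yes -> tell everyone to commit.
        for p in participants:
            p.commit()
        return True
    # Phase 2b: any no -> tell everyone to roll back.
    for p in participants:
        p.rollback()
    return False
```

Notice the window after `votes` are collected: if the coordinator dies right there, every participant is prepared (and holding locks) with no decision — the blocking failure mode described below.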

Why 2PC sounds great: real distributed atomicity. Either everyone commits or nobody does.

Why 2PC has fallen out of favor:
• Blocking on partial failure — if the coordinator dies after Phase 1 but before Phase 2, participants hold their locks indefinitely while waiting for a decision
• Requires all participants to support the protocol — not all do
• Performance — every transaction has 2x the round trips, plus locks held longer
• The blocking property scales poorly across regions

In practice: most modern distributed systems prefer sagas + idempotency + outbox over 2PC between services. Even databases that support distributed transactions (like Spanner, CockroachDB) layer consensus protocols (Paxos, Raft) under the commit so that no single coordinator failure can block it — not classical, unreplicated 2PC.

The takeaway: when you read about a distributed system supporting transactions, find out HOW. If it's 2PC across services, be cautious. If it's a saga-based eventual-consistency model, that's the modern norm.

