Home
High-Level System Design / Module 5 — Messaging & Event Systems

Messaging & Event Systems

Message queues, Kafka, pub/sub, notification design, stream processing, event sourcing and CQRS — the asynchronous backbone of large systems.


When sync calls are the wrong shape

Two services talking via HTTP request/response is the simplest possible coupling. It is also the wrong shape for a surprising amount of real work.

The problems with sync calls show up in three different ways:

Messaging gives you an asynchronous middle layer. Producer writes a message; broker stores it; consumer reads when ready. The producer is decoupled in time (consumer can be down), in space (the consumer's address isn't even known to the producer), and in cardinality (one message can fan out to many consumers).

text
   Sync coupling                       Async via queue
   ─────────────                       ───────────────
   A ──► B ──► C                       A ──► [queue] ──► B
        ↑                                       │
     A fails if B fails                         └────► C
     A waits for B                              ↑
     A knows B's address                  B and C work independently

The cost is real and worth naming up front: async makes correctness harder. Errors are now far from their cause. Ordering is harder to reason about. Idempotency becomes mandatory. The next sections cover the patterns that tame all of this.

Message queues — the basic primitive

A queue is a FIFO-ish data structure managed by a broker. Producers push messages; consumers pull (or are pushed to). The broker stores messages durably until they are acknowledged. Each message is delivered to exactly one consumer in the consumer group — work is partitioned, not replicated.

The most common queue brokers: RabbitMQ, AWS SQS, Google Cloud Tasks / Pub/Sub Lite, Azure Service Bus, ActiveMQ, Redis Streams (when used with consumer groups). Different feature mixes, same core idea.

The operations that matter:

The delivery guarantee model is critical to understand:

When to reach for a queue:

Kafka — a different beast

Kafka looks like a message queue if you squint, but it is architecturally different. Where a traditional queue is built around "deliver each message to one consumer and delete," Kafka is built around "durably store an ordered log; consumers read at their own pace; nothing is deleted until retention expires."

The core data model is the topic, divided into partitions. Each partition is an append-only log on disk. Messages within a partition have a strict, monotonically-increasing offset. Consumers read by offset and check-point their position back to the cluster.

text
   Topic: orders, 3 partitions

   Partition 0:  [msg5][msg8][msg12][msg15][msg17] ──► (grows)
   Partition 1:  [msg2][msg4][msg9][msg11][msg14]  ──► (grows)
   Partition 2:  [msg1][msg3][msg6][msg7][msg10]   ──► (grows)
                  ▲                       ▲
                  │                       └── consumer offset for group X (msg10)
                  └── retained for N days regardless of consumption

Key consequences of this design:

Kafka is what you reach for when you need:

Kafka is not the right choice when:

The pragmatic decision: Kafka if the data is an event stream others might want to subscribe to in the future, or if you need replay. A queue otherwise.

Publish-subscribe

Pub/sub is a messaging model where a publisher sends a message to a topic (not to a specific consumer), and any number of subscribers receive it. The publisher does not know or care who the subscribers are.

This is the right model when one event matters to many consumers, with different processing needs:

text
            ┌──► email-service        (send order confirmation)
            ├──► inventory-service    (decrement stock)
  Order ──► ├──► analytics-service    (record event)
  topic     ├──► fraud-service        (score for fraud)
            └──► recommendations      (update user model)

A queue is one-to-one within a consumer group. Pub/sub is one-to-many. Kafka does both depending on whether you have one consumer group or many. Dedicated pub/sub systems (Google Cloud Pub/Sub, AWS SNS, NATS, MQTT for IoT) make it the default semantics.

The properties to understand:

A design pattern that comes up constantly: the outbox pattern for reliably publishing events from a service.

text
   Within ONE database transaction:
      ┌────────────────────────────┐
      │ INSERT INTO orders ...      │
      │ INSERT INTO outbox          │
      │   (event_type, payload, ...) │
      └────────────────────────────┘

   Background process:
      Read unprocessed rows from outbox
      Publish to message broker
      Mark row as published

Why this dance? Without it, you face a dirty distributed-systems bug: you commit the order in the DB, then try to publish the event, and the publish fails. You now have an order with no event. Or worse, you publish the event, then the DB commit fails — and a downstream service reacts to an order that doesn't exist. The outbox puts the event-emission inside the database transaction. Whatever the database remembers, the event will eventually fire. Whatever the database forgets, no event was sent. The two stay consistent.

Notification service design

A notification service is the system that gets messages to users — email, SMS, push notifications, in-app banners, sometimes voice calls. It is the canonical use case for asynchronous messaging because every part of it is the wrong shape for sync calls.

The failure modes you're designing around:

A sensible architecture:

text
   Producer service ──► notifications topic
                              │
                              ▼
                       Routing service
                       (look up user prefs,
                        decide channels,
                        per-channel rate limits)
                              │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
          email queue    push queue     sms queue
              │              │              │
              ▼              ▼              ▼
          SendGrid        FCM/APNs       Twilio
              │              │              │
              ▼              ▼              ▼
        delivery events ──► back to events topic

Few things make this work well:

The biggest mistake teams make is sending notifications synchronously from the request that triggered the event. Asynchronous from end to end is the only architecture that scales.

Stream processing

Once events flow continuously through Kafka, processing them in real time becomes possible. Stream processing is the discipline of running computations over unbounded data — clicks, transactions, sensor readings, log lines — as it arrives, in seconds rather than the hours of a nightly batch.

The canonical stream-processing operations:

Window types you'll meet:

text
   Tumbling window  (fixed, non-overlapping)
   ├─5min─┤├─5min─┤├─5min─┤

   Hopping / sliding window  (fixed, overlapping)
   ├─────10min─────┤
      ├─────10min─────┤
         ├─────10min─────┤

   Session window  (driven by user inactivity gap)
   ├──active──┤  gap  ├──active──┤

The hard parts:

Framework choices:

Reach for streaming when: the value of the result decays with time (fraud detection has minutes; daily reports have hours), you genuinely have continuous data, and a batch every 5 minutes is too slow. Don't reach for streaming when a daily batch suffices — the operational cost is higher than people remember.

Event sourcing and CQRS

Most systems store the current state of things — the user has 47 points, the order is "shipped". Event sourcing flips that: the source of truth is the log of every event that ever happened. "User signed up", "User earned 10 points", "User earned 5 points"... and the current state is what you get by replaying all the events.

text
   Traditional state                Event-sourced
   ─────────────────                ─────────────
   row: user 42                     log for user 42:
        balance: 47                   [signed_up, earned 10, earned 5,
                                       earned 50, redeemed 18, earned 0,
                                       ... derive balance = 47]

Why this is interesting:

The costs are real:

CQRS (Command Query Responsibility Segregation) is the closely-related pattern that splits the model used for writes (commands) from the model used for reads (queries). The write side might be normalised, transactional, and small; the read side might be denormalised, eventually consistent, and optimised for the specific UI shapes that need it.

text
   Command ──► validate ──► append events ──► event store
                                                  │
                                                  ▼
                                          projection workers
                                                  │
                          ┌───────────────────────┼───────────────────┐
                          ▼                       ▼                   ▼
                  read model A (Postgres)  read model B (Elasticsearch)  ...
                          ▲                       ▲
                          │                       │
                       Query                   Query

CQRS does not require event sourcing — you can have CQRS with a regular database — but the two pair naturally. Event sourcing gives you the canonical write log; CQRS lets you build whatever read models you want from it.

When to use either:

Messaging and event-driven architecture are the connective tissue of every large system. The patterns in this module — queues, Kafka, pub/sub, streams, event sourcing — show up in different combinations in nearly every system you'll design from here on. The next module zooms back out to architecture-level patterns: how services fit together, how they fail well, and how they recover.


⁂ Back to all modules