Messaging & Event Systems
Message queues, Kafka, pub/sub, notification design, stream processing, event sourcing and CQRS — the asynchronous backbone of large systems.
When sync calls are the wrong shape
Two services talking via HTTP request/response is the simplest possible coupling. It is also the wrong shape for a surprising amount of real work.
The problems with sync calls show up in three different ways:
- Coupling on availability. If service A calls service B synchronously, then B being down means A is down. The user's request fails because some downstream system that's not even on their critical path is having a bad day.
- Coupling on capacity. If B is slow, A's request threads pile up waiting for B. A can run out of threads or memory even though A itself is fine. This is how outages cascade.
- Fan-out. When one event needs to trigger work in five services (order placed → charge card, ship inventory, send email, update analytics, notify warehouse), making five sync calls in a row is fragile and slow. One failure aborts the whole chain. The user waits for all five to complete.
Messaging gives you an asynchronous middle layer. Producer writes a message; broker stores it; consumer reads when ready. The producer is decoupled in time (consumer can be down), in space (the consumer's address isn't even known to the producer), and in cardinality (one message can fan out to many consumers).
Sync coupling Async via queue
───────────── ───────────────
A ──► B ──► C A ──► [queue] ──► B
↑ │
A fails if B fails └────► C
A waits for B ↑
A knows B's address B and C work independently
The cost is real and worth naming up front: async makes correctness harder. Errors are now far from their cause. Ordering is harder to reason about. Idempotency becomes mandatory. The next sections cover the patterns that tame all of this.
Message queues — the basic primitive
A queue is a FIFO-ish data structure managed by a broker. Producers push messages; consumers pull (or are pushed to). The broker stores messages durably until they are acknowledged. Each message is delivered to exactly one consumer in the consumer group — work is partitioned, not replicated.
The most common queue brokers: RabbitMQ, AWS SQS, Google Cloud Tasks / Pub/Sub Lite, Azure Service Bus, ActiveMQ, Redis Streams (when used with consumer groups). Different feature mixes, same core idea.
The operations that matter:
- Acknowledgement (ack). After a consumer successfully processes a message, it acks. The broker deletes the message. If the consumer crashes before acking, the broker redelivers (after a visibility timeout) so the message is not lost.
- Visibility timeout / lease. When a consumer picks up a message, the broker hides it from other consumers for N seconds. If the consumer doesn't ack within N seconds, the message becomes visible again. This is what gives "at-least-once" delivery — every message is processed at least one time, possibly more.
- Dead-letter queue (DLQ). A message that fails repeatedly (after N retries) is moved to a DLQ for manual investigation. Without a DLQ, a poison message can block a queue forever as the broker keeps redelivering it.
- Backpressure. When consumers are slow, messages pile up. Queue depth is the metric to alert on. The broker can apply backpressure to producers (rejecting writes) or just let the queue grow until consumers catch up.
The delivery guarantee model is critical to understand:
- At-most-once. Fire and forget. Message may be lost. Almost never what you want.
- At-least-once. The message is delivered one or more times. Standard for most queues. Requires idempotent consumers. If "charge card $50" arrives twice, the consumer must charge only once.
- Exactly-once. The message is delivered exactly one time. Heavily marketed; in distributed systems, achievable only with significant constraints (transactional outbox patterns, idempotency keys, deduplication windows). Kafka offers it via transactions; SQS offers FIFO queues with deduplication IDs. In practice, design for at-least-once and make consumers idempotent. It is simpler, more honest, and works.
When to reach for a queue:
- Decoupling a slow operation from a user-facing request (email sending, image processing).
- Smoothing burst traffic into a downstream system that can only process N/sec.
- Reliably retrying work that interacts with flaky external APIs.
- Coordinating handoffs between services that have different uptime characteristics.
Kafka — a different beast
Kafka looks like a message queue if you squint, but it is architecturally different. Where a traditional queue is built around "deliver each message to one consumer and delete," Kafka is built around "durably store an ordered log; consumers read at their own pace; nothing is deleted until retention expires."
The core data model is the topic, divided into partitions. Each partition is an append-only log on disk. Messages within a partition have a strict, monotonically-increasing offset. Consumers read by offset and check-point their position back to the cluster.
Topic: orders, 3 partitions
Partition 0: [msg5][msg8][msg12][msg15][msg17] ──► (grows)
Partition 1: [msg2][msg4][msg9][msg11][msg14] ──► (grows)
Partition 2: [msg1][msg3][msg6][msg7][msg10] ──► (grows)
▲ ▲
│ └── consumer offset for group X (msg10)
└── retained for N days regardless of consumption
Key consequences of this design:
- Replay. A consumer can re-read history ("reset my offset to 7 days ago"). This is impossible in a traditional queue where messages are deleted on ack. It is essential for rebuilding state in event-sourcing and for recovering from consumer bugs.
- Many consumer groups, one topic. Three services can all read the same orders topic independently. Each has its own offset; one's progress doesn't affect the others. This is fundamentally fan-out behaviour, baked into the broker.
- Ordering inside a partition; not across partitions. All messages with key K go to the same partition (
partition = hash(K) % partition_count). Order is preserved within a partition, not across. Choose your key carefully — if you key byuser_id, all of one user's events are ordered; events across users may interleave. - High throughput. Sequential writes to disk; zero-copy reads; per-partition parallelism. Production Kafka clusters routinely handle millions of messages per second.
- Retention by time, not consumption. Default retention is 7 days. Messages older than that are deleted regardless of who has or hasn't consumed them.
Kafka is what you reach for when you need:
- Event-driven architecture as a backbone. All services write events to Kafka; all services that care subscribe. The topic is the integration contract.
- Stream processing. Kafka + ksqlDB / Flink / Spark Streaming processes events in real time (windowed aggregations, joins, anomaly detection).
- Change-data-capture. Debezium or similar tails the database's WAL into Kafka, turning every DB write into an event other systems can react to.
- Log aggregation, audit trails. A durable, replayable, partitioned log is exactly the shape you want.
Kafka is not the right choice when:
- You only need simple task queueing (use SQS / RabbitMQ — much lower operational cost).
- Low latency per message is paramount (Kafka batches for throughput; sub-ms latency is not its game).
- You need point-to-point work distribution with exactly-one consumer per message and complex routing rules (RabbitMQ's exchanges fit this better).
The pragmatic decision: Kafka if the data is an event stream others might want to subscribe to in the future, or if you need replay. A queue otherwise.
Publish-subscribe
Pub/sub is a messaging model where a publisher sends a message to a topic (not to a specific consumer), and any number of subscribers receive it. The publisher does not know or care who the subscribers are.
This is the right model when one event matters to many consumers, with different processing needs:
┌──► email-service (send order confirmation)
├──► inventory-service (decrement stock)
Order ──► ├──► analytics-service (record event)
topic ├──► fraud-service (score for fraud)
└──► recommendations (update user model)
A queue is one-to-one within a consumer group. Pub/sub is one-to-many. Kafka does both depending on whether you have one consumer group or many. Dedicated pub/sub systems (Google Cloud Pub/Sub, AWS SNS, NATS, MQTT for IoT) make it the default semantics.
The properties to understand:
- Subscriber-defined retention. Pub/sub systems usually drop messages once they've been delivered to all known subscribers (or after a TTL). A new subscriber added later does not see history — this is where Kafka's append-log model differs.
- Filtering. Subscribers often specify what they want via topic patterns (
orders.*.placed) or message attributes. The broker does the filtering, not the subscriber, so unwanted messages don't traverse the network. - Push vs pull. Some pub/sub systems push to subscribers (HTTP callbacks, gRPC streams); others have subscribers pull. Pull is more reliable under consumer backpressure; push is simpler but needs careful timeout and retry handling.
A design pattern that comes up constantly: the outbox pattern for reliably publishing events from a service.
Within ONE database transaction:
┌────────────────────────────┐
│ INSERT INTO orders ... │
│ INSERT INTO outbox │
│ (event_type, payload, ...) │
└────────────────────────────┘
Background process:
Read unprocessed rows from outbox
Publish to message broker
Mark row as published
Why this dance? Without it, you face a dirty distributed-systems bug: you commit the order in the DB, then try to publish the event, and the publish fails. You now have an order with no event. Or worse, you publish the event, then the DB commit fails — and a downstream service reacts to an order that doesn't exist. The outbox puts the event-emission inside the database transaction. Whatever the database remembers, the event will eventually fire. Whatever the database forgets, no event was sent. The two stay consistent.
Notification service design
A notification service is the system that gets messages to users — email, SMS, push notifications, in-app banners, sometimes voice calls. It is the canonical use case for asynchronous messaging because every part of it is the wrong shape for sync calls.
The failure modes you're designing around:
- The external providers are flaky. SendGrid, Twilio, FCM, APNs — all have rate limits, occasional outages, and varying latency. Your service cannot wait synchronously for them.
- The fan-out is large. A single "new comment" event might generate notifications for 100 followers. "App update" might generate millions.
- The same notification has many channels. "You have a new message" needs to become: push to phone, in-app banner if app is open, email if push was not acknowledged in 5 minutes. Channel selection is its own logic.
- User preferences gate everything. Don't send marketing emails to users who unsubscribed. Don't send during quiet hours in the user's timezone. Don't send via a channel the user has disabled.
A sensible architecture:
Producer service ──► notifications topic
│
▼
Routing service
(look up user prefs,
decide channels,
per-channel rate limits)
│
┌──────────────┼──────────────┐
▼ ▼ ▼
email queue push queue sms queue
│ │ │
▼ ▼ ▼
SendGrid FCM/APNs Twilio
│ │ │
▼ ▼ ▼
delivery events ──► back to events topic
Few things make this work well:
- Each channel gets its own queue with its own rate limit to respect provider quotas. If one channel is throttled, others keep flowing.
- Idempotency on (user_id, notification_id, channel) so retries don't double-send. The user gets one push for one event, not three because the queue redelivered.
- Templates server-side, parameters with the message — a notification is
{template: "new_follow", to: "user:42", params: {follower_id: 99}}, not pre-rendered HTML. Lets you A/B test, internationalise, and fix typos without a rebuild. - Delivery events flow back into the system. Bounce notifications, opens, clicks — they feed analytics, retry logic, and unsubscribe enforcement.
- Quiet hours and grouping. Don't send each notification immediately; group within short windows where it makes sense ("3 people liked your post" instead of three separate pushes).
The biggest mistake teams make is sending notifications synchronously from the request that triggered the event. Asynchronous from end to end is the only architecture that scales.
Stream processing
Once events flow continuously through Kafka, processing them in real time becomes possible. Stream processing is the discipline of running computations over unbounded data — clicks, transactions, sensor readings, log lines — as it arrives, in seconds rather than the hours of a nightly batch.
The canonical stream-processing operations:
- Filter / map / transform. Drop events you don't care about; reshape ones you do. Stateless and trivial — no different from doing it in a batch.
- Aggregations over windows. Count of clicks per user per 5-minute window. Sum of order amounts per merchant per hour. Stateful — requires the processor to hold per-key counters.
- Joins. Enrich an order event with the user's profile ("orders ⨝ users"). Or join two streams within a time window ("clicks ⨝ purchases within 30 minutes").
- Pattern detection. "User logged in from 3 different countries in 1 hour." CEP (complex event processing) engines specialise in this.
Window types you'll meet:
Tumbling window (fixed, non-overlapping)
├─5min─┤├─5min─┤├─5min─┤
Hopping / sliding window (fixed, overlapping)
├─────10min─────┤
├─────10min─────┤
├─────10min─────┤
Session window (driven by user inactivity gap)
├──active──┤ gap ├──active──┤
The hard parts:
- Late events. A click that happened at 12:01 might arrive at 12:08 because the user's phone was offline. Did the 12:00-12:05 window already "close"? Frameworks handle this with watermarks ("we believe we've seen all events up to time T") and allowed lateness ("keep the window open for 10 more minutes").
- State management. A windowed count of clicks per user needs per-user counters. At scale (10M users), this state lives in the framework (e.g. Flink's RocksDB-backed state) and is checkpointed to durable storage for restart recovery.
- Exactly-once semantics on processing. Counting each event exactly once across restarts requires checkpoint+commit coordination. Flink and Kafka Streams provide this; rolling your own is hard.
Framework choices:
- Kafka Streams — library, not a cluster. Embed in your service. Best when you're already on Kafka and the processing fits in your service's resource budget.
- Apache Flink — full-blown distributed processor. Heaviest, most powerful. The default choice for serious streaming workloads.
- ksqlDB — SQL on Kafka streams. Lightweight, declarative, less powerful than Flink for complex flows.
- Apache Spark Streaming — micro-batched. Easier to reason about; not truly real-time (seconds of latency rather than sub-second).
Reach for streaming when: the value of the result decays with time (fraud detection has minutes; daily reports have hours), you genuinely have continuous data, and a batch every 5 minutes is too slow. Don't reach for streaming when a daily batch suffices — the operational cost is higher than people remember.
Event sourcing and CQRS
Most systems store the current state of things — the user has 47 points, the order is "shipped". Event sourcing flips that: the source of truth is the log of every event that ever happened. "User signed up", "User earned 10 points", "User earned 5 points"... and the current state is what you get by replaying all the events.
Traditional state Event-sourced
───────────────── ─────────────
row: user 42 log for user 42:
balance: 47 [signed_up, earned 10, earned 5,
earned 50, redeemed 18, earned 0,
... derive balance = 47]
Why this is interesting:
- Full audit trail. Every change is recorded with cause and timestamp. Disputes ("why is my balance 47, I should have 50") are answerable from the log.
- Time travel / what-if. Replay events up to any past timestamp; see what the state was. Apply a bug fix to event processing and re-derive corrected state from history.
- Multiple read models from one source of truth. The same event stream feeds the user-facing balance view, the fraud-detection model, the data warehouse, the cache — each derives its own representation.
The costs are real:
- Queries are derived, not direct. To get the current balance, you replay events. Without a cached projection, every read is a fold over the log. Solution: maintain projections (materialised views) that update on each new event.
- Schema evolution gets ugly. Old events were written in v1's shape; new code expects v2. Either keep old code that can read v1, or upcast old events to v2 on read.
- Operationally heavier. Event store, projections, replay infrastructure — more moving parts than a single CRUD table.
CQRS (Command Query Responsibility Segregation) is the closely-related pattern that splits the model used for writes (commands) from the model used for reads (queries). The write side might be normalised, transactional, and small; the read side might be denormalised, eventually consistent, and optimised for the specific UI shapes that need it.
Command ──► validate ──► append events ──► event store
│
▼
projection workers
│
┌───────────────────────┼───────────────────┐
▼ ▼ ▼
read model A (Postgres) read model B (Elasticsearch) ...
▲ ▲
│ │
Query Query
CQRS does not require event sourcing — you can have CQRS with a regular database — but the two pair naturally. Event sourcing gives you the canonical write log; CQRS lets you build whatever read models you want from it.
When to use either:
- Event sourcing: complex business workflows where audit and time-travel are valuable (financial transactions, healthcare records, supply chain). Avoid for simple CRUD; it is heavy.
- CQRS: when reads and writes have wildly different shapes/loads. A search-heavy app whose data writes are simple. An admin panel whose reads need joins across many domains.
Messaging and event-driven architecture are the connective tissue of every large system. The patterns in this module — queues, Kafka, pub/sub, streams, event sourcing — show up in different combinations in nearly every system you'll design from here on. The next module zooms back out to architecture-level patterns: how services fit together, how they fail well, and how they recover.
⁂ Back to all modules