Module 4 • Architecture • 22 min read • Updated May 21, 2026

Networking & Infrastructure

Load balancing, DNS and CDNs, API gateways, HTTP/1.1/2/3, WebSockets, service discovery, service mesh — the plumbing that connects every service.

What "infrastructure" actually means here

Every service in a distributed system is reachable through layers of plumbing the application code never sees directly. The user types a URL; DNS turns it into an IP; the request hits a load balancer; the load balancer routes to one of N application servers; that server makes a call to another service through service discovery and a mesh proxy. Or a CDN had already cached the response, and the request never made it past the edge.

This module is about that plumbing. None of it is glamorous, all of it is load-bearing, and the choices you make here determine whether your architecture costs $200/month or $200,000/month at the same traffic level. The themes are the same across every layer: distribute work, fail over fast, keep latency-killing round trips to a minimum, and put the right cache at the right hop.

Load balancing

A load balancer takes traffic destined for a service and distributes it across multiple backend instances of that service. It is the single most important component in any horizontally-scaled architecture, for three reasons:

It removes the SPOF of a single backend. Even if you have ten app servers, you need something in front to route around the dead ones.
It enables zero-downtime deploys. Drain traffic from one instance, deploy the new version, route traffic back. Repeat across the fleet.
It is where TLS termination, request logging, basic rate limiting, and health checking belong — central choke-points where one piece of config touches all traffic.

Load balancers operate at one of two layers of the OSI model:

Layer 4 (transport-level). Routes based on IP and port. Doesn't understand HTTP. Forwards raw TCP/UDP packets. Examples: AWS NLB, GCP TCP/UDP load balancer, HAProxy in TCP mode, IPVS, kube-proxy. Pros: extremely fast (just packet forwarding), works for any TCP-based protocol. Cons: no per-request routing, can't read headers, can't terminate TLS easily.

Layer 7 (application-level). Understands HTTP (or gRPC, or whatever your protocol is). Routes based on URL path, hostname, headers, cookies. Examples: AWS ALB, GCP HTTPS LB, NGINX, Envoy, HAProxy in HTTP mode, Cloudflare. Pros: rich routing (path-based, header-based, weighted, canary), TLS termination, gzip, response transformations. Cons: slightly slower per request than L4, more memory per connection.

Many modern systems put L4 in front of L7: a global L4 anycast layer (cheap, fast, DDoS-resistant) accepts connections and forwards them to a regional L7 layer that does the smart routing.

The routing algorithm is how the load balancer picks which backend gets the next request. Round-robin is the simplest — backend 1, backend 2, backend 3, repeat. Least-connections favours backends with fewer active connections — better when requests have variable duration. Weighted variants let you send 90% of traffic to the stable version and 10% to a canary. Consistent hashing routes the same key (e.g. user ID) to the same backend, useful when backends maintain per-user state.
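The selection strategies above can be sketched in a few lines each. This is a minimal illustration, not how any particular load balancer implements them: the backend names and weights are made up, and the weighted variant shown is the "smooth" weighted round-robin technique, chosen here because it is deterministic and easy to check.

```javascript
// Round-robin: cycle through the backends in order.
function roundRobin(backends) {
  let next = 0;
  return () => {
    const chosen = backends[next];
    next = (next + 1) % backends.length;
    return chosen;
  };
}

// Least-connections: pick the backend with the fewest in-flight requests.
// `active` maps a backend name to its current connection count.
function leastConnections(active) {
  let best = null;
  for (const [name, count] of Object.entries(active)) {
    if (best === null || count < active[best]) best = name;
  }
  return best;
}

// Smooth weighted round-robin: over every `total` picks each backend is
// chosen exactly `weight` times, spread out rather than in one burst.
function weightedRoundRobin(weights) {
  const current = Object.fromEntries(Object.keys(weights).map((k) => [k, 0]));
  const total = Object.values(weights).reduce((a, b) => a + b, 0);
  return () => {
    let best = null;
    for (const name of Object.keys(weights)) {
      current[name] += weights[name];
      if (best === null || current[name] > current[best]) best = name;
    }
    current[best] -= total;
    return best;
  };
}

// Hypothetical fleet of three app servers.
const rr = roundRobin(["app-1", "app-2", "app-3"]);
const rrPicks = [rr(), rr(), rr(), rr()];

const lcPick = leastConnections({ "app-1": 12, "app-2": 3, "app-3": 7 });

// 90% stable, 10% canary, expressed as weights 9 and 1.
const wrr = weightedRoundRobin({ stable: 9, canary: 1 });
const tally = { stable: 0, canary: 0 };
for (let i = 0; i < 100; i++) tally[wrr()] += 1;

console.log(rrPicks, lcPick, tally);
```

Real balancers usually pick weighted targets randomly or per connection rather than in a fixed cycle; the split converges to the same ratio either way.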

Health checks are how the balancer knows a backend is alive. Active checks: the balancer periodically pings a /health endpoint on each backend and removes any backend that fails. Passive checks: the balancer notices that requests to a backend are timing out or returning 5xx and stops routing to it. Both are needed in practice. The classic mistake is a /health endpoint that only checks the process is running; meanwhile the database is unreachable and every real request returns 500. The health check should exercise the critical downstream dependencies.
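Both kinds of check fit in a short sketch. Everything here is illustrative: the dependency names, the threshold of three failures, and the synchronous probes are assumptions (real probes would be asynchronous calls with timeouts).

```javascript
// Deep health check: report unhealthy if any critical dependency is down.
// `checks` maps a dependency name to a probe returning true when reachable.
function healthStatus(checks) {
  const failed = Object.entries(checks)
    .filter(([, probe]) => !probe())
    .map(([name]) => name);
  return { status: failed.length === 0 ? 200 : 503, failed };
}

// Passive check: stop routing to a backend after N consecutive failures.
// A later success (for example from an active probe) resets the count.
class PassiveHealth {
  constructor(threshold = 3) {
    this.threshold = threshold;
    this.failures = new Map(); // backend -> consecutive failure count
  }
  record(backend, ok) {
    this.failures.set(backend, ok ? 0 : (this.failures.get(backend) ?? 0) + 1);
  }
  isHealthy(backend) {
    return (this.failures.get(backend) ?? 0) < this.threshold;
  }
}

// The "classic mistake": the process is up, so a shallow check passes.
const shallow = healthStatus({ process: () => true });
// A deep check notices that the database is unreachable.
const deep = healthStatus({ process: () => true, database: () => false });

const tracker = new PassiveHealth(3);
for (const ok of [false, false, false]) tracker.record("app-2", ok);
tracker.record("app-1", false); // one failure is below the threshold
const ejected = !tracker.isHealthy("app-2");
const stillServing = tracker.isHealthy("app-1");

console.log(shallow, deep, ejected, stillServing);
```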

Sticky sessions route the same client to the same backend (via cookie, IP, or hashed header). They are sometimes necessary (legacy apps that keep session state in memory) but they are also a code smell: they tie your scaling to client behaviour and make graceful shutdown harder. Prefer stateless backends with a shared session store.

DNS and CDN architecture

DNS turns names into IP addresses. A user types electrominds.in; their resolver walks the DNS hierarchy (root → .in → electrominds.in) and returns an IP. The request then goes to that IP. The whole process is invisible to the user, takes 10-50 ms, and is one of the most-cached pieces of infrastructure on the planet.

The DNS records you'll meet in practice:

A / AAAA — name maps to IPv4 / IPv6 address.
CNAME — name maps to another name. Useful for pointing your subdomain at a third-party service (api.example.com → api.example.cloudfront.net).
MX — mail server for the domain.
TXT — arbitrary text. Used for SPF, DKIM, domain verification by SaaS services.
NS — which DNS servers are authoritative for this zone.

Two properties of DNS that matter for architecture:

Caching. Every record has a TTL. Resolvers cache the answer for that long. A 5-minute TTL means changes take up to 5 minutes to propagate. A 24-hour TTL means a misconfigured record is your problem for a day. Use short TTLs (60-300 seconds) for anything that might fail over; longer TTLs for stable records.
GeoDNS / latency-based routing. Modern authoritative DNS (Route 53, NS1, Cloudflare) can return different IPs to different resolvers based on the resolver's geographic location or measured latency. This is how a single hostname can route users to the nearest of your three regional deployments.
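The effect of TTL on failover time is easier to see with a toy resolver cache. This is a sketch under stated assumptions: `lookup` stands in for a real upstream query, the addresses come from a documentation range, and the clock is injected so the timeline can be replayed.

```javascript
// A resolver-style cache that honours record TTLs.
class TtlCache {
  constructor(lookup, now = () => Date.now()) {
    this.lookup = lookup;
    this.now = now;
    this.entries = new Map(); // name -> { address, expiresAt }
  }
  resolve(name) {
    const hit = this.entries.get(name);
    if (hit && hit.expiresAt > this.now()) return hit.address; // still fresh
    const { address, ttlSeconds } = this.lookup(name);
    this.entries.set(name, { address, expiresAt: this.now() + ttlSeconds * 1000 });
    return address;
  }
}

// Simulated timeline in milliseconds; the record carries a 5-minute TTL.
let clockMs = 0;
let authoritative = { address: "203.0.113.10", ttlSeconds: 300 };
const cache = new TtlCache(() => authoritative, () => clockMs);

const before = cache.resolve("api.example.com"); // cached for 300 s

// Failover: the authoritative answer changes right after the first lookup.
authoritative = { address: "203.0.113.20", ttlSeconds: 300 };

clockMs = 299000;
const stale = cache.resolve("api.example.com"); // still the old address

clockMs = 300000;
const fresh = cache.resolve("api.example.com"); // TTL expired, re-resolved

console.log(before, stale, fresh);
```

With a 300-second TTL the client keeps hitting the dead address for almost the full five minutes, which is the argument for short TTLs on anything that may fail over.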

CDNs build on this. A CDN operates hundreds of edge POPs (Points of Presence) around the world. Your CDN-fronted hostname resolves, via Anycast or GeoDNS, to the nearest POP. The POP serves the request from its cache; on miss, it fetches from your origin.

text

   User in Mumbai                          User in Berlin
        │                                       │
        ▼                                       ▼
   Mumbai POP                              Frankfurt POP
        │                                       │
        └────── (miss) ──► Origin server ◄──── (miss) ──┘
                          (one place)

Key design points for CDN architecture:

Static assets (images, JS, CSS, fonts, video) — cache aggressively, ideally with content-hashed filenames so cache TTLs can be "forever."
HTML and API responses — cache more carefully, with short TTLs and clear cache keys.
TLS termination at the edge. The CDN terminates TLS for you and re-encrypts (or doesn't) to your origin. This keeps cert management central and lets the CDN serve from already-warm TLS sessions.
DDoS protection. Edge POPs absorb attack traffic with capacity your origin will never have. Cloudflare and AWS Shield routinely absorb terabit-scale attacks.
WAF (Web Application Firewall). Block bad requests at the edge (SQL injection patterns, known bad IPs, regex matches on request bodies). Cheap insurance.
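The first two points translate into Cache-Control headers set at the origin. The sketch below is one possible policy, not a recommendation for every site: the filename pattern, the /api/ prefix, and the TTL values are assumptions, and the API rule assumes responses are not user-specific.

```javascript
// Matches content-hashed filenames such as app.3f9a1c7e.js (assumed pattern:
// a dot, at least 8 hex characters, then a known static extension).
const HASHED_ASSET = /\.[0-9a-f]{8,}\.(js|css|png|jpg|svg|woff2)$/;

function cacheControlFor(path) {
  if (HASHED_ASSET.test(path)) {
    // The URL changes whenever the content does, so cache "forever".
    return "public, max-age=31536000, immutable";
  }
  if (path.startsWith("/api/")) {
    // Shared caches (the CDN) may keep it briefly; browsers treat it as stale.
    return "public, max-age=0, s-maxage=30";
  }
  // HTML and everything else: short TTL, revalidate afterwards.
  return "public, max-age=60, must-revalidate";
}

const assetPolicy = cacheControlFor("/static/app.3f9a1c7e.js");
const apiPolicy = cacheControlFor("/api/products");
const htmlPolicy = cacheControlFor("/index.html");

console.log(assetPolicy, apiPolicy, htmlPolicy);
```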

The combination of DNS + CDN means a global, fast, attack-resistant front door for your application that you would not have built yourself. For any consumer-facing service, this layer should be the default, not the optimisation.

API gateway

An API gateway sits between clients and your backend services. It is the single entry point for an API or for an entire microservices architecture. The gateway handles concerns that would otherwise be duplicated across every service: authentication, rate limiting, request routing, request/response transformation, observability, and contract enforcement.

The shape is familiar:

text

   Client ──► API Gateway ──┬──► Service A
                            ├──► Service B
                            └──► Service C

What the gateway typically owns:

Auth. Verify the JWT or API key once at the gateway; pass user identity downstream as a trusted header. Services don't each implement auth.
Rate limiting. Per-API-key, per-user, per-endpoint. Sliding window or token bucket. Reject before the request reaches a backend.
Routing. Path-based (/users/* → users-svc), header-based (versioning via Accept: application/vnd.api.v2+json), or weighted (canary deployments).
Request transformation. Strip internal headers, rewrite paths, add tracing IDs, transform protocols (REST in, gRPC out).
Response aggregation (API composition). Combine responses from multiple services into one client-facing response. Reduces round trips for mobile clients on high-latency networks.
Observability. Central place to log every request, emit metrics per endpoint, and emit distributed tracing spans.
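Of the responsibilities listed above, rate limiting is the most algorithmic. A token bucket per API key can be sketched as follows; the capacity, refill rate, and key names are illustrative, and a production gateway would typically keep the buckets in a shared store rather than in process memory.

```javascript
// Token bucket: holds up to `capacity` tokens, refilled continuously at
// `refillPerSecond`. Each request spends one token or is rejected.
class TokenBucket {
  constructor({ capacity, refillPerSecond, now = () => Date.now() }) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity;
    this.lastRefill = now();
  }
  tryRemove() {
    const t = this.now();
    const elapsedSeconds = (t - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = t;
    if (this.tokens < 1) return false; // reject before reaching a backend
    this.tokens -= 1;
    return true;
  }
}

// One independent bucket per API key.
class RateLimiter {
  constructor(options) {
    this.options = options;
    this.buckets = new Map();
  }
  allow(apiKey) {
    if (!this.buckets.has(apiKey)) this.buckets.set(apiKey, new TokenBucket(this.options));
    return this.buckets.get(apiKey).tryRemove();
  }
}

// Injected clock (milliseconds) so the example is deterministic.
let nowMs = 0;
const limiter = new RateLimiter({ capacity: 3, refillPerSecond: 1, now: () => nowMs });

const burst = [1, 2, 3, 4].map(() => limiter.allow("key-a")); // 4th is rejected
const otherKey = limiter.allow("key-b");                       // separate bucket

nowMs = 2000; // two seconds later, two tokens have been refilled
const afterWait = [1, 2, 3].map(() => limiter.allow("key-a"));

console.log(burst, otherKey, afterWait);
```

The bucket allows short bursts up to its capacity while holding the long-run rate to the refill rate, which is why it is a common default for per-key limits.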

Common implementations: AWS API Gateway, Kong, Apigee, Tyk, KrakenD, Envoy (often via Istio), NGINX with the right config. For a simple HTTP front-end with auth and rate-limiting, NGINX or Envoy + a small bit of Lua is enough. For OAuth flows, developer portals, and per-customer plans, a managed service or Kong pays off.

Backend for Frontend (BFF) is a closely related pattern: instead of one gateway for all clients, you have a gateway per client type. The mobile BFF returns a slim payload; the web BFF includes admin-only fields; the partner BFF speaks a different contract. Each BFF can be owned by the team that owns that client.

A caution: an API gateway can drift into being a god service that contains business logic. Resist this. The gateway's job is plumbing — cross-cutting concerns and routing. Domain logic stays in the services. If you find yourself implementing "if the user's plan is premium, fetch the premium-data service" in the gateway, you have crossed the line.

HTTP/1.1 vs HTTP/2 vs HTTP/3

HTTP has had three major revisions in three decades, and each one changed how clients and servers should be architected. The differences matter because they decide how much concurrency you get per connection, how many connections clients open, and how latency-sensitive your design needs to be.

HTTP/1.1 (1997). One request per connection at a time. Pipelining exists in the spec but is broken in practice. Browsers open 6-8 parallel TCP connections per origin to fake concurrency. Each connection needs its own TCP handshake and (with HTTPS) its own TLS handshake, adding hundreds of milliseconds on a cold connection.

Key property: head-of-line blocking on the connection. If request 1 is slow, requests 2-N on the same connection wait.

HTTP/2 (2015). Single TCP connection per origin, with many concurrent streams multiplexed inside it. Binary framing instead of text. Server push (now deprecated in browsers). Header compression (HPACK).

text

   HTTP/1.1: 6 connections × 1 request at a time
   ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐
   │R1 │ │R2 │ │R3 │ │R4 │ │R5 │ │R6 │
   └───┘ └───┘ └───┘ └───┘ └───┘ └───┘
   ↳ 6 TCP setups, 6 TLS handshakes

   HTTP/2: 1 connection × N concurrent streams
   ┌─────────────────────────────────────┐
   │ stream 1: R1 ──────────────────────│
   │ stream 2: R2 ──────────────────────│
   │ stream 3: R3 ──────────────────────│
   │ stream 4: R4 ──────────────────────│
   └─────────────────────────────────────┘
   ↳ 1 TCP setup, 1 TLS handshake, full concurrency

For API traffic and microservices (think gRPC, which runs on HTTP/2), this is a transformative speedup. The remaining issue: head-of-line blocking on the TCP layer. If one packet is lost, the whole TCP connection stalls until it is retransmitted, even though only one stream needed that packet. TCP doesn't know about streams.

HTTP/3 (2022). Runs on QUIC instead of TCP. QUIC is a new transport protocol built on UDP, with TLS 1.3 baked in. Each stream has its own ordering, so a lost packet on stream 1 doesn't stall stream 2. The TLS handshake is folded into the QUIC handshake; in many cases you get 0-RTT connection setup with a server you have talked to before.
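The difference in head-of-line blocking can be shown with a deliberately simplified model: one stream frame per packet, one packet lost, and the question of which packets that did arrive can be handed to the application before the retransmission shows up. Packet names and stream numbers are made up.

```javascript
// Packets are listed in send order; `lost` is the index of the missing one.
function deliverable(packets, lost, transport) {
  return packets
    .filter((p, i) => {
      if (i === lost) return false; // has not arrived yet
      if (i < lost) return true;    // arrived before the gap
      // After the gap: TCP holds back everything, because it delivers one
      // ordered byte stream. QUIC holds back only the lost packet's stream.
      return transport === "quic" && p.stream !== packets[lost].stream;
    })
    .map((p) => p.id);
}

const packets = [
  { id: "s1-a", stream: 1 },
  { id: "s2-a", stream: 2 }, // this one is lost
  { id: "s1-b", stream: 1 },
  { id: "s2-b", stream: 2 },
  { id: "s3-a", stream: 3 },
];

const overTcp = deliverable(packets, 1, "tcp");
const overQuic = deliverable(packets, 1, "quic");

console.log(overTcp);  // [ 's1-a' ]
console.log(overQuic); // [ 's1-a', 's1-b', 's3-a' ]
```

Over TCP, streams 1 and 3 wait for a packet they never needed; over QUIC only stream 2 waits.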

Key property for designers: connection migration. A QUIC connection is identified by a connection ID, not by an IP/port four-tuple like TCP. If your phone switches from wifi to LTE, the connection survives the IP change. Long-lived QUIC sessions to a mobile app are practical in a way TCP ones were not.
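The handshake costs mentioned in this section can be put into numbers. The sketch counts round trips before the first request can be sent on a cold connection: one for the TCP handshake, two more for a full TLS 1.2 handshake or one more for TLS 1.3, one in total for QUIC, and zero for QUIC 0-RTT resumption. The 80 ms round-trip time is an arbitrary example value.

```javascript
// Round trips needed before the first request byte can be sent.
const SETUP_RTTS = {
  "tcp+tls1.2": 1 + 2, // TCP handshake, then full TLS 1.2 handshake
  "tcp+tls1.3": 1 + 1, // TCP handshake, then TLS 1.3 handshake
  "quic": 1,           // transport and TLS 1.3 handshakes combined
  "quic-0rtt": 0,      // resuming with a server seen before
};

function setupLatencyMs(transport, rttMs) {
  return SETUP_RTTS[transport] * rttMs;
}

const rttMs = 80; // e.g. a mobile client far from the nearest edge
const setup = Object.fromEntries(
  Object.keys(SETUP_RTTS).map((t) => [t, setupLatencyMs(t, rttMs)])
);

console.log(setup);
// { 'tcp+tls1.2': 240, 'tcp+tls1.3': 160, quic: 80, 'quic-0rtt': 0 }
```

An HTTP/1.1 browser pays the TCP-plus-TLS cost on each of its parallel connections, although those handshakes overlap in time rather than adding up.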

When to care about which version, in practice:

Public web traffic — your CDN does this for you; just make sure HTTP/2 and HTTP/3 are enabled. The clients are modern browsers.
Server-to-server (especially gRPC) — HTTP/2 is standard; HTTP/3 is starting to appear (gRPC over QUIC) but adoption lags.
Long-lived client connections (WebSockets, SSE) — discussed in the next section. These don't benefit from HTTP/2 multiplexing the way short request/response cycles do.
Origin connections — if your gateway talks to ten microservices, use HTTP/2 between them; you save a lot of TCP setup churn under load.

The practical rule: enable HTTP/2 everywhere on the public-facing edge today; HTTP/3 wherever your CDN supports it; HTTP/2 internally between services. Don't go out of your way to support HTTP/1.1 for new internal traffic — it costs you concurrency.

WebSockets and real-time communication

Request/response is the bread-and-butter pattern of the web, but it cannot do everything. Some features need the server to push to the client without the client asking — chat, live notifications, collaborative editing, real-time dashboards, multiplayer game state. There are four ways to do this; understanding the tradeoffs decides which one fits.

1. Polling. The client asks every few seconds. "Anything new?" "No." "Anything new?" "No." Trivial to implement; terrible at scale. 10,000 clients polling every 5 seconds is 2,000 RPS of pointless work.

2. Long polling. The client asks; the server holds the request open until it has something to send (or until a timeout). Better than naive polling, but each notification still costs an HTTP round trip, and the server has to manage many slow connections.

3. Server-Sent Events (SSE). A standard HTTP response that the server keeps open, sending text events one after another over the same connection. One-way (server → client). Built on plain HTTP, so it passes through most proxies and load balancers without special support (a proxy that buffers responses needs buffering turned off for the stream). The browser API is EventSource. Use for: notifications, live updates, dashboard streams, anything one-way.

javascript

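// Open the event stream; the browser reconnects automatically if it drops.
const es = new EventSource('/events');
// e.data arrives as a string; updateUI is the page's own render function.
es.onmessage = (e) => updateUI(JSON.parse(e.data));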
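On the server side, each event is a few lines of text in the text/event-stream format, ended by a blank line. A small formatter makes the wire format concrete; the function name and the sample payload are illustrative, and the HTTP plumbing (sending Content-Type: text/event-stream and keeping the response open) is left out.

```javascript
// Serialise one event. Each field is a "name: value" line; a blank line
// terminates the event.
function formatSseEvent({ data, event, id }) {
  const lines = [];
  if (event) lines.push(`event: ${event}`);
  if (id !== undefined) lines.push(`id: ${id}`);
  // A payload containing newlines must be split into several data: lines.
  for (const part of String(data).split("\n")) lines.push(`data: ${part}`);
  return lines.join("\n") + "\n\n";
}

const frame = formatSseEvent({ id: 7, data: JSON.stringify({ unread: 3 }) });
const multiline = formatSseEvent({ data: "line one\nline two" });

console.log(JSON.stringify(frame));
console.log(JSON.stringify(multiline));
```

Events sent without an event: field are the ones delivered to the client's onmessage handler; named events need addEventListener with the matching name.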

4. WebSockets. A full-duplex (bidirectional) connection upgraded from an HTTP request. Once established, either side can send messages at any time, with minimal overhead per message. The URL scheme is ws:// or wss:// (encrypted). Use for: chat, collaborative editing, multiplayer games, anything where the client also has frequent things to say.

The architecture cost of long-lived connections is what people underestimate. Each open WebSocket is a TCP connection, a TLS session, server memory, a file descriptor, and (often) a slot in your load balancer's connection table. 100,000 concurrent WebSockets is a non-trivial deployment.

Key concerns:

Connection limits. Each Linux process has a file-descriptor limit (the default soft limit is often just 1,024; raise it). Each load balancer has connection limits and per-target-instance limits.
Sticky routing. A WebSocket can't be load-balanced per-message — once connected, the client is bound to that one backend instance. If that instance dies, the client reconnects (and ideally that reconnect routes to a different instance).
Horizontal scaling needs a backplane. If two clients are in the same chat room but connected to different backend instances, the instances need a way to share messages. Redis Pub/Sub, Kafka, or NATS commonly fill this role.

text

   Client A ──► Server 1 ──┐
                            ├──► Redis Pub/Sub ──► fan-out
   Client B ──► Server 2 ──┘

Heartbeats and reconnection. Networks drop idle connections. Send a small ping every 30 seconds to detect dead connections and trigger reconnection. The client side needs exponential backoff on reconnect to avoid thundering-herding the server after an outage.
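Two of the concerns above, the backplane and reconnect backoff, lend themselves to a short sketch. The Backplane class is an in-memory stand-in for something like Redis Pub/Sub, client connections are modelled as plain arrays acting as inboxes, and the backoff uses the "full jitter" variant with illustrative base and cap values.

```javascript
// In-memory stand-in for a pub/sub backplane.
class Backplane {
  constructor() {
    this.subscribers = new Map(); // channel -> Set of callbacks
  }
  subscribe(channel, callback) {
    if (!this.subscribers.has(channel)) this.subscribers.set(channel, new Set());
    this.subscribers.get(channel).add(callback);
  }
  publish(channel, message) {
    for (const cb of this.subscribers.get(channel) ?? []) cb(message);
  }
}

// Each server instance can deliver only to its own connected clients.
class ChatServer {
  constructor(backplane) {
    this.backplane = backplane;
    this.clients = new Map(); // room -> array of client inboxes
  }
  join(room, inbox) {
    if (!this.clients.has(room)) {
      this.clients.set(room, []);
      // First local member: start listening for this room on the backplane.
      this.backplane.subscribe(room, (msg) => {
        for (const box of this.clients.get(room)) box.push(msg);
      });
    }
    this.clients.get(room).push(inbox);
  }
  send(room, message) {
    this.backplane.publish(room, message); // fan out via the backplane
  }
}

// Reconnect delay: exponential growth, capped, with a random spread so that
// clients do not all return at the same instant after an outage.
function reconnectDelayMs(attempt, { baseMs = 1000, capMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

const bus = new Backplane();
const server1 = new ChatServer(bus);
const server2 = new ChatServer(bus);
const inboxA = []; // client A, connected to server 1
const inboxB = []; // client B, connected to server 2
server1.join("room-42", inboxA);
server2.join("room-42", inboxB);
server1.send("room-42", "hello from A");

// Fixed "random" value so the delays below are reproducible.
const delays = [0, 3, 10].map((n) => reconnectDelayMs(n, { random: () => 0.5 }));

console.log(inboxA, inboxB, delays);
```

In this sketch the sender also receives its own message through the backplane; a real chat server would decide whether to echo or filter it.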

The pragmatic rule: pick SSE if you only need server → client. Pick WebSockets only when you genuinely need bidirectional, frequent, low-latency messaging. The implementation and operational complexity of WebSockets is significantly higher.

Service discovery

In a microservices architecture, services need to find each other. "The orders service needs to call the inventory service" requires knowing which IP and port the inventory service is currently running on — and the answer changes every time an instance is added, removed, or moves to a different host.

Hard-coded IPs work for two services on two known machines. Past that, you need a discovery mechanism.

Two patterns:

1. Client-side discovery. The calling service queries a registry, gets a list of instances, and picks one (using its own load balancing logic). Examples: Netflix Eureka, Consul with a smart client.

text

   Client ──► Registry ──► [10.0.1.5:8080, 10.0.1.7:8080]
                                       │
                                       ▼
                                pick one, call it

2. Server-side discovery. The client calls a virtual endpoint; a load balancer or proxy resolves it to a backend. Examples: Kubernetes Services, AWS ALB target groups, Envoy with a service registry.

text

   Client ──► inventory-svc:80 ──► (proxy looks up) ──► backend

Server-side is more common today because the orchestrator (Kubernetes) takes care of it. The client just calls http://inventory-svc:80; cluster DNS resolves the name to the Service's virtual IP, and Kubernetes's kube-proxy routes that IP to a ready pod.

The registry itself is a state store with strong consistency requirements (you don't want two clients seeing different views of the live instances). Common backends: etcd (used by Kubernetes), Consul, ZooKeeper. Their consensus algorithms (Raft in etcd and Consul, covered in Module 7; the closely related Zab in ZooKeeper) are the same machinery you'd use to build the registry from scratch.

Health checks drive what shows up in the registry. An instance registers itself on startup, sends regular heartbeats, and is removed if heartbeats stop. The registry's consensus protocol ensures every other node sees the same removal at roughly the same time.
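The register, heartbeat, and expire cycle can be sketched as a single in-memory registry. This leaves out the hard part (replicating the state across nodes with a consensus protocol); the class name, the 15-second TTL, and the service name are assumptions for illustration.

```javascript
class Registry {
  constructor({ ttlMs = 15000, now = () => Date.now() } = {}) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.instances = new Map(); // "service|address" -> last heartbeat time
  }
  // Called once on startup and then periodically by each instance.
  heartbeat(service, address) {
    this.instances.set(`${service}|${address}`, this.now());
  }
  // Live instances are those whose last heartbeat is within the TTL.
  lookup(service) {
    const live = [];
    for (const [key, seenAt] of this.instances) {
      const [name, address] = key.split("|");
      if (name !== service) continue;
      if (this.now() - seenAt > this.ttlMs) this.instances.delete(key); // expired
      else live.push(address);
    }
    return live;
  }
}

// Injected clock (milliseconds) so the timeline is reproducible.
let timeMs = 0;
const registry = new Registry({ ttlMs: 15000, now: () => timeMs });

registry.heartbeat("inventory", "10.0.1.5:8080");
registry.heartbeat("inventory", "10.0.1.7:8080");
const atStart = registry.lookup("inventory");

timeMs = 10000;
registry.heartbeat("inventory", "10.0.1.5:8080"); // 10.0.1.7 has gone quiet

timeMs = 20000;
const later = registry.lookup("inventory");

console.log(atStart, later);
```

A client doing client-side discovery would call lookup and then apply its own load-balancing choice to the returned list.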

DNS as service discovery. Some setups skip the dedicated registry and use DNS: inventory.svc.internal returns a list of A records, one per healthy instance. Kubernetes offers this through CoreDNS for headless Services (a regular Service name resolves to a single virtual IP instead). The downside is DNS caching: a stale resolver gets stale answers. Short TTLs partially fix this, but pure-DNS service discovery is a tradeoff against the explicit registry approach.

Service mesh — when proxies become a layer

A service mesh is a layer of network proxies, one per service instance, that handles all in-cluster service-to-service traffic. Each application doesn't talk directly to its peers — it talks to a local proxy (the "sidecar") which talks to the peer's proxy. The proxies handle TLS, retries, timeouts, circuit breaking, observability, and traffic shaping — uniformly, for every service, without touching application code.

text

   Service A pod                 Service B pod
   ┌─────────────────┐           ┌─────────────────┐
   │  app  │ sidecar │ ──TLS──►  │ sidecar │  app  │
   └─────────────────┘           └─────────────────┘
        ▲       │
        │       └── traffic policy from control plane
        │
   Control plane (Istio, Linkerd, Consul Connect)

What you get from a mesh:

Mutual TLS everywhere. Every service-to-service call is encrypted and authenticated, without each app implementing TLS. Identity comes from the platform (a cryptographic identity per pod).
Uniform retries, timeouts, circuit breakers. Configured centrally, applied per route. (These patterns are covered in Module 6.)
Traffic shaping. Route 5% of requests to v2 of a service; mirror traffic to a test environment; do canary rollouts.
Observability. Every proxy emits the same metrics (request count, latency, error rate), tags spans with the same conventions, and produces a unified picture of the topology.

The cost is real:

Operational complexity. The control plane is another system to run, upgrade, and reason about. Istio's reputation for being heavy is earned.
Latency overhead. Each request now goes through two extra proxies. Linkerd's overhead is ~1 ms; Istio's depends heavily on config. For a 5-ms internal call, an extra 1-2 ms is a 20-40% increase.
Resource overhead. A sidecar per pod is ~50-100 MB of memory and some CPU. At 1,000 pods, that's 50-100 GB of memory spent on proxies.

When a mesh is worth it: many services (~20+), strong security requirements (mTLS, zero-trust networking), polyglot stacks where you can't standardise a single retry library across languages, complex traffic shaping for canary deploys. Below that bar, an API gateway plus a good HTTP client library in each service does most of the same work with much less operational cost.

Kubernetes is where most of this lives today. The cluster handles service discovery (DNS), load balancing (kube-proxy + Services), and deployment lifecycle (Deployments, Pods). The mesh runs on top of Kubernetes; the service registry and basic LB are below the mesh. Knowing which problem each layer solves stops the mistake of trying to make Kubernetes do mesh things, or vice versa.

The summary picture, top to bottom: CDN at the edge, DNS + global LB to direct to a region, API gateway in front of the cluster, service mesh inside the cluster, services talking to each other through proxies. Each layer earns its place by solving one well-defined problem. The next module is about how services talk asynchronously — queues, Kafka, pub/sub, streams — for the cases where direct HTTP calls are the wrong shape.
