Networking & Infrastructure

Load balancing, DNS and CDNs, API gateways, HTTP/1.1/2/3, WebSockets, service discovery, service mesh — the plumbing that connects every service.


What "infrastructure" actually means here

Every service in a distributed system is reachable through layers of plumbing the application code never sees directly. The user types a URL; DNS turns it into an IP; the request hits a load balancer; the load balancer routes to one of N application servers; that server makes a call to another service through service discovery and a mesh proxy; somewhere a CDN cached the response and the request never made it past the edge.

This module is about that plumbing. None of it is glamorous, all of it is load-bearing, and the choices you make here determine whether your architecture costs $200/month or $200,000/month at the same traffic level. The themes are the same across every layer: distribute work, fail over fast, keep latency-killing round trips to a minimum, and put the right cache at the right hop.

Load balancing

A load balancer takes traffic destined for a service and distributes it across multiple backend instances of that service. It is the single most important component in any horizontally-scaled architecture, for three reasons:

Load balancers operate at one of two layers of the OSI model:

Layer 4 (transport-level). Routes based on IP and port. Doesn't understand HTTP. Forwards raw TCP/UDP packets. Examples: AWS NLB, GCP TCP/UDP load balancer, HAProxy in TCP mode, IPVS, kube-proxy. Pros: extremely fast (just packet forwarding), works for any TCP-based protocol. Cons: no per-request routing, can't read headers, can't terminate TLS easily.

Layer 7 (application-level). Understands HTTP (or gRPC, or whatever your protocol is). Routes based on URL path, hostname, headers, cookies. Examples: AWS ALB, GCP HTTPS LB, NGINX, Envoy, HAProxy in HTTP mode, Cloudflare. Pros: rich routing (path-based, header-based, weighted, canary), TLS termination, gzip, response transformations. Cons: slightly slower per request than L4, more memory per connection.

Most modern systems put L4 in front of L7: a global L4 anycast layer (cheap, fast, DDoS-resistant) accepts connections and forwards them to a regional L7 layer that does the smart routing.

The routing algorithm is how the load balancer picks which backend gets the next request. Round-robin is the simplest — backend 1, backend 2, backend 3, repeat. Least-connections favours backends with fewer active connections — better when requests have variable duration. Weighted variants let you send 90% of traffic to the stable version and 10% to a canary. Consistent hashing routes the same key (e.g. user ID) to the same backend, useful when backends maintain per-user state.
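The four algorithms above can be sketched in a few lines each. This is a minimal illustration, not how any particular load balancer implements them: the backend objects (`name`, `weight`, `active`), the FNV-1a hash, and the 50 virtual nodes per backend are all assumptions chosen for the example.

```javascript
// Round-robin: cycle through the backends in order.
function makeRoundRobin(backends) {
  let next = 0;
  return () => {
    const chosen = backends[next];
    next = (next + 1) % backends.length;
    return chosen;
  };
}

// Least-connections: pick the backend with the fewest in-flight requests.
function leastConnections(backends) {
  return backends.reduce((best, b) => (b.active < best.active ? b : best));
}

// Weighted: pick proportionally to each backend's weight.
// `rand` is injectable so the choice can be made deterministic.
function weightedPick(backends, rand = Math.random) {
  const total = backends.reduce((sum, b) => sum + b.weight, 0);
  let point = rand() * total;
  for (const b of backends) {
    point -= b.weight;
    if (point < 0) return b;
  }
  return backends[backends.length - 1]; // guard against rounding at the edge
}

// FNV-1a, a simple 32-bit string hash (an arbitrary choice for this sketch).
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Consistent hashing: place each backend on a ring at several points;
// a key goes to the first point at or after its own hash, wrapping around.
function makeHashRing(names, replicas = 50) {
  const ring = [];
  for (const name of names) {
    for (let r = 0; r < replicas; r++) {
      ring.push({ point: fnv1a(`${name}#${r}`), name });
    }
  }
  ring.sort((a, b) => a.point - b.point);
  return (key) => {
    const h = fnv1a(key);
    // Linear scan for clarity; a real ring would use binary search.
    const hit = ring.find((entry) => entry.point >= h);
    return (hit || ring[0]).name;
  };
}

// Hypothetical 90/10 canary split and a per-user sticky lookup.
const pool = [
  { name: 'stable', weight: 9, active: 0 },
  { name: 'canary', weight: 1, active: 0 },
];
const backendForUser = makeHashRing(['b1', 'b2', 'b3']);
console.log(weightedPick(pool).name, backendForUser('user-42'));
```

The property that makes the ring "consistent" is that removing one backend only remaps the keys that were on that backend; every other key keeps its assignment.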

Health checks are how the balancer knows a backend is alive. Active checks: the balancer periodically requests a /health endpoint on each backend and removes any backend that fails. Passive checks: the balancer notices that requests to a backend are timing out or returning 5xx and stops routing to it. Both are needed in practice. The classic mistake is a /health endpoint that only checks that the process is running; meanwhile the database is unreachable and every real request returns 500. The health check should exercise the critical downstream dependencies.
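A rough sketch of both kinds of check follows. The `probes` map, the 503 status for an unhealthy result, and the "three consecutive failures" ejection rule are assumptions made for illustration; real balancers expose their own thresholds and recovery rules.

```javascript
// Deep health check: healthy only if every critical dependency probe passes.
// `probes` maps a dependency name to an async function that throws on failure.
async function checkHealth(probes) {
  const results = {};
  for (const [name, probe] of Object.entries(probes)) {
    try {
      await probe();
      results[name] = 'ok';
    } catch (err) {
      results[name] = `fail: ${err.message}`;
    }
  }
  const healthy = Object.values(results).every((r) => r === 'ok');
  return { status: healthy ? 200 : 503, results };
}

// Passive check: eject a backend after `threshold` consecutive failed requests.
// A single success resets the counter.
function makePassiveTracker(threshold = 3) {
  const failures = new Map(); // backend name -> consecutive failure count
  return {
    record(backend, ok) {
      failures.set(backend, ok ? 0 : (failures.get(backend) || 0) + 1);
    },
    isEjected(backend) {
      return (failures.get(backend) || 0) >= threshold;
    },
  };
}

// Usage with hypothetical probes: the process is up, but the database is not.
checkHealth({
  database: async () => { throw new Error('connection refused'); },
  cache: async () => {},
}).then((report) => console.log(report.status, report.results));
```

A /health handler built on `checkHealth` would return the 503 to the balancer's active check, which is exactly the case a process-only check misses.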

Sticky sessions route the same client to the same backend (via cookie, IP, or hashed header). Stickiness is sometimes necessary (legacy apps that keep session state in memory), but it is also a code smell: it ties your scaling to client behaviour and makes graceful shutdown harder. Prefer stateless backends with a shared session store.

DNS and CDN architecture

DNS turns names into IP addresses. A user types electrominds.in; their resolver walks the DNS hierarchy (root → .in → electrominds.in) and returns an IP. The request then goes to that IP. The whole process is invisible to the user, takes 10-50 ms, and is one of the most-cached pieces of infrastructure on the planet.

The DNS records you'll meet in practice:

Two properties of DNS that matter for architecture:

CDNs build on this. A CDN operates hundreds of edge POPs (Points of Presence) around the world. Your CDN-fronted hostname resolves, via Anycast or GeoDNS, to the nearest POP. The POP serves the request from its cache; on miss, it fetches from your origin.

text
   User in Mumbai                          User in Berlin
        │                                       │
        ▼                                       ▼
   Mumbai POP                              Frankfurt POP
        │                                       │
        └────── (miss) ──► Origin server ◄──── (miss) ──┘
                          (one place)
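The hit/miss behaviour of a single POP in the diagram can be sketched as a TTL cache in front of the origin. This is a simplification under stated assumptions: `fetchOrigin`, the fixed `ttlMs`, and the injected clock are hypothetical, and a real CDN derives freshness from response headers rather than one global TTL.

```javascript
// Edge cache sketch: serve from cache while fresh, otherwise fetch from origin.
// `fetchOrigin` and `now` are injected so the example needs no network or clock.
function makeEdgeCache(fetchOrigin, ttlMs, now = Date.now) {
  const cache = new Map(); // path -> { body, expiresAt }
  return function get(path) {
    const entry = cache.get(path);
    if (entry && entry.expiresAt > now()) {
      return { body: entry.body, hit: true }; // served at the edge
    }
    const body = fetchOrigin(path); // miss or stale: go back to the origin
    cache.set(path, { body, expiresAt: now() + ttlMs });
    return { body, hit: false };
  };
}

// Usage: the second request for the same path never reaches the origin.
let originRequests = 0;
const mumbaiPop = makeEdgeCache((path) => {
  originRequests += 1;
  return `contents of ${path}`;
}, 60000);
mumbaiPop('/logo.png');
mumbaiPop('/logo.png');
console.log(originRequests); // 1
```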

Key design points for CDN architecture:

The combination of DNS + CDN means a global, fast, attack-resistant front door for your application that you would not have built yourself. For any consumer-facing service, this layer should be the default, not the optimisation.

API gateway

An API gateway sits between clients and your backend services. It is the single entry point for an API or for an entire microservices architecture. The gateway handles concerns that would otherwise be duplicated across every service: authentication, rate limiting, request routing, request/response transformation, observability, and contract enforcement.

The shape is familiar:

text
   Client ──► API Gateway ──┬──► Service A
                            ├──► Service B
                            └──► Service C
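Two of the cross-cutting concerns named above, rate limiting and request routing, can be sketched together. This is an illustration only: the token-bucket algorithm is one common choice rather than what any specific gateway product uses, and the route prefixes, the `apiKey` field, and the upstream names are hypothetical.

```javascript
// Token bucket: each client may burst up to `capacity` requests,
// refilled at `refillPerSec` tokens per second.
function makeRateLimiter(capacity, refillPerSec, now = Date.now) {
  const buckets = new Map(); // clientId -> { tokens, last }
  return function allow(clientId) {
    const t = now();
    const b = buckets.get(clientId) || { tokens: capacity, last: t };
    b.tokens = Math.min(capacity, b.tokens + ((t - b.last) / 1000) * refillPerSec);
    b.last = t;
    buckets.set(clientId, b);
    if (b.tokens < 1) return false;
    b.tokens -= 1;
    return true;
  };
}

// Prefix routing: the first matching prefix decides the upstream service.
const routes = [
  { prefix: '/orders', upstream: 'service-a' },
  { prefix: '/inventory', upstream: 'service-b' },
  { prefix: '/users', upstream: 'service-c' },
];

// The gateway pipeline: authenticate, rate-limit, then route.
// No business logic lives here; it only decides where the request goes.
function handleRequest(req, allow) {
  if (!req.apiKey) return { status: 401 };
  if (!allow(req.apiKey)) return { status: 429 };
  const match = routes.find((r) => req.path.startsWith(r.prefix));
  return match ? { status: 200, upstream: match.upstream } : { status: 404 };
}

const allow = makeRateLimiter(5, 1);
console.log(handleRequest({ apiKey: 'demo-key', path: '/orders/17' }, allow));
```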

What the gateway typically owns:

Common implementations: AWS API Gateway, Kong, Apigee, Tyk, KrakenD, Envoy (often via Istio), NGINX with the right config. For a simple HTTP front-end with auth and rate-limiting, NGINX or Envoy + a small bit of Lua is enough. For OAuth flows, developer portals, and per-customer plans, a managed service or Kong pays off.

Backend for Frontend (BFF) is a closely related pattern: instead of one gateway for all clients, you have a gateway per client type. The mobile BFF returns a slim payload; the web BFF includes admin-only fields; the partner BFF speaks a different contract. Each BFF can be owned by the team that owns that client.

A caution: an API gateway can drift into being a god service that contains business logic. Resist this. The gateway's job is plumbing — cross-cutting concerns and routing. Domain logic stays in the services. If you find yourself implementing "if the user's plan is premium, fetch the premium-data service" in the gateway, you have crossed the line.

HTTP/1.1 vs HTTP/2 vs HTTP/3

HTTP has had three major revisions in three decades, and each one changed how clients and servers should be architected. The differences matter because they decide how much concurrency you get per connection, how many connections clients open, and how latency-sensitive your design needs to be.

HTTP/1.1 (1997). One request per connection at a time. Pipelining exists in the spec but is broken in practice. Browsers open 6-8 parallel TCP connections per origin to fake concurrency. Each connection needs its own TCP handshake and, with HTTPS, its own TLS handshake, which can add hundreds of milliseconds on a cold connection.

Key property: head-of-line blocking on the connection. If request 1 is slow, requests 2-N on the same connection wait.

HTTP/2 (2015). Single TCP connection per origin, with many concurrent streams multiplexed inside it. Binary framing instead of text. Server push (now deprecated in browsers). Header compression (HPACK).

text
   HTTP/1.1: 6 connections × 1 request at a time
   ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐
   │R1 │ │R2 │ │R3 │ │R4 │ │R5 │ │R6 │
   └───┘ └───┘ └───┘ └───┘ └───┘ └───┘
   ↳ 6 TCP setups, 6 TLS handshakes

   HTTP/2: 1 connection × N concurrent streams
   ┌─────────────────────────────────────┐
   │ stream 1: R1 ──────────────────────│
   │ stream 2: R2 ──────────────────────│
   │ stream 3: R3 ──────────────────────│
   │ stream 4: R4 ──────────────────────│
   └─────────────────────────────────────┘
   ↳ 1 TCP setup, 1 TLS handshake, full concurrency

For API traffic and microservices (think gRPC, which runs on HTTP/2), this is a transformative speedup. The remaining issue: head-of-line blocking on the TCP layer. If one packet is lost, the whole TCP connection stalls until it is retransmitted, even though only one stream needed that packet. TCP doesn't know about streams.

HTTP/3 (2022). Runs on QUIC instead of TCP. QUIC is a new transport protocol built on UDP, with TLS 1.3 baked in. Each stream has its own ordering, so a lost packet on stream 1 doesn't stall stream 2. The TLS handshake is folded into the QUIC handshake; in many cases you get 0-RTT connection setup with a server you have talked to before.
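The difference between connection-wide ordering (TCP under HTTP/2) and per-stream ordering (QUIC under HTTP/3) can be made concrete with a toy model. The packet list and arrival times below are invented for illustration, and the model ignores congestion control and everything else a real transport does; it only tracks when each packet can be handed to the application.

```javascript
// Each packet has a stream id and the time (ms) it finally arrives.
// Array order is send order; a lost packet arrives late, after retransmission.
// A packet can only be delivered once everything before it in the same
// ordering scope has arrived.
function deliveryTimes(packets, perStreamOrdering) {
  const latest = new Map(); // ordering scope -> latest arrival seen so far
  return packets.map((p) => {
    const scope = perStreamOrdering ? p.stream : 'connection';
    const delivered = Math.max(latest.get(scope) || 0, p.arrival);
    latest.set(scope, delivered);
    return { stream: p.stream, delivered };
  });
}

const packets = [
  { stream: 1, arrival: 10 },
  { stream: 2, arrival: 11 },
  { stream: 1, arrival: 250 }, // lost once, retransmitted
  { stream: 2, arrival: 13 },
  { stream: 2, arrival: 14 },
];

// One ordering scope for the whole connection (TCP-like):
console.log(deliveryTimes(packets, false).map((d) => d.delivered)); // [10, 11, 250, 250, 250]
// One ordering scope per stream (QUIC-like):
console.log(deliveryTimes(packets, true).map((d) => d.delivered)); // [10, 11, 250, 13, 14]
```

In the connection-wide case, stream 2's last two packets wait behind a retransmission they have nothing to do with; with per-stream ordering only stream 1 pays for the loss.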

Key property for designers: connection migration. A QUIC connection is identified by a connection ID, not by an IP/port four-tuple like TCP. If your phone switches from wifi to LTE, the connection survives the IP change. Long-lived QUIC sessions to a mobile app are practical in a way TCP ones were not.

When to care about which version, in practice:

The practical rule: enable HTTP/2 everywhere on the public-facing edge today; HTTP/3 wherever your CDN supports it; HTTP/2 internally between services. Don't go out of your way to support HTTP/1.1 for new internal traffic — it costs you concurrency.

WebSockets and real-time communication

Request/response is the bread-and-butter pattern of the web, but it cannot do everything. Some features need the server to push to the client without the client asking — chat, live notifications, collaborative editing, real-time dashboards, multiplayer game state. There are four ways to do this; understanding the tradeoffs decides which one fits.

1. Polling. The client asks every few seconds. "Anything new?" "No." "Anything new?" "No." Trivial to implement; terrible at scale. 10,000 clients polling every 5 seconds is 2,000 RPS of pointless work.

2. Long polling. The client asks; the server holds the request open until it has something to send (or until a timeout). Better than naive polling, but each notification still costs an HTTP round trip, and the server has to manage many slow connections.

3. Server-Sent Events (SSE). A standard HTTP response that the server keeps open, sending text events one after another over the same connection. One-way (server → client). Built on plain HTTP, so it traverses every proxy and load balancer that exists. The browser API is EventSource. Use for: notifications, live updates, dashboard streams — anything one-way.

javascript
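// Open a one-way event stream; the browser reconnects automatically if it drops.
const es = new EventSource('/events');
// updateUI is the application's own render function (not defined here).
es.onmessage = (e) => updateUI(JSON.parse(e.data));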

4. WebSockets. A full-duplex (bidirectional) connection upgraded from an HTTP request. Once established, either side can send messages at any time, with minimal overhead per message. The URL scheme is ws:// or wss:// (encrypted with TLS). Use for: chat, collaborative editing, multiplayer games, anything where the client also has frequent things to say.

The architecture cost of long-lived connections is what people underestimate. Each open WebSocket is a TCP connection, a TLS session, server memory, a file descriptor, and (often) a slot in your load balancer's connection table. 100,000 concurrent WebSockets is a non-trivial deployment.
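The capacity arithmetic behind these options is worth writing down. The polling figure reproduces the number quoted above; the per-connection memory figure of 50 KiB is purely an assumption for illustration, since the real value depends on the server, TLS library, and buffer sizes, and should be measured.

```javascript
// Polling load: every client sends one request per interval.
function pollingRps(clients, intervalSec) {
  return clients / intervalSec;
}

// Memory held by idle long-lived connections, in GiB.
// `bytesPerConnection` is an assumed figure, not a measured one.
function connectionMemoryGiB(connections, bytesPerConnection) {
  return (connections * bytesPerConnection) / 2 ** 30;
}

console.log(pollingRps(10000, 5)); // 2000
console.log(connectionMemoryGiB(100000, 50 * 1024).toFixed(2)); // "4.77"
```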

Key concerns:

text
   Client A ──► Server 1 ──┐
                            ├──► Redis Pub/Sub ──► fan-out
   Client B ──► Server 2 ──┘
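The fan-out in the diagram can be sketched with an in-memory broker standing in for Redis Pub/Sub. Everything here is hypothetical scaffolding: the `chat` channel name, the arrays used as stand-ins for client sockets, and the synchronous delivery. The point is only the shape: each server subscribes once and relays to its own local clients.

```javascript
// In-memory stand-in for the shared broker (Redis Pub/Sub in the diagram).
function makeBroker() {
  const subscribers = new Map(); // channel -> Set of callbacks
  return {
    subscribe(channel, callback) {
      if (!subscribers.has(channel)) subscribers.set(channel, new Set());
      subscribers.get(channel).add(callback);
    },
    publish(channel, message) {
      for (const callback of subscribers.get(channel) || []) callback(message);
    },
  };
}

// One WebSocket server: it tracks only its own clients and relays every
// broker message to them. An array stands in for each client's socket.
function makeServer(broker) {
  const clients = new Map(); // clientId -> messages delivered to that client
  broker.subscribe('chat', (message) => {
    for (const inbox of clients.values()) inbox.push(message);
  });
  return {
    connect(clientId) { clients.set(clientId, []); },
    send(clientId, text) { broker.publish('chat', { from: clientId, text }); },
    inbox(clientId) { return clients.get(clientId); },
  };
}

// Client A is on server 1, client B on server 2; B still receives A's message.
const broker = makeBroker();
const server1 = makeServer(broker);
const server2 = makeServer(broker);
server1.connect('A');
server2.connect('B');
server1.send('A', 'hello');
console.log(server2.inbox('B')); // [ { from: 'A', text: 'hello' } ]
```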

The pragmatic rule: pick SSE if you only need server → client. Pick WebSockets only when you genuinely need bidirectional, frequent, low-latency messaging. The implementation and operational complexity of WebSockets is significantly higher.

Service discovery

In a microservices architecture, services need to find each other. "The orders service needs to call the inventory service" requires knowing which IP and port the inventory service is currently running on — and the answer changes every time an instance is added, removed, or moves to a different host.

Hard-coded IPs work for two services on two known machines. Past that, you need a discovery mechanism.

Two patterns:

1. Client-side discovery. The calling service queries a registry, gets a list of instances, and picks one (using its own load balancing logic). Examples: Netflix Eureka, Consul with a smart client.

text
   Client ──► Registry ──► [10.0.1.5:8080, 10.0.1.7:8080]
                                       │
                                       ▼
                                pick one, call it
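A minimal sketch of the client side of this pattern, assuming a `lookup` function that returns the registry's current instance list. The round-robin choice is just one possible client-side policy, and the `inventory` service name is hypothetical; real clients such as Eureka's also cache the list and refresh it in the background.

```javascript
// Client-side discovery: ask the registry for instances, then choose one locally.
function makeDiscoveryClient(lookup) {
  const counters = new Map(); // service -> number of calls resolved so far
  return function resolve(service) {
    const instances = lookup(service);
    if (!instances || instances.length === 0) {
      throw new Error(`no instances registered for ${service}`);
    }
    const n = counters.get(service) || 0;
    counters.set(service, n + 1);
    return instances[n % instances.length]; // round-robin across the list
  };
}

// Hypothetical registry contents, matching the addresses in the diagram.
const registryData = { inventory: ['10.0.1.5:8080', '10.0.1.7:8080'] };
const resolve = makeDiscoveryClient((service) => registryData[service]);
console.log(resolve('inventory')); // 10.0.1.5:8080
console.log(resolve('inventory')); // 10.0.1.7:8080
```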

2. Server-side discovery. The client calls a virtual endpoint; a load balancer or proxy resolves it to a backend. Examples: Kubernetes Services, AWS ALB target groups, Envoy with a service registry.

text
   Client ──► inventory-svc:80 ──► (proxy looks up) ──► backend

Server-side is more common today because the orchestrator (Kubernetes) takes care of it. The client just calls http://inventory-svc:80 and Kubernetes's kube-proxy handles the IP routing.

The registry itself is a state store with strong consistency requirements (you don't want two clients seeing different views of the live instances). Common backends: etcd (used by Kubernetes), Consul, ZooKeeper. Their consensus algorithms (Raft in etcd and Consul, covered in Module 7, and the closely related ZAB in ZooKeeper) are the same machinery you'd use to build the registry from scratch.

Health checks drive what shows up in the registry. An instance registers itself on startup, sends regular heartbeats, and is removed if heartbeats stop. The registry's consensus protocol ensures every other node sees the same removal at roughly the same time.
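The register, heartbeat, and expiry cycle can be sketched as a single-node, in-memory registry. This deliberately leaves out the consensus layer, and the `ttlMs` value and the lazy expiry on read are assumptions made to keep the example short.

```javascript
// Heartbeat registry: an instance stays listed only while its heartbeats
// keep arriving within `ttlMs`. `now` is injectable for deterministic runs.
function makeRegistry(ttlMs, now = Date.now) {
  const services = new Map(); // service -> Map(address -> last heartbeat time)
  return {
    // Called on startup (registration) and periodically afterwards.
    heartbeat(service, address) {
      if (!services.has(service)) services.set(service, new Map());
      services.get(service).set(address, now());
    },
    // Drops instances whose last heartbeat is too old, then lists the rest.
    instances(service) {
      const entries = services.get(service) || new Map();
      for (const [address, seen] of entries) {
        if (now() - seen > ttlMs) entries.delete(address);
      }
      return [...entries.keys()];
    },
  };
}

// Usage with a fake clock: one instance keeps beating, the other goes silent.
let clock = 0;
const registry = makeRegistry(3000, () => clock);
registry.heartbeat('inventory', '10.0.1.5:8080');
registry.heartbeat('inventory', '10.0.1.7:8080');
clock = 2000;
registry.heartbeat('inventory', '10.0.1.5:8080');
clock = 4000;
console.log(registry.instances('inventory')); // [ '10.0.1.5:8080' ]
```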

DNS as service discovery. Some setups skip the dedicated registry and use DNS: inventory.svc.internal returns a list of A records, one per healthy instance. Kubernetes does this via CoreDNS for headless Services; a regular Service name resolves to a single stable virtual IP instead. The downside is DNS caching: a stale resolver gets stale answers. Short TTLs partially fix this, but pure-DNS service discovery is a tradeoff against the explicit registry approach.

Service mesh — when proxies become a layer

A service mesh is a layer of network proxies, one per service instance, that handles all in-cluster service-to-service traffic. Each application doesn't talk directly to its peers — it talks to a local proxy (the "sidecar") which talks to the peer's proxy. The proxies handle TLS, retries, timeouts, circuit breaking, observability, and traffic shaping — uniformly, for every service, without touching application code.

text
   Service A pod                 Service B pod
   ┌─────────────────┐           ┌─────────────────┐
   │  app  │ sidecar │ ──TLS──►  │ sidecar │  app  │
   └─────────────────┘           └─────────────────┘
        ▲       │
        │       └── traffic policy from control plane
        │
   Control plane (Istio, Linkerd, Consul Connect)
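Circuit breaking is one of the behaviours a sidecar applies on the application's behalf. The sketch below shows the state machine in isolation; it is not how Envoy or Linkerd implement it. The `threshold` and `cooldownMs` values are hypothetical, and the wrapped call is synchronous only to keep the example short, whereas real calls are asynchronous network requests.

```javascript
// Circuit breaker: after `threshold` consecutive failures, reject calls
// immediately for `cooldownMs`, then let a trial call through.
function makeCircuitBreaker(call, { threshold = 3, cooldownMs = 5000, now = Date.now } = {}) {
  let failures = 0;
  let openedAt = null; // null means the circuit is closed
  return function guarded(...args) {
    if (openedAt !== null) {
      if (now() - openedAt < cooldownMs) throw new Error('circuit open');
      openedAt = null; // cooldown over: allow a trial call (half-open)
    }
    try {
      const result = call(...args);
      failures = 0; // any success closes the circuit fully
      return result;
    } catch (err) {
      failures += 1;
      if (failures >= threshold) openedAt = now();
      throw err;
    }
  };
}

// Usage with a hypothetical peer that is currently down.
const callPeer = makeCircuitBreaker(() => { throw new Error('peer unavailable'); }, { threshold: 2 });
for (let i = 0; i < 3; i++) {
  try {
    callPeer();
  } catch (err) {
    console.log(err.message); // peer unavailable, peer unavailable, circuit open
  }
}
```

The third call fails fast without touching the peer, which is the point: a struggling service stops receiving traffic it cannot serve.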

What you get from a mesh:

The cost is real:

When a mesh is worth it: many services (~20+), strong security requirements (mTLS, zero-trust networking), polyglot stacks where you can't standardise a single retry library across languages, complex traffic shaping for canary deploys. Below that bar, an API gateway plus a good HTTP client library in each service does most of the same work with much less operational cost.

Kubernetes is where most of this lives today. The cluster handles service discovery (DNS), load balancing (kube-proxy + Services), and deployment lifecycle (Deployments, Pods). The mesh runs on top of Kubernetes; the service registry and basic LB are below the mesh. Knowing which problem each layer solves stops the mistake of trying to make Kubernetes do mesh things, or vice versa.

The summary picture, top to bottom: CDN at the edge, DNS + global LB to direct to a region, API gateway in front of the cluster, service mesh inside the cluster, services talking to each other through proxies. Each layer earns its place by solving one well-defined problem. The next module is about how services talk asynchronously — queues, Kafka, pub/sub, streams — for the cases where direct HTTP calls are the wrong shape.

