Classic System Designs
URL shortener, rate limiter, newsfeed, chat, file upload, ride-sharing, video streaming, payments, distributed ID generator — the canonical interview designs done seriously.
How to approach a system design interview
The nine designs in this module are the canonical questions interviewers ask, but the value is not in memorising them. It's in the process — the order in which you reason through any new design. Every one of these follows the same skeleton:
1. Clarify requirements (functional + non-functional)
2. Back-of-envelope estimation (QPS, storage, bandwidth)
3. API design (key endpoints, request/response)
4. Data model + storage choice (per the access patterns)
5. High-level architecture (services, queues, caches, LB)
6. Deep dive on the hardest 1-2 parts
7. Identify bottlenecks; scale them
8. Failure modes; what breaks first
The single most common mistake is jumping to step 5 without doing 1-3. You end up designing a beautiful system for a different problem than the interviewer asked about.
The other common mistake is treating each design as a memorised template. Every real system has tradeoffs the template doesn't cover. "Design Twitter" for 200M DAU is different from "design Twitter" for the first 100k users. State your assumptions and let them drive the design.
The sections below run the skeleton on each canonical problem, deep enough that you can actually answer it under pressure, short enough that the patterns are visible. The patterns repeat — write-heavy fan-out, hot keys, sharded counters, cache stampedes, eventual consistency. By the ninth design they should feel familiar.
URL shortener
Requirements. Convert long URLs to short ones (bit.ly/abc123). Redirect short to long. ~100M new URLs/month, ~10B redirects/month. Custom aliases optional. Analytics optional but common.
Estimation.
New URLs: 100M/month ≈ 40/sec (peak ~400/sec)
Redirects: 10B/month ≈ 4000/sec (peak ~40,000/sec)
Read:write ratio ≈ 100:1
Storage: 500 bytes/URL × 100M/month = 50 GB/month → 600 GB/year
Key insight. Reads dominate by two orders of magnitude. Optimise for read latency and read scale. Writes are nothing.
API.
POST /shorten body: {long_url, [custom_alias], [expires_at]}
→ 201 {short_code}
GET /:code → 302 redirect to long_url
Generating short codes — the design choice that defines this system.
Three approaches:
- Base62 of an auto-incrementing ID. Counter goes 1, 2, 3...; encoded in base62 [0-9a-zA-Z], gives
b,c...zzzzz(7 chars covers 3.5 trillion URLs). Pros: short codes; no collisions. Cons: predictable (someone can enumerate every URL); needs a coordinated counter (DB sequence or distributed ID generator — covered later in this module). - Random base62 string with collision check. Generate 7 random characters; SELECT to check uniqueness; INSERT. Cons: extra read on every write; rare collisions retry.
- Pre-generate a pool. A background job pre-computes millions of unused short codes into a table. Shortening a URL is just "grab next from pool." Pros: write becomes O(1) with no DB-side coordination. Cons: extra service.
Most real systems use option 1 or 3. Option 2 doesn't scale.
Storage. Postgres or any RDBMS works fine for 600 GB. Single primary + read replicas for the redirect traffic. Schema:
CREATE TABLE urls (
code TEXT PRIMARY KEY,
long_url TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ,
owner_id BIGINT REFERENCES users(id)
);
Cache. The hot tail: the 1-5% of URLs receiving 80%+ of redirects. Cache aggressively in Redis with a long TTL (URLs almost never change). Cache-aside pattern. Hit ratio in practice: 90%+ — the Postgres replicas barely see traffic.
CDN trick. The redirect can be served by an edge worker (Cloudflare Workers, Lambda@Edge). The short code → long URL mapping lives in an edge KV store, updated when new URLs are created. Now the user's browser-to-redirect latency is one local hop. The origin only sees writes and cold-cache fetches.
Analytics. If you record every redirect, you have 4k events/sec. Stream them into Kafka, aggregate into per-day per-URL counts in a separate analytics database. Don't bloat the URL table with click counts.
Rate limiter service
Requirements. A rate limiter that other services call. Decide whether a given (user, endpoint) is within their limit. Sub-millisecond p99 latency. Survives Redis failures.
Estimation. If you front a 100k RPS API, the rate limiter sees 100k RPS itself. Memory: a counter per (user, endpoint, window) — for 10M users × 50 endpoints × current minute = 500M keys. Each key ~50 bytes → 25 GB. Fits on a small Redis cluster.
API.
POST /check body: {key, limit, window_seconds}
→ 200 {allowed: bool, remaining: N, reset_at: T}
Algorithm choice. Token bucket via Lua script in Redis. The script is atomic — read counter, check, decrement, set TTL — without race conditions:
-- KEYS[1] = bucket key, ARGV = {capacity, refill_per_sec, now}
local bucket = redis.call('HMGET', KEYS[1], 'tokens', 'updated')
local tokens = tonumber(bucket[1]) or ARGV[1]
local updated = tonumber(bucket[2]) or ARGV[3]
local elapsed = ARGV[3] - updated
tokens = math.min(ARGV[1], tokens + elapsed * ARGV[2])
if tokens >= 1 then
tokens = tokens - 1
redis.call('HMSET', KEYS[1], 'tokens', tokens, 'updated', ARGV[3])
redis.call('EXPIRE', KEYS[1], 3600)
return 1 -- allowed
else
return 0 -- denied
end
Architecture.
API ──► Rate-limiter sidecar/library ──► Redis cluster
│
└─ Local LRU cache of recent decisions (optional)
- Sidecar vs library. Library is faster (no extra network hop). Sidecar is language-agnostic. Big organisations end up with both.
- Sharded Redis. Keys are sharded by user_id (consistent hashing). One user's counter lives on one shard — no cross-shard ops needed.
- Fail-open vs fail-closed. If Redis is unreachable, do you allow all traffic (open) or deny it (closed)? For protective rate limits (DDoS), fail closed. For business limits (free tier vs paid), fail open and recover when Redis is back. State this explicitly per limit.
Distributed coordination problem. A user is allowed 100 requests/minute. They send 50 requests simultaneously to 10 different app servers. If each server checks its own cache and not the central counter, they might collectively allow 500. Two fixes:
- Always check the central counter on every request. Costs one Redis call (~0.5 ms). Best for low/medium throughput limits.
- "Leaky bucket on the client side": each app server gets a sub-quota (10/min) and checks the central counter only when its sub-quota is near zero. Probabilistic but vastly cheaper. Best for high-throughput limits where occasional small over-allowance is acceptable.
Newsfeed / timeline
Requirements. When a user opens the app, show a feed of posts from people they follow, ordered by time. 500M DAU, 5B follows, 100M posts/day. Sub-200ms feed-fetch p99.
Estimation. Reads dominate again — most users open the app and read, only a fraction post. Average user follows ~200 accounts, has ~500 followers. Distribution is heavy-tailed: a celebrity has 100M followers.
The core tension. Two strategies for assembling a feed:
1. Pull on read (fan-out on read). When user X opens app: SELECT the latest posts from each of the 200 accounts X follows; merge; sort. Pro: no work on write. Con: feed-fetch is expensive and scales with the number of follows.
2. Push on write (fan-out on write). When user Y posts: write Y's post-ID into a feed list for every one of Y's followers. Pro: feed-fetch is just SELECT * FROM feeds WHERE user_id = X — O(1). Con: a celebrity post is 100M writes — a fan-out storm.
Neither extreme works. The answer is hybrid fan-out:
Regular user posts (most users): fan-out on write
Write to followers' feed lists (fast — typical user has hundreds of followers)
Celebrity posts (huge followers): fan-out on read
Don't fan-out; mark the user as "celebrity"
Feed-fetch for X = (X's pre-computed feed) ⨁ (recent posts of celebs X follows)
Storage. Per-user feed list in a wide-column store (Cassandra) or sorted set in Redis. Each entry is (timestamp, post_id). Limit to the last few hundred entries — old posts are fetched on demand only when a user scrolls far.
Posts themselves live in a separate store (Cassandra by post_id) so the feed list stays small and lookups are O(1) per post.
Cache. The feed for active users in Redis. TTL = a few minutes for cold users, refreshed continuously for hot ones. Cache-warming for the top X% of active users keeps the cache hot ratio above 90%.
Ranking. Pure chronological is rare today — most feeds rank by predicted engagement (an ML model). The architecture is the same; an extra service computes scores when the feed is fetched (or pre-computes them for active users).
Pagination. Cursor-based, not offset. The cursor is the (timestamp, post_id) of the last item you returned. Survives new posts arriving while the user scrolls.
Chat / messaging
Requirements. 1:1 and group chat, message ordering, delivery + read receipts, presence, push notifications when offline, message history. 100M DAU, billions of messages/day.
Estimation. Average message: 100 bytes. 10B messages/day = 10 TB/day stored. Concurrent WebSocket connections: ~30M at peak (a fraction of DAU connected at once).
Architecture.
Mobile/web ◄──WebSocket──► Gateway ──► Message Service ──► Storage
│ │
│ ▼
│ Kafka
│ │
▼ ▼
Presence Service Notification
(push if offline)
Connection layer. WebSocket gateways are stateful — once a user connects, they're pinned to one gateway. The gateway holds the connection, sends/receives messages, and updates presence. 100k connections per gateway is typical; 30M connections = ~300 gateways.
Routing messages between gateways. A and B are in the same chat but connected to different gateways. When A sends a message:
1. A's gateway receives the message. 2. Persist to storage (so it's not lost). 3. Publish to Kafka topic (or NATS subject) keyed by chat:room_id. 4. B's gateway subscribes to all chats with at least one connected member; receives the message; forwards to B over the WebSocket.
If B is offline, the notification service consumes the same Kafka topic and sends a push.
Storage for messages. Cassandra is the canonical choice — write-heavy, partition by chat_id, clustering key is message_id (a time-ordered ID like ULID or Snowflake). Reads are "recent N messages in this chat" — exactly one partition, one slice.
-- Cassandra schema, simplified
CREATE TABLE messages (
chat_id UUID,
message_id TIMEUUID, -- monotonic per chat
sender_id BIGINT,
body TEXT,
PRIMARY KEY (chat_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
Ordering. Per chat (one partition), Cassandra preserves the order of writes. Across chats, no global ordering — and that's fine; the chat is the unit of ordering.
Read receipts. A separate table indexed by (user_id, chat_id) tracking the last-read message_id. When a user opens a chat, update; broadcast to other members.
Group chat at scale. A group of 1000 means one message generates 1000 fan-out events. Same hybrid pattern as newsfeed — small groups fan-out eagerly, large groups (broadcast channels with millions) use a different model: messages are written once; clients pull periodically.
Presence. Heaviest part of the system if done naively. A user is "online" while they have an active WebSocket. Each gateway publishes presence updates to a Redis pub/sub keyed by user_id. Clients subscribe to presence for their contacts only. Even so, a popular user generates a lot of presence events — debounce, throttle, and don't propagate every twitch.
File upload service
Requirements. Users upload files of arbitrary size (1KB to 50GB). Files are durable, served fast worldwide. Resumable uploads. Per-file access control.
Estimation. 10M files/day at 5MB average = 50 TB/day stored, 18 PB/year. Bandwidth: 50 TB/day in → 500 TB/day out (assuming 10× read amplification).
Storage layer. Object storage (S3, GCS, Azure Blob) — module 2 covered this. Eleven 9s of durability, effectively unlimited capacity, cheap. The application's job is metadata, auth, and access control — NOT moving bytes through your servers.
The presigned URL pattern. Don't route file bytes through your servers. Generate a short-lived URL that grants the client direct write access to S3.
Client App S3
│ │ │
│── POST /upload ─────►│ │
│ │── generate presigned │
│ │ URL (PUT, 15min) ──┤
│◄── url + file_id ────│ │
│ │ │
│── PUT (bytes) ───────────────────────────► (direct)
│ │
│── POST /complete ───►│ │
│ │ (verify; mark ready) │
Why this matters. Without it, every uploaded byte traverses your app servers — bandwidth bill, request-thread saturation, slow uploads when servers are far from the user. With it, S3's edge takes the byte traffic; your app handles only metadata at ~ms each.
Multipart uploads for large files. The client splits the file into 5-100 MB chunks and uploads each separately. Pros: resumable (only retry failed chunks), parallel (saturate the user's connection), bypass per-request size limits.
POST /upload/init → upload_id
PUT /upload/{id}/parts/1 → ETag_1
PUT /upload/{id}/parts/2 → ETag_2
...
POST /upload/{id}/complete body: [ETag_1, ETag_2, ...] → finalises
S3 supports this natively. Most cloud SDKs implement it transparently.
Metadata storage. Postgres or similar. Each row: file_id, owner, S3 key, size, MIME type, sha256, upload status, access level, created_at. Indexed by owner for "list my files," by S3 key for the rare lookup-by-storage-path.
Access control. Presigned download URLs again — generated only after the app checks the user has permission. URL is valid for 10 minutes. No long-lived public URLs unless the file is intentionally public.
Deduplication. Compute the SHA-256 of the file as you upload (or as it sits in S3). If a file with the same hash already exists, point this user's metadata at the existing blob — don't store it twice. For consumer-grade dedup this saves substantial storage (everyone uploads the same memes).
Virus scanning, content moderation. A worker triggered on S3 ObjectCreated events. Scans, classifies, then either marks the file as available or quarantines it. The user's upload is acknowledged before scanning completes; the file is just pending until cleared.
Ride-sharing system
Requirements. Riders request rides; drivers nearby get matched; live location updates; ride lifecycle (request → matched → in-progress → complete → paid). Sub-second matching.
Estimation. 1M concurrent active drivers; 100k ride requests/minute peak. Location updates every 5 seconds per active driver = ~200k QPS.
The two hard problems.
1. Geo-spatial indexing. Given a rider's location, find nearby drivers fast. 2. Matching. Given a request, pick the best available driver (closest, highest-rated, lowest ETA).
Geo-indexing options.
- Geohash. Encode lat/lng as a string where shared prefixes mean geographic proximity. "dr5ru" and "dr5ry" are near each other. Store driver locations in Redis sorted sets keyed by geohash prefix; lookup is a prefix range query.
- H3 / S2. Hexagonal (Uber's H3) or quad-tree (Google's S2) hierarchical grids. Index drivers by cell ID at multiple resolutions; query the relevant cells.
- PostGIS. A relational database with full spatial indexing (GIST on geometry). Easier to start; harder to push past a few thousand QPS.
Uber uses H3. The hierarchical structure lets matching widen the search ring elastically — start with the smallest cell, expand outward until enough drivers are found.
Architecture.
Driver app ──► Location ingestion ──► Geo-index (Redis / in-memory)
│ ▲
└── every 5s position update │
│
Rider app ──► Matching service ────────────┘
│ │
│ ├── geo query: drivers within radius R
│ ├── score each (distance, ETA, rating)
│ ├── pick winner
│ ▼
│ Trip service ──► Persist trip ──► Notify driver
│
└── live updates via WebSocket
Location ingestion. 200k QPS of updates that don't all need durability. Stream into Kafka; in-memory consumers update the geo-index. The truth of the geo-index is in memory; persistence is for cold-start recovery (snapshots every few minutes).
Matching. When a rider requests:
1. Get the rider's geohash/H3 cell. 2. Query nearby drivers (typically widen rings until ~10 candidates). 3. Score each: distance, ETA from current location, driver rating, surge eligibility. 4. Offer to top-ranked driver. Wait N seconds for accept; if no, try next.
Concurrency. Two riders might match the same driver simultaneously. Use an optimistic lock: "UPDATE drivers SET state='offered' WHERE id=? AND state='available'" — only one update wins.
Surge pricing. A geographically- and temporally-aware multiplier. Live stream of (cell, supply, demand) → if demand >> supply, multiplier rises. Implemented as a Flink job on the location and request streams.
Live trip tracking. Driver's app pushes location every 2 seconds during a trip. Rider's app subscribes (WebSocket or polling). The trip service is essentially a small chat application keyed by trip_id.
Video streaming (YouTube)
Requirements. Upload videos in any format/resolution. Transcode to multiple streamable qualities. Serve to users with adaptive bitrate. Recommendations, comments, view counts.
Estimation. 500h of video uploaded/minute. 1B hours watched/day = ~250 Tbps of egress at average bitrate. Storage in petabytes; egress dwarfs everything.
Ingest and transcode.
Uploader ──► raw bytes ──► S3 (raw)
│
▼
Kafka: video_uploaded
│
▼
Transcoding workers (GPU)
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
240p chunks 480p chunks 1080p chunks
(S3) (S3) (S3)
│ │ │
└─────────────────────┼─────────────────────┘
▼
HLS / DASH manifests
│
▼
CDN
Transcoding is split into chunks (e.g., 2-second segments) so multiple workers transcode the same video in parallel. A single 1-hour video → ~1800 chunks per resolution × 5 resolutions = 9000 small files. Highly parallelisable; the bottleneck is GPU throughput, not coordination.
Adaptive bitrate (HLS/DASH). The player measures network throughput and switches quality every few seconds. The manifest is a small text file pointing at chunk URLs at each quality:
playlist.m3u8
├── 240p/segment_001.ts
├── 240p/segment_002.ts
├── 480p/segment_001.ts
├── 480p/segment_002.ts
├── 1080p/segment_001.ts
└── ...
The player downloads chunks on demand. The CDN caches them aggressively (immutable URLs). 99% of bytes are served by the CDN edge.
Storage tiering. Newly-uploaded videos go to standard storage. After 30 days, if view rate is low, move chunks to cooler storage (S3 IA). Truly cold videos move to Glacier. The metadata stays hot in Postgres; the bytes follow access patterns.
View counts. Naively incrementing a counter per view is 250k QPS of writes to one hot key — impossible on a single row. Pattern: each app server batches view counts locally, flushes to Kafka every few seconds; a stream processor aggregates into a per-video sharded counter; periodically writes back to the main DB. View counts are slightly delayed but accurate.
Recommendations. Out of scope here; in practice a giant ML pipeline (collaborative filtering + content embeddings + recency + personalisation) that publishes recommendations to a fast key-value store, looked up at request time.
Payment system
Requirements. Accept payments via card, UPI, wallet. Refunds, chargebacks. ACID, audit trail, fraud detection. Idempotency. PCI-DSS compliance.
Estimation. 10k payments/sec peak (a few large e-commerce sites combined). Each payment is many writes (charge attempt, gateway response, ledger entries, notifications, fraud checks).
The hard constraints.
- No double-charges. Idempotency is a hard requirement, not a nice-to-have.
- Audit trail. Every state change recorded. Disputes resolved by reading history.
- Strong consistency. Money is the canonical use case for ACID; eventual consistency is unacceptable on the ledger.
- Regulatory. PCI-DSS for card data; KYC/AML for higher-value transactions; varying per country.
Architecture.
Merchant ──► Payment API ──► Payment Service ──► Gateway (Stripe/etc)
│ │
│ ▼
│ Ledger (Postgres, ACID)
│ │
│ ▼
└────────► Fraud service (async)
│
▼
Notifications
Idempotency. Every payment request carries an idempotency_key from the merchant. The payment service records (key → result) in a dedicated table. On a duplicate, return the stored result without re-charging.
CREATE TABLE idempotency_keys (
key TEXT PRIMARY KEY,
request_hash TEXT NOT NULL, -- detect key reuse with different params
response_body JSONB,
status TEXT, -- pending, completed
created_at TIMESTAMPTZ DEFAULT NOW()
);
Lock the row on first request; release on completion. A second request for the same key sees the lock, waits or returns the in-progress status.
Ledger model. Double-entry bookkeeping: every transaction is a balanced pair of entries (debit one account, credit another). Sum across all entries for an account = current balance. Invariant: the sum of all entries across all accounts = 0.
CREATE TABLE ledger_entries (
id BIGSERIAL PRIMARY KEY,
transaction_id UUID NOT NULL,
account_id BIGINT NOT NULL,
amount_cents BIGINT NOT NULL, -- positive = credit, negative = debit
occurred_at TIMESTAMPTZ DEFAULT NOW()
);
-- Constraint: SUM(amount_cents) per transaction_id = 0
Writes are append-only. The ledger is the source of truth; balances are derived (often cached). Audit: every state change is a new row, never an update.
Gateway integration. External gateways (Stripe, Razorpay, PayU) handle the actual card processing. Your service talks to them over HTTPS with retries + idempotency keys. Webhook callbacks for asynchronous status changes (authorised, captured, refunded, disputed) — handle webhook duplicates by checking the webhook_id.
Saga for refunds. A refund flow spans payment service, ledger, and notification. If the gateway accepts the refund but the ledger write fails (very rare), you have a phantom refund. The compensating transaction is to reverse the refund at the gateway. Real payment systems have a reconciliation job that compares the gateway's books with the ledger nightly and flags discrepancies.
Fraud detection. Asynchronous; doesn't block the payment for performance. The payment is authorised, the bytes leave the building, and a fraud model scores it in the background — high-score transactions are captured immediately, low-score ones are held for review.
PCI-DSS. Don't store raw card numbers. Use the gateway's tokenisation — they return a token you store; the actual PAN never touches your servers. This reduces your PCI scope dramatically (from "every server in the path" to "the few servers that interact with the gateway").
Distributed ID generator
Requirements. Generate unique IDs for billions of objects/day. Roughly sortable by time. No central coordinator (or minimal). Fits in 64 bits (so it can be a database BIGINT).
Why this matters. UUIDs work but they are random — terrible for B-tree indexes (every insert is into a random page, killing cache locality). Auto-increment IDs from one database don't scale and create a hot bottleneck.
The canonical answer is Snowflake (originally Twitter), and variants like Sonyflake, ULID, Discord's variant, Instagram's variant.
Snowflake format (64 bits):
1 bit | 41 bits | 10 bits | 12 bits
sign | timestamp ms | machine ID | sequence
0 | since epoch | (1024 IDs) | (4096 per ms)
- Timestamp — milliseconds since a custom epoch. 41 bits gives ~69 years of range.
- Machine ID — 10 bits = 1024 unique generator IDs. Assigned per host (from ZooKeeper, etcd, or static config).
- Sequence — 12 bits = 4096 IDs per machine per millisecond. If exhausted, wait 1 ms.
Result: each machine generates up to ~4M IDs/second locally. 1024 machines = 4B IDs/second cluster-wide. No coordination per ID.
Properties:
- Sortable. Higher timestamp bits dominate, so IDs sort roughly by creation time.
- 64-bit fits in BIGINT — usable as a primary key in any database.
- No collisions as long as (timestamp, machine_id, sequence) is unique.
Failure modes:
- Clock going backwards — NTP correcting, leap second. If the timestamp moves backward, you might collide with IDs already generated. Mitigation: refuse to issue IDs while clock < last_issued_clock; wait it out.
- Machine ID collision — two machines accidentally configured with the same ID. Use a coordination service (etcd) to lease IDs.
Alternative: ULID. Same idea, but 128 bits encoded as a 26-character string. Lexically sortable, URL-safe. Pros: simpler to use (no machine-ID allocation). Cons: bigger (128 bits not 64).
When to use UUIDv7. The new UUID standard (2024) is a time-prefixed UUID — basically a standardised ULID with better tooling. Default choice for new systems if 128 bits is acceptable.
Database-side alternatives. Postgres BIGSERIAL works fine up to single-database scale. For sharded systems where you want each shard to generate IDs locally, you can give each shard a different start and increment (shard 0: 1, 5, 9...; shard 1: 2, 6, 10...). Doesn't sort by time but avoids the central counter.
This closes the canonical-designs module. The patterns recur — sharding by entity, write fan-out, async processing, hot caches, idempotency keys, append-only ledgers, presigned URLs, hybrid push/pull. Every design above combines a small set of techniques covered in earlier modules of this series. The next module fills the gaps — supplementary topics that didn't fit cleanly elsewhere.
⁂ Back to all modules