Home
High-Level System Design / Module 2 — Data Storage & Databases

Data Storage & Databases

SQL vs NoSQL, indexing, sharding, replication, object storage, NewSQL — picking the right storage for each part of a real system.


Why storage choice is the first architectural decision

Most other architectural mistakes are recoverable. Pick the wrong queue, you can swap it. Pick the wrong CDN, you change providers in a week. Pick the wrong storage engine for your central data model and you are rewriting your application for a year.

Storage is the slowest thing to change because every service in your system touches it, the data needs to migrate without downtime, and the access patterns calcify around what the original choice was good at. A team that picked Cassandra for what should have been a relational workload spends years adding a layer of joins they did not need to write.

The right framing is not "which database is best." It is: for each piece of data my system stores, what are the access patterns, the consistency requirements, and the growth profile? Different parts of the same system often need different storage. User accounts on Postgres. Session tokens in Redis. Product catalogue in Elasticsearch. Order history archived to S3. Each one chosen because it matches the shape of the work — not because it is the team's favourite tool.

Relational databases — when normalisation pays

Relational databases (Postgres, MySQL, SQL Server, Oracle) store data in tables with strictly enforced schemas, primary and foreign keys, and ACID transactions. The data model is normalised — facts live in one place and are referenced by ID from elsewhere — and the query language is SQL.

The reason relational databases have dominated for forty years is the combination of three things:

sql
-- Normalised: users and orders in separate tables, joined by a foreign key
CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    email TEXT UNIQUE NOT NULL,
    country TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE orders (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL REFERENCES users(id),
    amount_cents BIGINT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_orders_user ON orders(user_id);
CREATE INDEX idx_users_country ON users(country);

The query for the example above is one line: a JOIN, a WHERE on country and on amount_cents, and a GROUP BY on month. The database planner figures out which indexes to use.

The traditional weakness — "relational databases don't scale" — has been heavily overstated. Modern Postgres on a beefy server handles tens of thousands of writes per second and millions of reads per second with read replicas. Most companies who think they need NoSQL would be fine on a properly-tuned RDBMS until they pass tens of millions of users. The architectures of Stack Overflow, GitHub, and Shopify run on relational databases under tremendous load.

Where relational databases genuinely struggle:

For anything that looks like a transactional system of record — orders, users, payments, inventory, tickets — start with a relational database. You will spend less time fighting your storage and more time shipping features.

NoSQL — four shapes, four reasons

"NoSQL" is not one thing. It is a label put on four very different storage models, each with different strengths. Understanding which kind of NoSQL you mean is more important than the NoSQL label itself.

1. Key-value stores (Redis, DynamoDB in key-value mode, etcd, RocksDB). The simplest data model: a key maps to an opaque value. Operations are GET, PUT, DELETE. Performance is exceptional because there is nothing to query — you fetch by key.

Use for: caches, session storage, rate-limit counters, configuration. Anything where you know the key in advance and don't need to query by anything else.

text
   session:abc123 ──► { user_id: 42, expires: 1729012345 }
   rate:192.0.2.1 ──► 47   (request count this minute)

2. Document stores (MongoDB, Couchbase, AWS DocumentDB). The value is a JSON-ish document with internal structure. You can query inside documents (find all users where addresses.country = 'IN'). Schemas are flexible — different documents in the same collection can have different fields.

Use for: content with variable structure (product catalogues across categories), user profiles with optional sections, anything where the natural unit of data is a nested document and the access pattern is "fetch this whole document."

Watch out: the flexibility cuts both ways. Without schema discipline you end up with a collection where half the documents have a field called address (string) and the other half have address (object). MongoDB now offers schema validation; use it.

3. Wide-column stores (Cassandra, ScyllaDB, HBase, BigTable). The model is a sparse table with billions of rows and many columns, where each row can have a different set of columns. The query model is severely restricted — you query by partition key, possibly with a range on clustering keys, and that is roughly it.

The payoff is linear write scalability. Cassandra writes at hundreds of thousands of ops per second on commodity hardware because every write goes to one shard and is replicated to a few peers. There is no global coordinator, no JOIN, no foreign key — and that is precisely what makes it fast.

Use for: time-series data (IoT telemetry, metrics, logs at scale), feeds where the access pattern is always "all rows for this user in this time range," massive event histories.

4. Graph databases (Neo4j, Amazon Neptune, ArangoDB). Data is nodes and edges. The query language traverses relationships: "find people who follow people who follow Alice," "find the shortest path between these two accounts."

Use for: anything where the relationships are the data — social networks, fraud detection (suspicious chains of transactions), recommendation engines, knowledge graphs. A relational database can do these queries but the SQL gets ugly fast (recursive CTEs, many self-joins). A graph database is built for traversal.

The honest rule: pick relational unless one of these four NoSQL models clearly matches your access pattern. "We might need to scale later" is not a reason to start on NoSQL. Postgres will get you to 10 million users; by the time you need to leave, you will know exactly what to leave to.

Indexing — making queries fast without scanning

An index is a separate data structure that lets the database find rows without reading the whole table. The single most common cause of slow database queries is a missing index on a column you are filtering or joining by.

The canonical index is a B-tree. It is a balanced tree where every leaf is the same depth from the root, each node holds a sorted range of keys, and lookups walk from root to leaf in O(log n). On a table with a billion rows, finding one row by indexed key is a handful of disk reads instead of a billion.

text
   B-tree index on users.email:

                  [m]
                 /   \
             [c, h]   [r, w]
            /  |  \   /  |  \
         ... leaves with sorted email values and row pointers ...

Look-up cost: O(log n). For 10^9 rows with a fanout of ~100 per node, that's roughly 5 levels — five disk reads in the worst case.

B-trees handle equality (WHERE email = 'a@b.com'), range (WHERE created_at > '2026-01-01'), and prefix (WHERE name LIKE 'al%') queries efficiently. They do NOT help suffix queries (LIKE '%foo') or queries with leading wildcards.

Hash indexes (Postgres has them; some NoSQL stores use them under the hood) are O(1) for equality but useless for range queries. Use them only when you know all queries will be equality and the lookup volume is huge.

Composite indexes cover multiple columns. The order matters enormously. An index on (country, age) helps queries that filter by country, or by country AND age — but it does NOT efficiently help a query that filters only by age. The rule: order the columns in the index by selectivity, with the most-filtered column first.

Indexes are not free. Every index slows down writes (because every insert/update has to update every relevant index) and consumes disk space (often 10-30% of the table size). The rule of thumb: index columns you filter by, JOIN by, and ORDER BY. Don't index columns you only ever read in the SELECT list.

A few mistakes that cost real teams real money:

Sharding — splitting one database into many

Once a single database server cannot keep up — too many writes, too much data, or both — the standard escape hatch is sharding. The data is split horizontally across multiple servers. Each server (shard) owns a subset of the rows.

text
   Single DB                    Sharded across 4 servers
   ─────────                    ────────────────────────
   ┌───────────┐                ┌─────────┐ ┌─────────┐
   │ users 0-N │       ─►       │  0-25%  │ │ 25-50%  │
   └───────────┘                └─────────┘ └─────────┘
                                ┌─────────┐ ┌─────────┐
                                │ 50-75%  │ │ 75-100% │
                                └─────────┘ └─────────┘

The core choice is the shard key: the column whose value decides which shard a row goes to. Three common strategies:

Consistent hashing (covered properly in Module 7) is the technique most modern shard maps use. Instead of hash(key) % N, the keys and shards are placed on a logical ring, and a key goes to the next shard clockwise. The benefit: when you add or remove a shard, only ~1/N of keys need to move, instead of nearly all of them with naive mod-hashing.

The costs of sharding are real and underestimated:

The advice that almost always works: do not shard until you have proven the single-server option is exhausted. Vertical scaling (bigger machine), read replicas, caching, query optimisation, and archiving old data buy you many years of runway. Sharding is the last resort, not the first.

Replication — copies for read scale and failover

Replication is keeping copies of the data on multiple servers. It addresses two different problems at once: serving more read traffic and surviving the failure of a single node.

The two basic shapes:

For most teams, primary-replica is the right answer. The configuration choice that matters most is synchronous vs asynchronous replication:

Most production setups use a hybrid: synchronous to one local replica (for safety against single-node failure), asynchronous to remote replicas (for read scale and disaster recovery).

text
          ┌─────────┐   sync   ┌──────────┐
   Write─►│ Primary │ ───────► │ Replica1 │  (same datacentre)
          └────┬────┘          └──────────┘
               │
               │ async
               ▼
          ┌──────────┐
          │ Replica2 │  (other datacentre — DR)
          └──────────┘

Three practical concerns once you have replicas:

Object storage — the right place for files

Object storage (S3, GCS, Azure Blob) is the answer to a problem databases are bad at: storing files. Images, videos, PDFs, backups, ML training data, log archives — anywhere the unit is a blob, the access pattern is read-mostly, and you don't need to query inside the content.

What makes object storage different from a filesystem or a database BLOB column:

text
   Browser ──PUT (presigned URL)──► S3
      │                              │
      │                              ▼
      │                       Stored as object
      │                              │
      ▼                              ▼
   POST metadata ──► App ──► DB (file_id, owner, s3_key, size)

The canonical pattern for file uploads: the app signs a URL, the client uploads directly to S3 (bypassing your servers), then notifies the app of the new file ID. Your bandwidth bill drops, your app servers are free for application work, and the upload speed is whatever S3 can manage — which is fast.

Storage tiers let you trade access speed for cost. S3 Standard is for hot data. S3 Standard-IA (Infrequent Access) is cheaper per GB but charges retrieval fees. S3 Glacier is dirt cheap but retrieval takes minutes to hours. Move old data through these tiers with lifecycle policies — if a backup is two years old, it can almost always wait an hour to be retrieved.

What object storage is NOT:

NewSQL and specialised databases

Two more storage categories worth knowing because they show up in modern stacks.

NewSQL (CockroachDB, Spanner, TiDB, YugabyteDB) is the attempt to keep SQL's developer experience while gaining the horizontal scalability of NoSQL. They look like Postgres or MySQL on the wire — you write SQL, you get ACID transactions — but the storage engine is distributed underneath. Writes are coordinated via a consensus protocol (usually Raft), data is sharded automatically, and the cluster can grow by adding nodes.

The costs are real. Cross-node transactions have higher latency than single-node ones because of the consensus round-trip. Some SQL features (certain window functions, complex stored procedures) are limited or unsupported. Operational complexity is meaningfully higher than running a managed Postgres.

Reach for NewSQL when you genuinely need both: SQL semantics AND horizontal scale beyond what a single Postgres can handle. Most teams don't.

Time-series databases (InfluxDB, TimescaleDB, Prometheus's local store) are optimised for a very specific access pattern: append-mostly writes of (timestamp, tags, value) rows, then queries over time ranges with aggregations. They handle 100k-1M writes per second per node by exploiting that the timestamp is monotonic and data older than a few hours is rarely modified.

Use for: metrics, IoT telemetry, financial tick data. Don't shoehorn user data into a time-series database because someone said it was fast — it is fast for one particular shape of work.

Search engines (Elasticsearch, OpenSearch, Solr, Meilisearch) are optimised for full-text and faceted search across documents. They build inverted indexes that map every word to the documents containing it, plus relevance scoring (BM25), fuzzy matching, and aggregations. A naive LIKE '%term%' query in SQL is O(n); Elasticsearch returns the same answer over a billion documents in tens of milliseconds.

Pattern: keep the source of truth in a relational database, and pipe writes to a search index (often via a queue or change-data-capture). The index is rebuildable; the database is the system of record.

Vector databases (Pinecone, Weaviate, Milvus, pgvector) store embeddings — high-dimensional float vectors that represent the semantic content of text, images, or audio — and search by nearest neighbour. The query is "find the 10 vectors closest to this one" using cosine similarity or Euclidean distance. They are the storage layer of RAG (retrieval-augmented generation) systems.

Use when: you need semantic search ("find documents about retirement planning" without the document containing the words "retirement planning"). pgvector is fine until you need to scale past a few million vectors; specialised vector databases earn their keep at billions of vectors with sub-100ms queries.

The overall picture is that storage is no longer one decision. A real backend touches three to seven different stores, each picked for one job. The job of the architect is not to find the perfect database — it is to pick a small set, each playing the role it is good at, and to keep the boundaries between them clean. The next module covers what sits in front of those storage layers: caches.


⁂ Back to all modules