Data Storage & Databases
SQL vs NoSQL, indexing, sharding, replication, object storage, NewSQL — picking the right storage for each part of a real system.
Why storage choice is the first architectural decision
Most other architectural mistakes are recoverable. Pick the wrong queue, you can swap it. Pick the wrong CDN, you change providers in a week. Pick the wrong storage engine for your central data model and you are rewriting your application for a year.
Storage is the slowest thing to change because every service in your system touches it, the data needs to migrate without downtime, and the access patterns calcify around what the original choice was good at. A team that picked Cassandra for what should have been a relational workload spends years adding a layer of joins they did not need to write.
The right framing is not "which database is best." It is: for each piece of data my system stores, what are the access patterns, the consistency requirements, and the growth profile? Different parts of the same system often need different storage. User accounts on Postgres. Session tokens in Redis. Product catalogue in Elasticsearch. Order history archived to S3. Each one chosen because it matches the shape of the work — not because it is the team's favourite tool.
Relational databases — when normalisation pays
Relational databases (Postgres, MySQL, SQL Server, Oracle) store data in tables with strictly enforced schemas, primary and foreign keys, and ACID transactions. The data model is normalised — facts live in one place and are referenced by ID from elsewhere — and the query language is SQL.
The reason relational databases have dominated for forty years is the combination of three things:
- Strong consistency by default. A transaction either commits fully or rolls back. A foreign key cannot point at a non-existent row.
- Ad-hoc queries. Because data is normalised and indexed, you can answer questions you did not anticipate when you designed the schema. "How many users from Germany signed up in Q3 and then placed an order over $200?" — that is one SQL query against a schema built for general use.
- Mature tooling. Decades of optimisers, monitoring tools, backup utilities, GUI clients, and a labour market full of people who already know SQL.
-- Normalised: users and orders in separate tables, joined by a foreign key
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email TEXT UNIQUE NOT NULL,
country TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES users(id),
amount_cents BIGINT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_orders_user ON orders(user_id);
CREATE INDEX idx_users_country ON users(country);
The query for the example above is one line: a JOIN, a WHERE on country and on amount_cents, and a GROUP BY on month. The database planner figures out which indexes to use.
The traditional weakness — "relational databases don't scale" — has been heavily overstated. Modern Postgres on a beefy server handles tens of thousands of writes per second and millions of reads per second with read replicas. Most companies who think they need NoSQL would be fine on a properly-tuned RDBMS until they pass tens of millions of users. The architectures of Stack Overflow, GitHub, and Shopify run on relational databases under tremendous load.
Where relational databases genuinely struggle:
- Schemas that change shape rapidly. Every column addition is a migration. Once you are at scale, even fast migrations need careful ordering — adding a NOT NULL column to a billion-row table is a multi-hour exercise.
- Workloads with extreme write volume on a single hot table. A single primary node has a write ceiling. Sharding (covered later in this module) helps but introduces complexity.
- Storing very large blobs. Don't put 50MB PDFs in a database row. Use object storage and store the URL.
For anything that looks like a transactional system of record — orders, users, payments, inventory, tickets — start with a relational database. You will spend less time fighting your storage and more time shipping features.
NoSQL — four shapes, four reasons
"NoSQL" is not one thing. It is a label put on four very different storage models, each with different strengths. Understanding which kind of NoSQL you mean is more important than the NoSQL label itself.
1. Key-value stores (Redis, DynamoDB in key-value mode, etcd, RocksDB). The simplest data model: a key maps to an opaque value. Operations are GET, PUT, DELETE. Performance is exceptional because there is nothing to query — you fetch by key.
Use for: caches, session storage, rate-limit counters, configuration. Anything where you know the key in advance and don't need to query by anything else.
session:abc123 ──► { user_id: 42, expires: 1729012345 }
rate:192.0.2.1 ──► 47 (request count this minute)
2. Document stores (MongoDB, Couchbase, AWS DocumentDB). The value is a JSON-ish document with internal structure. You can query inside documents (find all users where addresses.country = 'IN'). Schemas are flexible — different documents in the same collection can have different fields.
Use for: content with variable structure (product catalogues across categories), user profiles with optional sections, anything where the natural unit of data is a nested document and the access pattern is "fetch this whole document."
Watch out: the flexibility cuts both ways. Without schema discipline you end up with a collection where half the documents have a field called address (string) and the other half have address (object). MongoDB now offers schema validation; use it.
3. Wide-column stores (Cassandra, ScyllaDB, HBase, BigTable). The model is a sparse table with billions of rows and many columns, where each row can have a different set of columns. The query model is severely restricted — you query by partition key, possibly with a range on clustering keys, and that is roughly it.
The payoff is linear write scalability. Cassandra writes at hundreds of thousands of ops per second on commodity hardware because every write goes to one shard and is replicated to a few peers. There is no global coordinator, no JOIN, no foreign key — and that is precisely what makes it fast.
Use for: time-series data (IoT telemetry, metrics, logs at scale), feeds where the access pattern is always "all rows for this user in this time range," massive event histories.
4. Graph databases (Neo4j, Amazon Neptune, ArangoDB). Data is nodes and edges. The query language traverses relationships: "find people who follow people who follow Alice," "find the shortest path between these two accounts."
Use for: anything where the relationships are the data — social networks, fraud detection (suspicious chains of transactions), recommendation engines, knowledge graphs. A relational database can do these queries but the SQL gets ugly fast (recursive CTEs, many self-joins). A graph database is built for traversal.
The honest rule: pick relational unless one of these four NoSQL models clearly matches your access pattern. "We might need to scale later" is not a reason to start on NoSQL. Postgres will get you to 10 million users; by the time you need to leave, you will know exactly what to leave to.
Indexing — making queries fast without scanning
An index is a separate data structure that lets the database find rows without reading the whole table. The single most common cause of slow database queries is a missing index on a column you are filtering or joining by.
The canonical index is a B-tree. It is a balanced tree where every leaf is the same depth from the root, each node holds a sorted range of keys, and lookups walk from root to leaf in O(log n). On a table with a billion rows, finding one row by indexed key is a handful of disk reads instead of a billion.
B-tree index on users.email:
[m]
/ \
[c, h] [r, w]
/ | \ / | \
... leaves with sorted email values and row pointers ...
Look-up cost: O(log n). For 10^9 rows with a fanout of ~100 per node, that's roughly 5 levels — five disk reads in the worst case.
B-trees handle equality (WHERE email = 'a@b.com'), range (WHERE created_at > '2026-01-01'), and prefix (WHERE name LIKE 'al%') queries efficiently. They do NOT help suffix queries (LIKE '%foo') or queries with leading wildcards.
Hash indexes (Postgres has them; some NoSQL stores use them under the hood) are O(1) for equality but useless for range queries. Use them only when you know all queries will be equality and the lookup volume is huge.
Composite indexes cover multiple columns. The order matters enormously. An index on (country, age) helps queries that filter by country, or by country AND age — but it does NOT efficiently help a query that filters only by age. The rule: order the columns in the index by selectivity, with the most-filtered column first.
Indexes are not free. Every index slows down writes (because every insert/update has to update every relevant index) and consumes disk space (often 10-30% of the table size). The rule of thumb: index columns you filter by, JOIN by, and ORDER BY. Don't index columns you only ever read in the SELECT list.
A few mistakes that cost real teams real money:
- Not indexing foreign keys. Most databases don't index foreign keys automatically. A JOIN against an unindexed FK is a full scan. Always index FKs.
- Indexing low-cardinality columns standalone. A boolean column "active" is useless to index on its own — half the rows match. As part of a composite index, fine.
- Letting unused indexes pile up. Old indexes from features that got removed still incur write cost forever. Query
pg_stat_user_indexes(Postgres) orsys.dm_db_index_usage_stats(SQL Server) periodically and drop ones with zero scans. - Function-wrapped columns kill indexes.
WHERE LOWER(email) = 'a@b.com'cannot use a normal index onemail. You either query by the raw column or create a functional index onLOWER(email).
Sharding — splitting one database into many
Once a single database server cannot keep up — too many writes, too much data, or both — the standard escape hatch is sharding. The data is split horizontally across multiple servers. Each server (shard) owns a subset of the rows.
Single DB Sharded across 4 servers
───────── ────────────────────────
┌───────────┐ ┌─────────┐ ┌─────────┐
│ users 0-N │ ─► │ 0-25% │ │ 25-50% │
└───────────┘ └─────────┘ └─────────┘
┌─────────┐ ┌─────────┐
│ 50-75% │ │ 75-100% │
└─────────┘ └─────────┘
The core choice is the shard key: the column whose value decides which shard a row goes to. Three common strategies:
- Hash-based.
shard = hash(user_id) % shard_count. Distributes evenly, no hot shards, but range queries ("all users created in March") have to fan out to every shard. - Range-based. Shard 1 holds IDs 0-1M, shard 2 holds 1M-2M, etc. Range queries are efficient — they hit one shard. But writes can pile up on the "newest" shard, creating a hot shard.
- Geo / tenant-based. Shard by country, or by customer. Natural for multi-tenant SaaS where one customer's data should stay together. Risk: one large customer overwhelms a shard.
Consistent hashing (covered properly in Module 7) is the technique most modern shard maps use. Instead of hash(key) % N, the keys and shards are placed on a logical ring, and a key goes to the next shard clockwise. The benefit: when you add or remove a shard, only ~1/N of keys need to move, instead of nearly all of them with naive mod-hashing.
The costs of sharding are real and underestimated:
- Cross-shard queries become expensive or impossible. A SQL JOIN across shards needs application-level orchestration or a sharding proxy.
- Transactions across shards need a distributed protocol (two-phase commit, or saga patterns — Module 6). Single-shard transactions still work; cross-shard ones don't, by default.
- Hot shards appear when your shard key is poorly chosen. The classic example: sharding tweets by user_id when one celebrity has 100M followers. That celebrity's shard is now serving all the read traffic.
- Re-sharding is painful. Doubling the shard count requires migrating data, often online, often with traffic. Plan for it from the start: choose more virtual shards than physical shards so you can move shards between hosts without re-keying everything.
The advice that almost always works: do not shard until you have proven the single-server option is exhausted. Vertical scaling (bigger machine), read replicas, caching, query optimisation, and archiving old data buy you many years of runway. Sharding is the last resort, not the first.
Replication — copies for read scale and failover
Replication is keeping copies of the data on multiple servers. It addresses two different problems at once: serving more read traffic and surviving the failure of a single node.
The two basic shapes:
- Primary-replica (single-leader). One node accepts writes. The others (replicas) receive a stream of the writes and apply them locally. Reads can hit any node, but writes go only to the primary. The simplest and most common.
- Multi-leader / multi-primary. Multiple nodes accept writes. They sync changes among themselves. Used in multi-region deployments where you want low-latency writes from every region. Brings conflict-resolution complexity — two regions can write the same row simultaneously.
- Leaderless. Any node accepts writes; reads ask multiple nodes and reconcile. Cassandra and Dynamo use this. Quorum-based reads and writes (Module 7) provide tunable consistency.
For most teams, primary-replica is the right answer. The configuration choice that matters most is synchronous vs asynchronous replication:
- Asynchronous. The primary commits and acknowledges the client before the replica has received the write. Replication lag (the gap between primary and replica) is normally small (milliseconds) but can spike. Risk: if the primary dies and you fail over to a replica, some recent writes may be lost.
- Synchronous. The primary waits for at least one replica to acknowledge before reporting success. Risk: write latency goes up. If the replica is slow or unreachable, writes stall.
Most production setups use a hybrid: synchronous to one local replica (for safety against single-node failure), asynchronous to remote replicas (for read scale and disaster recovery).
┌─────────┐ sync ┌──────────┐
Write─►│ Primary │ ───────► │ Replica1 │ (same datacentre)
└────┬────┘ └──────────┘
│
│ async
▼
┌──────────┐
│ Replica2 │ (other datacentre — DR)
└──────────┘
Three practical concerns once you have replicas:
- Read-your-writes consistency. A user posts a comment and immediately reloads. If the reload hits a lagging replica, the comment is missing. Fix: route a user's reads to the primary for a short window after their writes; or use sticky sessions; or read from a replica only if it has acknowledged a certain log position.
- Failover correctness. When the primary dies, a replica is promoted. If both old primary and new primary come back online, you have split brain and they will both accept conflicting writes. Use a consensus protocol (Raft, etcd, ZooKeeper) to elect a single primary — or use a managed service (RDS, Cloud SQL) that handles this for you.
- Read replicas are NOT a substitute for backups. A bad DELETE on the primary replicates to every replica in seconds. You still need point-in-time backups.
Object storage — the right place for files
Object storage (S3, GCS, Azure Blob) is the answer to a problem databases are bad at: storing files. Images, videos, PDFs, backups, ML training data, log archives — anywhere the unit is a blob, the access pattern is read-mostly, and you don't need to query inside the content.
What makes object storage different from a filesystem or a database BLOB column:
- Effectively unlimited capacity. You don't provision capacity. S3 is internally split across thousands of servers; from your perspective, you can write trillions of objects.
- Eleven 9s of durability (S3's claim: 99.999999999%). The object is replicated across multiple physical locations; losing it requires multiple datacentres failing simultaneously.
- Cheap per gigabyte, expensive per request. ~$0.02/GB/month for standard tier, ~$0.0004 per 1000 GET requests. Storing a billion small files is much more expensive than storing one billion-byte file.
- Flat namespace. There are no real directories — keys are just strings, and the slash is a convention used by tools to display them as a tree. Listing "everything under /users/42/" is a prefix scan, not a directory read.
- HTTP-native. You GET and PUT objects by URL. Combined with presigned URLs (a temporary URL that grants read or write access without exposing credentials), this is the basis of every modern file-upload flow.
Browser ──PUT (presigned URL)──► S3
│ │
│ ▼
│ Stored as object
│ │
▼ ▼
POST metadata ──► App ──► DB (file_id, owner, s3_key, size)
The canonical pattern for file uploads: the app signs a URL, the client uploads directly to S3 (bypassing your servers), then notifies the app of the new file ID. Your bandwidth bill drops, your app servers are free for application work, and the upload speed is whatever S3 can manage — which is fast.
Storage tiers let you trade access speed for cost. S3 Standard is for hot data. S3 Standard-IA (Infrequent Access) is cheaper per GB but charges retrieval fees. S3 Glacier is dirt cheap but retrieval takes minutes to hours. Move old data through these tiers with lifecycle policies — if a backup is two years old, it can almost always wait an hour to be retrieved.
What object storage is NOT:
- It is not a transactional store. Two clients writing the same key race; last write wins. There are no transactions, no foreign keys, no joins.
- It is not a search engine. Listing and finding by metadata is limited. Pair object storage with a database that holds the metadata and points at the object keys.
- It is not low-latency. A single GET is 50-200 ms — fine for serving downloads through a CDN, slow as the primary store for a hot API path.
NewSQL and specialised databases
Two more storage categories worth knowing because they show up in modern stacks.
NewSQL (CockroachDB, Spanner, TiDB, YugabyteDB) is the attempt to keep SQL's developer experience while gaining the horizontal scalability of NoSQL. They look like Postgres or MySQL on the wire — you write SQL, you get ACID transactions — but the storage engine is distributed underneath. Writes are coordinated via a consensus protocol (usually Raft), data is sharded automatically, and the cluster can grow by adding nodes.
The costs are real. Cross-node transactions have higher latency than single-node ones because of the consensus round-trip. Some SQL features (certain window functions, complex stored procedures) are limited or unsupported. Operational complexity is meaningfully higher than running a managed Postgres.
Reach for NewSQL when you genuinely need both: SQL semantics AND horizontal scale beyond what a single Postgres can handle. Most teams don't.
Time-series databases (InfluxDB, TimescaleDB, Prometheus's local store) are optimised for a very specific access pattern: append-mostly writes of (timestamp, tags, value) rows, then queries over time ranges with aggregations. They handle 100k-1M writes per second per node by exploiting that the timestamp is monotonic and data older than a few hours is rarely modified.
Use for: metrics, IoT telemetry, financial tick data. Don't shoehorn user data into a time-series database because someone said it was fast — it is fast for one particular shape of work.
Search engines (Elasticsearch, OpenSearch, Solr, Meilisearch) are optimised for full-text and faceted search across documents. They build inverted indexes that map every word to the documents containing it, plus relevance scoring (BM25), fuzzy matching, and aggregations. A naive LIKE '%term%' query in SQL is O(n); Elasticsearch returns the same answer over a billion documents in tens of milliseconds.
Pattern: keep the source of truth in a relational database, and pipe writes to a search index (often via a queue or change-data-capture). The index is rebuildable; the database is the system of record.
Vector databases (Pinecone, Weaviate, Milvus, pgvector) store embeddings — high-dimensional float vectors that represent the semantic content of text, images, or audio — and search by nearest neighbour. The query is "find the 10 vectors closest to this one" using cosine similarity or Euclidean distance. They are the storage layer of RAG (retrieval-augmented generation) systems.
Use when: you need semantic search ("find documents about retirement planning" without the document containing the words "retirement planning"). pgvector is fine until you need to scale past a few million vectors; specialised vector databases earn their keep at billions of vectors with sub-100ms queries.
The overall picture is that storage is no longer one decision. A real backend touches three to seven different stores, each picked for one job. The job of the architect is not to find the perfect database — it is to pick a small set, each playing the role it is good at, and to keep the boundaries between them clean. The next module covers what sits in front of those storage layers: caches.
⁂ Back to all modules