Databases
ACID, indexing, N+1, migrations, connection pooling — the database fundamentals.
A Brief History — Why Databases Exist At All
In the early days of computing, data was stored on magnetic tapes as flat files. Every application owned its own files, defined its own formats, and wrote its own code to read them.
The deeper problem was that the data was tightly coupled to the program. If you wanted to reorganize how data was stored on disk, every program that touched that data had to be rewritten. Data and logic were tangled together.
In 1970, Edgar F. Codd published a paper that changed everything. The short gist:
1. Data should be represented as flat tables (relations) of rows and columns. Nothing else. No pointers, no navigation, no hierarchy. Just tables.
2. There should be complete separation between the logical view of data (tables and their relationships) and the physical storage on disk. The application declares WHAT it wants. The database engine figures out HOW to get it.
This is called data independence — and it's the reason your application code doesn't have to know whether the database is using a B-tree, a hash index, or scanning a heap. You write SQL, and the engine handles the rest.
Every database concept that follows — indexes, query planners, transactions, isolation levels — exists in service of this one idea: let humans think in terms of meaning, and let the machine figure out the physics.
Relational vs Non-Relational
Relational (SQL) — PostgreSQL, MySQL, SQLite
• Data in tables with rows and columns
• Strong schema enforcement
• ACID transactions
• Powerful JOIN queries
• Best for: structured data with relationships, financial systems, anything needing consistency
Document (NoSQL) — MongoDB, DynamoDB
• Data as JSON documents in collections
• Flexible schema
• Horizontal scaling built-in
• Best for: varying structures, hierarchical data, high write throughput
Others:
• Key-Value: Redis, DynamoDB (caching, sessions)
• Wide-Column: Cassandra (time-series, IoT)
• Graph: Neo4j (social networks, recommendation engines)
• Time-Series: InfluxDB, TimescaleDB (metrics, monitoring)
Most apps: PostgreSQL + Redis. Add others only for specific needs.
ACID Transactions
ACID guarantees that database operations are reliable:
Atomicity — Either ALL operations in a transaction succeed, or NONE do. No partial updates.
Example: Transfer $100 from Alice to Bob. Debit Alice AND credit Bob must both succeed or both fail (see the SQL sketch after the isolation levels below).
Consistency — The database is always in a valid state. Constraints (foreign keys, unique indexes) are enforced.
Isolation — Concurrent transactions don't see each other's partial work. Multiple users editing simultaneously don't corrupt each other's data.
Durability — Once committed, data survives crashes. Written to disk, not just memory.
Isolation levels (from weakest to strongest):
Read Uncommitted → Read Committed → Repeatable Read → Serializable
PostgreSQL default: Read Committed. For financial operations: Serializable.
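Here's the transfer example as a minimal SQL sketch (the accounts table and its columns are hypothetical):

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;  -- strongest level; the default is Read Committed

UPDATE accounts SET balance = balance - 100 WHERE owner = 'alice';
UPDATE accounts SET balance = balance + 100 WHERE owner = 'bob';

COMMIT;  -- both updates land together; on any error, ROLLBACK undoes both

If the second UPDATE fails (say, a CHECK constraint rejects a negative balance), the whole transaction rolls back and Alice keeps her $100.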
Inside the Engine — How an RDBMS Actually Works
Most developers treat the database as a black box. Lifting the lid even a little makes you dramatically better at debugging slow queries.
Storage layer: pages and the heap
Data on disk is stored in fixed-size blocks called pages — typically 8 KB in PostgreSQL, 16 KB in InnoDB. A table is just a collection of pages.
Each page has three parts:
• Header — metadata about the page
• Slot array — pointers to where each row starts within the page
• Row data — packed from the bottom of the page upward
When you insert a row, the engine finds a page with enough free space, writes the row, and adds an entry to the slot array. When you delete, the slot is marked dead but the space isn't reclaimed immediately — a background process (VACUUM in PostgreSQL, purge in InnoDB) cleans up later. That's why a heavily updated table can grow much larger than the data it holds.
┌────────── Page (8 KB) ──────────┐
│ Header │
│ Slot 1 → row at offset 7800 │
│ Slot 2 → row at offset 7600 │
│ ... │
│ (free space) │
│ ... │
│ [Row 2 data] │
│ [Row 1 data] │
└─────────────────────────────────┘
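PostgreSQL actually exposes this addressing: every row has a hidden ctid column holding its (page, slot) pair. A quick illustration, assuming a users table:

SELECT ctid, id, name FROM users LIMIT 3;
--  ctid  | id | name
--  (0,1) | 1  | Alice    ← page 0, slot 1
--  (0,2) | 2  | Bob
--  (0,3) | 3  | Carol

Update a row and its ctid changes: the new version is written elsewhere, a preview of MVCC below.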
The B-Tree index
A full table scan (reading every page) is O(n) — fast for 1,000 rows, catastrophic for 10 million. An index changes the game.
A B-Tree index is a self-balancing tree where leaf nodes store the indexed values plus pointers (page number + slot number) back to the actual rows in the heap. Lookups walk the tree top-down — O(log n).
When you query WHERE id = 42:
1. Engine walks the B-Tree from root to leaf — typically 3 to 4 hops for millions of rows
2. Leaf node says "row 42 is on page 1037, slot 5"
3. Engine reads page 1037 directly and returns the row
That's why indexed lookups feel instant even on huge tables.
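You can confirm the engine takes this path. A sketch, assuming a users table with an email column (plan output abbreviated):

CREATE INDEX idx_users_email ON users (email);

EXPLAIN SELECT * FROM users WHERE email = 'alice@example.com';
--  Index Scan using idx_users_email on users
--    Index Cond: (email = 'alice@example.com'::text)
--  (without the index, or on a tiny table, you'd see "Seq Scan on users" instead)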
How a Query Actually Runs
When you send a SQL query, four things happen — in this order:
1. Parsing — The text is converted into an Abstract Syntax Tree (AST). Syntax errors are caught here ("you wrote SELEKT instead of SELECT").
2. Semantic analysis — The engine resolves table names and column names, checks they exist, and verifies you have permission to read them. Errors like "column 'foo' does not exist" come from this stage.
3. Query optimization — This is where the magic happens. The optimizer takes your logical query and generates many possible physical execution plans:
• Should it use the index on email or do a full scan?
• Should it join customers to orders, or orders to customers?
• Should it use a hash join, a merge join, or a nested-loop join?
The optimizer estimates the cost of each plan using statistics it keeps about each table (row count, value distributions, index sizes), and picks the cheapest one. This is a hard problem — IBM's System R prototype in the 1970s pioneered cost-based optimization, and modern engines still build on those ideas.
4. Execution — The chosen plan runs, pulling rows through a pipeline of operators (scan → filter → sort → aggregate → return).
You can see exactly what the optimizer chose:
EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id)
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2024-01-01'
GROUP BY u.name;
EXPLAIN ANALYZE shows the plan, the estimated row counts, the actual row counts, and how long each step took. When a query is slow, this is the first place to look.
Durability & Concurrency — WAL and MVCC
Two more concepts that matter when things get serious.
Write-Ahead Log (WAL) — How durability is guaranteed.
Before any change is made to a data page, the change is first written to an append-only log on disk. The actual data pages are written lazily in the background.
If the server crashes mid-write:
1. Most recent data pages might not have hit disk yet
2. On restart, the engine replays the WAL to redo committed transactions
3. Incomplete transactions are rolled back
This is how databases promise "if it's committed, it's safe" without paying the cost of synchronizing every page write.
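You can watch the log advance in PostgreSQL: each commit moves the WAL position (the LSN, log sequence number) forward. A quick illustration, assuming a users table:

SELECT pg_current_wal_lsn();  -- e.g. 0/16B3748
INSERT INTO users (name) VALUES ('Alice');
SELECT pg_current_wal_lsn();  -- e.g. 0/16B37F0 (the insert was logged before any page write)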
MVCC — Multi-Version Concurrency Control.
Old systems used locking: if writer A is updating row 42, readers wait until A is done. Slow at scale.
Modern databases (PostgreSQL, InnoDB, Oracle) use MVCC instead:
• Each row has a version chain
• A writer creates a NEW version of a row rather than overwriting the old one
• Each transaction sees a consistent snapshot of the database as of when the transaction started
• Readers never block writers; writers never block readers
The trade-off: dead row versions accumulate, requiring background cleanup (VACUUM in PostgreSQL). That's why PostgreSQL's autovacuum tuning matters on write-heavy tables.
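PostgreSQL exposes the versioning through two hidden columns: xmin (the transaction that created this row version) and xmax (the transaction that deleted or replaced it; 0 while it's live). A quick illustration, reusing the hypothetical accounts table:

SELECT xmin, xmax, balance FROM accounts WHERE owner = 'alice';
--  xmin | xmax | balance
--  731  | 0    | 400        ← created by transaction 731, still the live version

UPDATE accounts SET balance = 500 WHERE owner = 'alice';

SELECT xmin, xmax, balance FROM accounts WHERE owner = 'alice';
--  xmin | xmax | balance
--  732  | 0    | 500        ← a NEW version; the old one is dead, waiting for VACUUM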
When NoSQL — and Which Kind
NoSQL isn't one thing. The label covers four very different families, each born to solve a specific problem that relational databases struggled with.
Document databases — MongoDB, DynamoDB, Couchbase
The premise: many applications work naturally with self-contained documents (a user profile with nested addresses, a blog post with embedded comments). Forcing this into normalized tables with joins adds unnecessary complexity. Documents = JSON-shaped, flexible schema, fast on read.
When to choose: content-heavy apps, varying record shapes, hierarchical data.
Column-family stores — Cassandra, Bigtable, HBase
The premise: write at massive scale across multiple data centers, with no single point of failure. Cassandra came out of Facebook in 2008; Google published the Bigtable paper in 2006 — both born to handle data sizes traditional databases simply couldn't.
When to choose: massive write throughput, time-series at scale, distributed systems with strict availability requirements.
Graph databases — Neo4j, ArangoDB
The premise: highly connected data — social networks, recommendation engines, fraud detection — is natural as nodes and edges, but expensive in SQL (a 5-level relationship means 5 self-joins, with intermediate results growing at every level). Graph databases store relationships as first-class citizens.
When to choose: traversal-heavy queries ("friends of friends who liked X").
Time-series databases — InfluxDB, TimescaleDB, Prometheus
The premise: metrics and IoT data are append-only, time-ordered, and queried in ranges — completely different access patterns from transactional data.
When to choose: monitoring, IoT telemetry, anything where time is the primary axis.
The takeaway: every database is solving the same gap — between how humans want to think about data and how machines actually store it — with a different trade-off. Pick based on access patterns, not hype.
Database Concepts Used Daily
A few more fundamentals you'll encounter every week as a backend engineer:
Data types — int, varchar, text, uuid, timestamp, jsonb. Pick the smallest type that fits. uuid for IDs (never auto-increment integers in distributed systems). timestamptz for any time field, always in UTC.
Primary keys — uniquely identify each row. Use UUIDs in modern systems; they're collision-safe across distributed nodes.
Foreign keys — reference another table's primary key, enforcing relationships at the database level (not just in app code).
Referential integrity — the database refuses operations that would orphan records. CASCADE deletes children when a parent is removed. SET NULL nulls out the foreign key. RESTRICT blocks the delete entirely.
Constraints — NOT NULL, UNIQUE, CHECK (price > 0), DEFAULT 0. Enforce invariants at the data layer so bad data can't sneak in through any code path.
Triggers — code that runs automatically on INSERT/UPDATE/DELETE. Common use: auto-updating an updated_at column. Use sparingly — they make debugging harder because behavior happens "magically".
Views — a saved SELECT statement, queryable like a table. Useful for hiding complexity behind a stable interface, exposing only some columns to certain roles, or pre-joining frequently-combined tables. The view is recomputed each time it's queried (unless you use a materialized view, which caches the result).
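Most of these in one schema sketch (PostgreSQL flavor; the tables and names are illustrative):

CREATE TABLE users (
  id         uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- built into PostgreSQL 13+
  email      varchar(255) NOT NULL UNIQUE,                -- constraints: required, no duplicates
  created_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE orders (
  id      uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE,  -- referential integrity
  total   numeric NOT NULL CHECK (total > 0)                     -- invariant at the data layer
);

-- Trigger: auto-maintain updated_at (the classic "use sparingly" case)
CREATE FUNCTION touch_updated_at() RETURNS trigger AS $$
BEGIN
  NEW.updated_at := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_touch BEFORE UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION touch_updated_at();

-- View: hide the join behind a stable interface
CREATE VIEW user_order_totals AS
SELECT u.email, SUM(o.total) AS lifetime_total
FROM users u
JOIN orders o ON o.user_id = u.id
GROUP BY u.email;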
Indexing
Without an index, a query does a full table scan — O(n) where n is the row count. An index is a sorted data structure (typically a B-Tree) that allows O(log n) lookups.
Always index:
• Primary keys (automatic)
• Foreign keys
• Columns used in WHERE clauses
• Columns used in JOIN conditions
• Columns used in ORDER BY (if queried frequently)
Composite indexes: index on (a, b) helps queries on a alone or a+b together, NOT b alone. Order in the index matters.
Partial indexes: index WHERE status = 'pending' — only indexes a subset of rows.
Unique indexes: enforce uniqueness at the database level (not just application level).
Cost: Indexes speed up reads but slow down writes (must update the index). Too many indexes on a write-heavy table is a problem.
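The variants above in DDL form (table names are hypothetical):

-- Composite: serves WHERE user_id = ? and user_id = ? AND created_at > ?, but NOT created_at alone
CREATE INDEX idx_orders_user_created ON orders (user_id, created_at);

-- Partial: only pending rows enter the index, keeping it small and fast for the hot query
CREATE INDEX idx_orders_pending ON orders (created_at) WHERE status = 'pending';

-- Unique: duplicates rejected by the database itself
CREATE UNIQUE INDEX idx_users_email ON users (email);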
N+1 Query Problem
The most common ORM pitfall:
// Bad: N+1 queries
const users = await db.users.findMany(); // 1 query
for (const user of users) {
  user.posts = await db.posts.findMany({ where: { userId: user.id } }); // N queries
}
// If 100 users → 101 queries to the database!

// Good: 2 queries (or 1 JOIN)
const users = await db.users.findMany({
  include: { posts: true } // JOIN or batch load in one query
});
OR use a JOIN:
SELECT users.*, posts.*
FROM users
LEFT JOIN posts ON posts.user_id = users.id
WHERE users.id IN (1, 2, 3);
Always monitor query count in development. ORMs make N+1 dangerously easy to write.
Migrations
A migration is a versioned, ordered script that changes the database schema. It's version control for your database.
Never modify the database manually in production. Write a migration:
-- 20240115_add_user_phone.sql
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
CREATE INDEX idx_users_phone ON users(phone);
Migration tools: Flyway, Liquibase, Alembic (Python), golang-migrate, Prisma Migrate.
Rules:
• Migrations are immutable — never edit a committed migration
• Make migrations reversible when possible (include a rollback, as sketched after this list)
• Run migrations before deploying new code
• In CI/CD, apply migrations as part of deployment
• Never drop columns immediately — mark deprecated, then remove after old code is gone
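A matching rollback for the earlier phone migration — a sketch, since each tool has its own convention for down scripts:

-- 20240115_add_user_phone.down.sql
DROP INDEX idx_users_phone;
ALTER TABLE users DROP COLUMN phone;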
Connection Pooling
Opening a database connection is expensive: a TCP handshake, authentication, and (in PostgreSQL) a brand-new server process, easily tens of milliseconds or more. Creating a new one per request would kill performance.
A connection pool maintains a set of pre-opened connections. When your code needs to query, it borrows one from the pool, uses it, and returns it.
Configuration that matters:
• min connections: always keep this many open (2-5 typical)
• max connections: cap on simultaneous connections (match DB max_connections)
• acquire timeout: how long to wait for a connection before throwing an error
• idle timeout: close connections unused for this long
PostgreSQL max_connections default: 100. With 10 app servers, cap each pool at 9 (10 × 9 = 90, leaving headroom for admin tools and maintenance).
Use PgBouncer as a connection pooler between your app and PostgreSQL in high-concurrency setups.
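Two quick PostgreSQL checks to see how close you are to the ceiling:

SHOW max_connections;                   -- the server-side cap (default 100)
SELECT count(*) FROM pg_stat_activity;  -- connections open right now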
The Backend from First Principles series is based on what I learnt from Sriniously's YouTube playlist — a thoughtful, framework-agnostic walk through backend engineering. If this material helped you, please go check the original out: youtube.com/@Sriniously. The notes here are my own restatement for revisiting later.