Full-Text Search
Inverted indexes, BM25, Elasticsearch — why LIKE queries don't scale.
Why SQL LIKE Falls Apart
The instinct, when a user wants to search a database, is to write something like:
SELECT * FROM articles WHERE title LIKE '%machine learning%';
This works for a few hundred rows. It falls apart everywhere else, for two reasons.
It's slow. The leading wildcard in '%machine learning%' means the database can't use a standard B-tree index — it has to scan every row, comparing every title character by character. On 10 million articles, that's seconds per query. Multiply by every concurrent user.
It's dumb. LIKE has no idea what you actually meant.
• "machine learning" doesn't match "Machine Learning" without LOWER() everywhere
• "ML" doesn't match "machine learning"
• A typo like "machien" matches nothing
• A document mentioning the term once ranks the same as one mentioning it 20 times
Real search needs to flip the data model entirely.
The Inverted Index
The trick that makes search fast is the inverted index. Instead of asking "for each document, does it contain this word?", we ask "for this word, which documents contain it?"
For every searchable word, we build a list of documents (and where in those documents the word appears, and how often).
"machine" → [
Doc 1 ("Intro to Machine Learning"): pages [2, 88, 108], frequency: high
Doc 2 ("Art of Cooking"): page [20], frequency: low
]
"learning" → [
Doc 1: pages [2, 14, 88], frequency: high
Doc 5: pages [3], frequency: medium
]
When the user searches "machine learning", we look up both words in the index — instantly — and intersect the results. We never scan a single document.
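A toy version makes the shape concrete. This minimal in-memory sketch in Python — toy documents, a whitespace tokenizer, AND-only matching — is a simplification, not how a production engine is built:

from collections import defaultdict

# Toy corpus — stand-ins for real documents.
docs = {
    1: "intro to machine learning",
    2: "the art of cooking",
    5: "learning to cook fast",
}

# word -> {doc_id: [positions]}: one posting list per word.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        index[word].setdefault(doc_id, []).append(pos)

def search(query):
    """AND semantics: intersect the posting lists of every query word."""
    postings = [set(index.get(word, {})) for word in query.split()]
    return set.intersection(*postings) if postings else set()

print(search("machine learning"))  # {1} — found without scanning any document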
This data structure is what makes Google possible. It's also what makes search fast in PostgreSQL (via tsvector), Elasticsearch, MongoDB Atlas Search, and every other modern search engine.
Relevance Scoring — BM25
Once you have matching documents, you need to rank them. "Most relevant first" isn't obvious — what does relevance even mean?
BM25 (Best Match 25) is the de facto industry-standard relevance algorithm, used by Elasticsearch, Solr, and Lucene. It scores each document on three intuitions:
1. Term frequency — A document mentioning "machine learning" 20 times is probably more relevant than one mentioning it once. But not 20× more relevant — there are diminishing returns. (BM25 caps how much repetition helps.)
2. Inverse document frequency — A common word like "the" appearing in your document doesn't tell us much. A rare word like "transformer" is much more telling. BM25 weights rare words higher than common ones.
3. Document length normalization — A 50,000-word book that mentions "machine learning" 10 times is less focused than a 500-word article that mentions it 10 times. BM25 normalizes for length.
The result: when you search "machine learning", a focused article with the exact phrase in the title outranks a long unrelated document that happens to mention the words in passing.
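The formula combines all three intuitions. Here's a Python sketch of one common formulation, with k1 and b at their usual defaults (1.2 and 0.75) — the variable names are ours:

import math

def bm25(query_terms, tf, doc_len, avg_len, df, n_docs, k1=1.2, b=0.75):
    """Score one document for a query (one common BM25 formulation).

    tf:  dict word -> count of that word in this document
    df:  dict word -> number of documents containing the word
    """
    score = 0.0
    for term in query_terms:
        f = tf.get(term, 0)
        if f == 0:
            continue
        # Intuition 2: rare terms (low document frequency) weigh more.
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        # Intuitions 1 and 3: term frequency saturates (k1), and is
        # normalized by document length relative to the corpus average (b).
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * doc_len / avg_len))
    return score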
Elasticsearch — The Industry Standard
Elasticsearch wraps Apache Lucene (the inverted-index library) with distributed clustering, REST APIs, and a query DSL. It's the default for most production search.
Here's what a basic boosted query looks like:
{
  "query": {
    "multi_match": {
      "query": "machine learning",
      "fields": ["title^3", "subtitle^2", "description^1", "content"],
      "fuzziness": "AUTO"
    }
  }
}
title^3 means a match in the title contributes 3× as much to the relevance score as a match in an unboosted field. So "Introduction to Machine Learning" outranks a cooking book that happens to mention machine learning in a footnote.
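From application code, you'd send that JSON through a client library. A sketch with the official Python client — the index name, field list, and local cluster URL are all assumptions:

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

resp = es.search(index="articles", query={
    "multi_match": {
        "query": "machine learning",
        "fields": ["title^3", "subtitle^2", "description", "content"],
        "fuzziness": "AUTO",
    },
})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])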
What Elasticsearch gives you on top of basic search:
- Fuzzy matching — "machien" still finds "machine" (handles typos via edit distance)
- Stemming — "running", "runs", and "ran" all match a search for "run"
- Stopword filtering — "the", "of", "a" are stripped before indexing
- Synonyms — "automobile" matches "car" with the right config
- Field boosting — title matches outrank body matches
- Aggregations / facets — "150 results, 80 in 'Tech', 50 in 'Science', 20 in 'Business'"
- Autocomplete — type-as-you-search suggestions
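Several of these — lowercasing, stopword filtering, stemming — fall out of the analysis pipeline, which runs the same transformation over documents at index time and over queries at search time. A toy Python version of that pipeline (a real engine uses a proper Porter/Snowball stemmer, not this naive suffix-stripper):

import re

STOPWORDS = {"the", "of", "a", "an", "and", "to", "in", "is"}

def naive_stem(token):
    # Toy stemmer; real analyzers use Porter/Snowball, and irregular
    # forms ("ran" -> "run") additionally need a lemma dictionary.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    # Collapse a doubled final consonant left behind ("runn" -> "run").
    if len(token) >= 2 and token[-1] == token[-2] and token[-1] not in "aeiou":
        token = token[:-1]
    return token

def analyze(text):
    """The same steps run at index time and query time, so the tokens
    stored in the index and the tokens searched for line up."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(analyze("The Running of the Machines"))  # ['run', 'machine']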
The Cost of Elasticsearch
Elasticsearch is powerful, but it isn't a database — and treating it like one will burn you.
It's eventually consistent. New documents you write don't appear in search results immediately — by default the index refreshes once per second, so there's a visible indexing lag. If you index a new product and immediately search for it, it may not be there yet.
It has weaker durability than a real database. PostgreSQL has WAL and crash recovery built into its DNA. Elasticsearch can lose recent writes during cluster rebalancing or node failures. Never use it as your primary store for anything you can't lose.
It's expensive to operate. Multi-node clusters, JVM tuning, shard rebalancing, index lifecycle management — there's a real ops cost. Cloud-managed offerings (Elastic Cloud, AWS OpenSearch) reduce this but cost money.
The right pattern, almost always:
• PostgreSQL or your transactional database remains the source of truth
• A worker reads new/updated rows and indexes them into Elasticsearch (often via change-data-capture)
• User search queries hit Elasticsearch
• Once the user clicks a result, you fetch the canonical record from the primary database
This way: Postgres has durability, Elasticsearch has fast search, and the world stays sane.
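As a sketch, that worker can be as simple as a polling loop. Everything here — table name, updated_at column, connection strings — is an assumption, and production systems usually tail the WAL with a CDC tool like Debezium instead:

import time

import psycopg2                      # pip install psycopg2-binary
from elasticsearch import Elasticsearch

pg = psycopg2.connect("dbname=app")              # assumed connection
es = Elasticsearch("http://localhost:9200")      # assumed local cluster

def sync_once(last_seen):
    """Copy rows changed since last_seen from Postgres into the search index."""
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, title, body, updated_at FROM articles"
            " WHERE updated_at > %s ORDER BY updated_at",
            (last_seen,),
        )
        for doc_id, title, body, updated_at in cur.fetchall():
            es.index(index="articles", id=doc_id,
                     document={"title": title, "body": body})
            last_seen = updated_at
    return last_seen

last_seen = "1970-01-01"
while True:          # naive poll loop; real CDC would tail the WAL instead
    last_seen = sync_once(last_seen)
    time.sleep(5)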
When You Don't Need Elasticsearch
For many apps, Elasticsearch is overkill. Cheaper options:
PostgreSQL full-text search — Built right into Postgres. Surprisingly capable for moderate scale.
-- Add a generated tsvector column (coalesce guards against NULLs,
-- which would otherwise make the whole vector NULL)
ALTER TABLE articles ADD COLUMN search_vec tsvector
  GENERATED ALWAYS AS (
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
  ) STORED;

-- Index it
CREATE INDEX articles_search_idx ON articles USING GIN (search_vec);

-- Query
SELECT title, ts_rank(search_vec, query) AS rank
FROM articles, to_tsquery('english', 'machine & learning') query
WHERE search_vec @@ query
ORDER BY rank DESC
LIMIT 20;
This handles stemming and ranking, and it's fast enough for hundreds of thousands of documents — maybe a few million with tuning. No new infrastructure.
SQLite FTS5 — If you're using SQLite (mobile apps, small services), FTS5 is built in and excellent.
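A quick taste via Python's standard-library sqlite3 module (assuming your SQLite build includes FTS5, which most do):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
con.executemany("INSERT INTO docs VALUES (?, ?)", [
    ("Intro to Machine Learning", "Gradient descent, loss functions..."),
    ("Art of Cooking", "Knife skills, heat control, seasoning."),
])

# FTS5 ships a built-in bm25() ranking function; more-negative scores
# are better matches, so ascending order puts the best result first.
rows = con.execute(
    "SELECT title, bm25(docs) FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    ("machine learning",),
).fetchall()
print(rows)  # only the ML article matches both terms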
Algolia / Meilisearch / Typesense — Algolia is hosted search-as-a-service; Meilisearch and Typesense are open-source engines with managed cloud options. All are easier to set up than self-hosted Elasticsearch, but hosted pricing scales with index size and query volume.
The decision framework: start with your database's built-in search. Move to a dedicated search engine only when you measurably outgrow it — when query times are too slow, when you need real fuzzy matching, synonyms, or facets, or when the search workload is hurting your transactional database.