Pattern Search

LIKE, ILIKE, regex, trigram, full-text — the spectrum of find rows that match a string pattern.

Pattern Foundational
7 min read
sql sql-pattern search full-text indexing

What it is#

Pattern search is the family of SQL queries that ask “find rows whose text column matches a pattern.” The pattern can be a literal prefix, a sub-string, a wildcard expression, a regular expression, a fuzzy match, or a natural-language query against a full-text index. SQL gives you a spectrum of operators with very different cost models:

WHERE name LIKE 'Acme%' -- prefix, indexable
WHERE name LIKE '%acme%' -- substring, scans by default
WHERE name ILIKE '%Acme%' -- case-insensitive (Postgres)
WHERE name ~ '^Ac[me]+$' -- regex (Postgres)
WHERE name REGEXP '^Ac[me]+$' -- regex (MySQL)
WHERE MATCH(body) AGAINST('acme corp') -- full-text (MySQL)
WHERE to_tsvector(body) @@ to_tsquery('acme & corp') -- FTS (Postgres)

The cheapest operator is a LIKE with an anchored prefix on an indexed column — that’s an index range scan. The most expressive is a tokenised full-text index — that’s a posting-list lookup with ranking. Most performance bugs in this area come from picking an operator whose cost model the planner couldn’t satisfy with the available indexes.

When to use it#

Reach for pattern search when the question is “find text that looks like X”:

  • Type-ahead and autocomplete. LIKE 'query%' on an indexed name column. Cheap and accurate for prefixes.
  • Substring containment. Finding email addresses with a domain, log lines with an error code. Use LIKE '%pattern%' with a trigram index, or full-text for word-level matching.
  • Structured pattern matching. Validating that a column looks like a phone number or extracting a code — regex.
  • Natural-language search. Searching article bodies, product descriptions, comment threads — full-text search with stemming, stopwords, and ranking.
  • Fuzzy / typo-tolerant search. Names that might be misspelled — trigram similarity (pg_trgm on Postgres) or Levenshtein in extensions.

Reach for a real search engine (Elasticsearch, OpenSearch, Meilisearch) when you outgrow what the relational engine can do — multi-field ranking, faceting, synonyms, language-aware analyzers at scale. The boundary is usually around tens of millions of documents or when relevance tuning becomes the product.

How it works#

LIKE is the standard operator. Two wildcards: % matches any number of characters, _ matches one. ESCAPE '\' lets you match literal % and _.

-- Users whose email is on the acme.com domain
SELECT user_id, email
FROM users
WHERE email LIKE '%@acme.com';

The planner cannot use a plain B-tree index on email for that query because the pattern is right-anchored (the wildcard is on the left). An index range scan needs a prefix it can range over; %@acme.com has no fixed prefix. The plan degrades to a sequential scan.

Anchoring the wildcard on the right lets the index work:

SELECT user_id, email
FROM users
WHERE email LIKE 'alice%';
-- B-tree on (email) does a range scan: [alice, alicf) and stops.

For substring search on indexed columns, Postgres has trigram indexes via the pg_trgm extension:

CREATE EXTENSION pg_trgm;
CREATE INDEX users_email_trgm ON users USING gin (email gin_trgm_ops);
-- Now this uses the index:
SELECT user_id, email FROM users WHERE email LIKE '%@acme.com';
SELECT user_id, email FROM users WHERE email ILIKE '%acme%';

The trigram index breaks each value into overlapping three-character sequences and indexes them as a GIN posting list. Substring queries match the trigrams of the pattern. There’s overhead — the index is larger than a B-tree and writes are slower — but it makes the previously un-indexable queries scale.

For word-level text search, both major engines have full-text indexes.

-- Postgres
CREATE INDEX articles_body_fts ON articles
USING gin (to_tsvector('english', body));
SELECT id, title
FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'database & index')
ORDER BY ts_rank(to_tsvector('english', body), to_tsquery('english', 'database & index')) DESC
LIMIT 20;
-- MySQL
ALTER TABLE articles ADD FULLTEXT INDEX (body);
SELECT id, title, MATCH(body) AGAINST('database index' IN NATURAL LANGUAGE MODE) AS score
FROM articles
WHERE MATCH(body) AGAINST('database index' IN NATURAL LANGUAGE MODE)
ORDER BY score DESC
LIMIT 20;

Both engines tokenise the text, drop stopwords, optionally stem, and index the resulting word list. Queries match against the analysed form and return ranked results. The cost model is “posting-list intersection,” roughly proportional to the rarest term — not the table size.

Variants#

Case sensitivity. Postgres LIKE is case-sensitive; ILIKE is the case-insensitive twin. MySQL’s LIKE follows the column’s collation — utf8mb4_general_ci is case-insensitive, utf8mb4_bin is sensitive. SQL Server: collation determines both. SQLite: LIKE is ASCII-case-insensitive by default; GLOB is case-sensitive.

Regex. Postgres uses ~ (case-sensitive), ~* (case-insensitive), !~ (negated). MySQL: REGEXP / RLIKE, case-insensitive by default. Oracle: REGEXP_LIKE(col, pattern). SQL Server has no native regex — LIKE patterns or CLR.

Trigram similarity. pg_trgm’s % operator (different from the % wildcard) does “similarity ≥ threshold,” useful for typo tolerance: WHERE name % 'Jhon' finds “John” via shared trigrams.

tsvector columns. Materialise the analysed form as a generated column to avoid re-analysing on every query:

ALTER TABLE articles
ADD COLUMN body_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', body)) STORED;
CREATE INDEX articles_body_tsv ON articles USING gin (body_tsv);
SELECT id FROM articles WHERE body_tsv @@ to_tsquery('english', 'database');

Engine-specific full-text features. Postgres FTS supports phrase queries (<->), proximity, weight classes (title vs body), and language configurations. MySQL InnoDB FTS supports boolean mode (+required -excluded ~deprioritized) and natural-language ranking. SQL Server has CONTAINS and FREETEXT against full-text catalogs.

Common pitfalls#

  • Leading wildcard kills B-tree index. LIKE '%foo' and LIKE '%foo%' cannot use a normal B-tree. Either add a trigram index, change the query to a prefix match, or store a reversed copy of the column for suffix queries (LIKE 'oof%' on reverse(col)).
  • Collation surprises. A column declared with utf8mb4_general_ci matches case-insensitively even when you wrote LIKE 'Acme%'. Verify with SHOW CREATE TABLE before debugging missed matches.
  • ILIKE doesn’t use B-tree on Postgres. ILIKE 'acme%' cannot use a plain B-tree because it would have to scan all case variants. Either index lower(col) and query WHERE lower(col) LIKE 'acme%', or use a trigram index, or use a citext column.
  • Regex catastrophic backtracking. ^(a+)+$ against a long string of as with a final b takes exponential time. Most engines run an NFA-based matcher that can backtrack pathologically. Avoid nested quantifiers; use anchored, atomic groups when supported.
  • Full-text stopwords. “to be or not to be” returns nothing in default English FTS because every word is a stopword. Either change configuration or use a different operator for short common-word queries.
  • Mixing LIKE and full-text expectations. LIKE '%database%' matches “databases” because it’s a substring; FTS with English stemming matches “database” → “databases” via stemming but might miss “DB”. They feel similar but have different boundaries.
  • % and _ as literals. Searching for “50% off” with LIKE '%50% off%' does the wrong thing — the middle % is a wildcard. Escape it: LIKE '%50\% off%' ESCAPE '\'.
  • Trigram on short patterns. Trigrams need ≥3 characters to be useful. LIKE '%ab%' falls back to a scan even with a trigram index.

Practice problem#

You have a 100M-row articles(id, title, body, published_at) table. The product team wants a search box that finds articles whose body contains a phrase the user types, ordered by recency. The query needs to return in under 200 ms p95. Walk through which operator and index you’d pick, and explain why a plain LIKE '%phrase%' won’t meet the latency target.

Solution

Use Postgres full-text search with a stored tsvector column and a GIN index, then order by recency from a B-tree:

ALTER TABLE articles
ADD COLUMN body_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', body)) STORED;
CREATE INDEX articles_body_tsv ON articles USING gin (body_tsv);
CREATE INDEX articles_published_at ON articles (published_at DESC);
SELECT id, title, published_at
FROM articles
WHERE body_tsv @@ phraseto_tsquery('english', :query)
ORDER BY published_at DESC
LIMIT 50;

LIKE '%phrase%' does a sequential scan of all 100M rows — single-digit minutes at best. The GIN index reduces the candidate set to documents containing the phrase tokens (typically thousands), and the B-tree on published_at lets the planner pick a BitmapAnd plan or a merge with the recency index. If recency dominates over relevance, an index on (published_at DESC) WHERE body_tsv @@ ... isn’t possible, so the planner does a bitmap heap scan; a covering index on published_at INCLUDE (id, title) keeps the sort cheap.

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.