Aggregates, GROUP BY, HAVING
COUNT / SUM / AVG / MIN / MAX, the GROUP BY contract, why HAVING is not WHERE.
What it is#
Aggregate functions collapse a set of rows into a single value: COUNT, SUM, AVG, MIN, MAX, plus a handful of statistical ones (STDDEV, VARIANCE, PERCENTILE_CONT). On their own they produce one row per query. Pair them with GROUP BY and they produce one row per group. HAVING filters those groups, post-aggregation, the way WHERE filters individual rows pre-aggregation.
The rule that catches new users: every column in the SELECT list either appears in the GROUP BY clause or is wrapped in an aggregate. Otherwise SQL has no way to choose which row’s value to show (the value of name for a group of 10 users — whose name?). Some engines (MySQL with ONLY_FULL_GROUP_BY off) silently pick any value; standard SQL and Postgres reject the query.
When to use it#
- Counts and totals —
SELECT COUNT(*) FROM orders WHERE ...is the universal “how many” query. - Per-group summaries — sales by region, signups by day, error rate by endpoint. The bread-and-butter analytics query.
- Top-N per group with aggregates — when you want each group’s max / latest / sum.
- De-duplication via aggregation —
SELECT user_id, MAX(created_at) FROM events GROUP BY user_idcollapses many rows per user to one. - Post-aggregate filtering —
HAVING COUNT(*) > 10to find groups meeting a size threshold.
Reach for window functions (separate writeup) when you want per-row aggregates without collapsing — “running total”, “rank within group”, “moving average”.
How it works#
The five basics#
SELECT COUNT(*) AS total_rows, -- counts rows, including NULLs COUNT(email) AS with_email, -- counts non-NULL emails COUNT(DISTINCT email) AS unique_emails, -- counts distinct non-NULL SUM(amount_cents) AS total_cents, -- sums non-NULL values; NULL if no rows AVG(amount_cents) AS avg_cents, -- average of non-NULL; NULL if no rows MIN(created_at) AS first_seen, MAX(created_at) AS last_seenFROM orders;NULL handling is uniform: aggregates skip NULLs in the input (except COUNT(*) which counts rows). An empty input returns NULL for SUM / AVG / MIN / MAX and 0 for COUNT. Use COALESCE(SUM(x), 0) if you want zero for empty groups.
GROUP BY#
SELECT region, COUNT(*) AS n, SUM(total_cents) AS revenueFROM ordersGROUP BY region;One row per distinct region. Every non-aggregate column in SELECT must appear in GROUP BY (or be functionally determined by what’s in GROUP BY — e.g., grouping by users.id lets you also select users.name since name is functionally dependent on id; Postgres recognizes this).
Multi-column grouping:
SELECT region, status, COUNT(*) AS nFROM ordersGROUP BY region, statusORDER BY region, n DESC;One row per (region, status) pair.
HAVING#
WHERE filters rows before grouping. HAVING filters groups after grouping. Use the right one for the right job:
SELECT region, COUNT(*) AS nFROM ordersWHERE status = 'completed' -- filters rowsGROUP BY regionHAVING COUNT(*) > 100; -- filters groupsWHERE COUNT(*) > 100 is illegal — COUNT(*) doesn’t exist before grouping. HAVING status = 'completed' works but evaluates after grouping, on more rows than necessary; always push to WHERE when the predicate references non-aggregated columns.
COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)#
This is the most-asked aggregate distinction in interviews:
SELECT COUNT(*) AS row_count, -- includes rows with all-NULL COUNT(col) AS non_null_count, -- excludes rows where col IS NULL COUNT(DISTINCT col) AS distinct_count -- excludes NULL + dedupesFROM t;Use COUNT(*) for “how many rows match my WHERE clause” — the default. Use COUNT(col) only when you specifically want non-NULL. Use COUNT(DISTINCT col) for unique counts, knowing it’s typically the most expensive (requires a sort or hash).
GROUPING SETS, ROLLUP, CUBE#
For multi-level aggregates in one query:
SELECT region, status, COUNT(*) AS nFROM ordersGROUP BY ROLLUP (region, status);-- Returns: (region, status), (region, NULL), (NULL, NULL) — subtotals + grand totalROLLUP produces hierarchical subtotals. CUBE produces every combination. GROUPING SETS lets you pick arbitrary combinations. Powerful for reports; less common in app code.
Variants#
FILTER (WHERE ...)clause — per-aggregate filter, more readable than CASE inside aggregates:SELECTCOUNT(*) FILTER (WHERE status = 'paid') AS paid_count,COUNT(*) FILTER (WHERE status = 'failed') AS failed_countFROM orders;array_agg/string_agg/json_agg— collect group rows into an array / string / JSON. Postgres has the full set.PERCENTILE_CONT/PERCENTILE_DISC— median, p95, p99. Use within aWITHIN GROUP (ORDER BY ...)clause.bool_or/bool_and— boolean aggregates; useful for “any row in this group is X”.mode() WITHIN GROUP (ORDER BY col)— most common value per group.
Trade-offs#
COUNT(*) over a billion-row table is always seconds-to-minutes. Right for low-frequency analytics, ad hoc queries, dashboards that tolerate latency. Other practical considerations:
COUNT(*)vsCOUNT(1)— identical. Every modern planner treats them the same. UseCOUNT(*); it’s clearer.COUNT(DISTINCT)is expensive — requires sorting or hashing every value in the group. For approximate counts at scale, use HyperLogLog (approx_count_distinctin some engines,hllextension in Postgres).- Aggregate over a JOIN that multiplies rows —
COUNT(o.id)joined toorder_itemscounts items per group, not orders. Be precise about which column you’re counting; aggregate before joining if needed. - Indexes for GROUP BY — an index on the grouping column lets the engine do a sorted “group aggregate”; otherwise it builds a hash table. The choice is the planner’s, based on size and selectivity.
Common pitfalls#
- Selecting non-aggregated non-grouped columns. Postgres rejects the query; MySQL (pre-
ONLY_FULL_GROUP_BY) silently picks a random row. Always group every selected non-aggregate column. HAVINGfor row filters. Works, but evaluates post-aggregation. Push row filters toWHEREfor performance.- Forgetting that aggregates skip NULL.
AVG(col)averages non-NULL values; if you want NULL to count as 0, useCOALESCE(col, 0)inside the aggregate. SUMof an empty set is NULL, not 0. Wrap inCOALESCE(SUM(...), 0)for arithmetic that needs a zero.COUNT(col)when you meantCOUNT(*). A column that’s mostly non-NULL behaves the same, then one day someone adds a NULL row and the count drops by one.- Multiplying rows via JOIN before aggregating.
users LEFT JOIN ordersthenSUM(users.lifetime_value)overcounts users by their order count. Aggregate orders first (CTE or subquery), then join. - GROUP BY on a TIMESTAMPTZ at the wrong granularity — grouping by raw timestamps gives one row per microsecond; you want
date_trunc('day', created_at)for a per-day rollup. - Mistaking COUNT for SUM in conditional aggregates.
COUNT(status = 'paid')counts rows where the expression is non-NULL — which is every row. UseSUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END)orCOUNT(*) FILTER (WHERE status = 'paid').
The aggregation patterns to memorize
Five idioms cover 80% of analytics SQL. (1) Group-and-count: SELECT key, COUNT(*) FROM t GROUP BY key. (2) Conditional count via FILTER: COUNT(*) FILTER (WHERE pred). (3) Per-group latest: SELECT key, MAX(created_at) FROM t GROUP BY key, then re-join to fetch the rest of the row. (4) Top-N per group: window function (ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC)) — better than the GROUP-BY-and-rejoin dance. (5) Percentile: PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) OVER (). The first four show up daily; the fifth is the answer to “what’s the p95 latency”.
Related building blocks#