Aggregates, GROUP BY, HAVING — DBMS

What it is#

Aggregate functions collapse a set of rows into a single value: COUNT, SUM, AVG, MIN, MAX, plus a handful of statistical ones (STDDEV, VARIANCE, PERCENTILE_CONT). On their own they produce one row per query. Pair them with GROUP BY and they produce one row per group. HAVING filters those groups, post-aggregation, the way WHERE filters individual rows pre-aggregation.

The rule that catches new users: every column in the SELECT list either appears in the GROUP BY clause or is wrapped in an aggregate. Otherwise SQL has no way to choose which row’s value to show (the value of name for a group of 10 users — whose name?). Some engines (MySQL with ONLY_FULL_GROUP_BY off) silently pick any value; standard SQL and Postgres reject the query.

When to use it#

Counts and totals — SELECT COUNT(*) FROM orders WHERE ... is the universal “how many” query.
Per-group summaries — sales by region, signups by day, error rate by endpoint. The bread-and-butter analytics query.
Top-N per group with aggregates — when you want each group’s max / latest / sum.
De-duplication via aggregation — SELECT user_id, MAX(created_at) FROM events GROUP BY user_id collapses many rows per user to one.
Post-aggregate filtering — HAVING COUNT(*) > 10 to find groups meeting a size threshold.

Reach for window functions (separate writeup) when you want per-row aggregates without collapsing — “running total”, “rank within group”, “moving average”.

How it works#

The five basics#

SELECT
  COUNT(*)              AS total_rows,    -- counts rows, including NULLs
  COUNT(email)          AS with_email,    -- counts non-NULL emails
  COUNT(DISTINCT email) AS unique_emails, -- counts distinct non-NULL
  SUM(amount_cents)     AS total_cents,   -- sums non-NULL values; NULL if no rows
  AVG(amount_cents)     AS avg_cents,     -- average of non-NULL; NULL if no rows
  MIN(created_at)       AS first_seen,
  MAX(created_at)       AS last_seen
FROM orders;

NULL handling is uniform: aggregates skip NULLs in the input (except COUNT(*) which counts rows). An empty input returns NULL for SUM / AVG / MIN / MAX and 0 for COUNT. Use COALESCE(SUM(x), 0) if you want zero for empty groups.

GROUP BY#

SELECT region, COUNT(*) AS n, SUM(total_cents) AS revenue
FROM orders
GROUP BY region;

One row per distinct region. Every non-aggregate column in SELECT must appear in GROUP BY (or be functionally determined by what’s in GROUP BY — e.g., grouping by users.id lets you also select users.name since name is functionally dependent on id; Postgres recognizes this).

Multi-column grouping:

SELECT region, status, COUNT(*) AS n
FROM orders
GROUP BY region, status
ORDER BY region, n DESC;

One row per (region, status) pair.

HAVING#

WHERE filters rows before grouping. HAVING filters groups after grouping. Use the right one for the right job:

SELECT region, COUNT(*) AS n
FROM orders
WHERE status = 'completed'      -- filters rows
GROUP BY region
HAVING COUNT(*) > 100;           -- filters groups

WHERE COUNT(*) > 100 is illegal — COUNT(*) doesn’t exist before grouping. HAVING status = 'completed' works but evaluates after grouping, on more rows than necessary; always push to WHERE when the predicate references non-aggregated columns.

COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)#

This is the most-asked aggregate distinction in interviews:

SELECT
  COUNT(*)              AS row_count,        -- includes rows with all-NULL
  COUNT(col)            AS non_null_count,   -- excludes rows where col IS NULL
  COUNT(DISTINCT col)   AS distinct_count    -- excludes NULL + dedupes
FROM t;

Use COUNT(*) for “how many rows match my WHERE clause” — the default. Use COUNT(col) only when you specifically want non-NULL. Use COUNT(DISTINCT col) for unique counts, knowing it’s typically the most expensive (requires a sort or hash).

GROUPING SETS, ROLLUP, CUBE#

For multi-level aggregates in one query:

SELECT region, status, COUNT(*) AS n
FROM orders
GROUP BY ROLLUP (region, status);
-- Returns: (region, status), (region, NULL), (NULL, NULL) — subtotals + grand total

ROLLUP produces hierarchical subtotals. CUBE produces every combination. GROUPING SETS lets you pick arbitrary combinations. Powerful for reports; less common in app code.

Variants#

FILTER (WHERE ...) clause — per-aggregate filter, more readable than CASE inside aggregates:

SELECT
  COUNT(*) FILTER (WHERE status = 'paid') AS paid_count,
  COUNT(*) FILTER (WHERE status = 'failed') AS failed_count
FROM orders;

array_agg / string_agg / json_agg — collect group rows into an array / string / JSON. Postgres has the full set.
PERCENTILE_CONT / PERCENTILE_DISC — median, p95, p99. Use within a WITHIN GROUP (ORDER BY ...) clause.
bool_or / bool_and — boolean aggregates; useful for “any row in this group is X”.
mode() WITHIN GROUP (ORDER BY col) — most common value per group.

Trade-offs#

Aggregate at query time — fresh, accurate, no extra storage. Cost: scales with the rows you’re aggregating; a COUNT(*) over a billion-row table is always seconds-to-minutes. Right for low-frequency analytics, ad hoc queries, dashboards that tolerate latency.

Materialized aggregates (counter tables, materialized views, rollups) — precomputed, sub-millisecond reads, stale by the refresh latency. Cost: storage, maintenance, the staleness window. Right for high-traffic dashboards, real-time counters, anything that needs O(1) reads.

Other practical considerations:

COUNT(*) vs COUNT(1) — identical. Every modern planner treats them the same. Use COUNT(*); it’s clearer.
COUNT(DISTINCT) is expensive — requires sorting or hashing every value in the group. For approximate counts at scale, use HyperLogLog (approx_count_distinct in some engines, hll extension in Postgres).
Aggregate over a JOIN that multiplies rows — COUNT(o.id) joined to order_items counts items per group, not orders. Be precise about which column you’re counting; aggregate before joining if needed.
Indexes for GROUP BY — an index on the grouping column lets the engine do a sorted “group aggregate”; otherwise it builds a hash table. The choice is the planner’s, based on size and selectivity.

Common pitfalls#

Selecting non-aggregated non-grouped columns. Postgres rejects the query; MySQL (pre-ONLY_FULL_GROUP_BY) silently picks a random row. Always group every selected non-aggregate column.
HAVING for row filters. Works, but evaluates post-aggregation. Push row filters to WHERE for performance.
Forgetting that aggregates skip NULL. AVG(col) averages non-NULL values; if you want NULL to count as 0, use COALESCE(col, 0) inside the aggregate.
SUM of an empty set is NULL, not 0. Wrap in COALESCE(SUM(...), 0) for arithmetic that needs a zero.
COUNT(col) when you meant COUNT(*). A column that’s mostly non-NULL behaves the same, then one day someone adds a NULL row and the count drops by one.
Multiplying rows via JOIN before aggregating. users LEFT JOIN orders then SUM(users.lifetime_value) overcounts users by their order count. Aggregate orders first (CTE or subquery), then join.
GROUP BY on a TIMESTAMPTZ at the wrong granularity — grouping by raw timestamps gives one row per microsecond; you want date_trunc('day', created_at) for a per-day rollup.
Mistaking COUNT for SUM in conditional aggregates. COUNT(status = 'paid') counts rows where the expression is non-NULL — which is every row. Use SUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END) or COUNT(*) FILTER (WHERE status = 'paid').

The aggregation patterns to memorize

Five idioms cover 80% of analytics SQL. (1) Group-and-count: SELECT key, COUNT(*) FROM t GROUP BY key. (2) Conditional count via FILTER: COUNT(*) FILTER (WHERE pred). (3) Per-group latest: SELECT key, MAX(created_at) FROM t GROUP BY key, then re-join to fetch the rest of the row. (4) Top-N per group: window function (ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC)) — better than the GROUP-BY-and-rejoin dance. (5) Percentile: PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) OVER (). The first four show up daily; the fifth is the answer to “what’s the p95 latency”.