Group Bucket — DBMS · Engineering Playbook

What it is#

The group bucket pattern turns a continuous numeric column into a small set of named categories, then aggregates per category. The grouping key is a CASE expression, not a bare column.

SELECT   CASE
           WHEN age < 18  THEN 'minor'
           WHEN age < 65  THEN 'adult'
           ELSE                'senior'
         END AS age_band,
         COUNT(*) AS users
FROM     users
GROUP BY 1
ORDER BY MIN(age);

It is the Tally Count pattern with a derived grouping column. Reach for it when the raw values are too many to tally usefully (every distinct age, every distinct price) but the ranges are interesting.

The pattern shows up everywhere histograms do: latency distribution into p50/p90/p99 buckets, order totals into tiers, ages into bands, response sizes into K/M/G ranges, churn risk into low/medium/high.

-- Order amount tiers
SELECT   CASE
           WHEN amount <    10 THEN '<$10'
           WHEN amount <   100 THEN '$10-$100'
           WHEN amount <  1000 THEN '$100-$1000'
           ELSE                     '$1000+'
         END AS tier,
         COUNT(*) AS orders,
         SUM(amount) AS revenue
FROM     orders
GROUP BY 1
ORDER BY MIN(amount);

When to use it#

Use group bucket when:

The column is continuous (numeric, float, date, interval) or has high cardinality discrete values you want to collapse (every individual age, every product price).
The bucket boundaries are business-meaningful (age bands, price tiers, latency SLOs) — not arbitrary equal-count splits.
The number of buckets is small (3–10). Above that, prefer a histogram with a numeric width or a window-function bucketing helper.

If the buckets need to be equal-count (each containing the same number of rows) rather than business-defined ranges, use NTILE(n) OVER (ORDER BY col) instead — see the Window functions writeup for an example. If the boundaries are equal-width and you have many of them, use WIDTH_BUCKET() (Postgres, Oracle) or floor-divide arithmetic.

How it works#

The pattern has four moving parts.

A CASE WHEN expression that maps each row’s continuous value to a discrete label.
An ORDER BY 1 or ORDER BY MIN(col) clause so the buckets come back in their natural order — alphabetical bucket labels rarely match the numeric order.
Per-bucket aggregates — usually COUNT(*), often paired with SUM, AVG, or MIN/MAX of the underlying column.
A catch-all ELSE branch so no row falls outside the defined buckets.

SELECT   CASE
           WHEN duration_ms <    50 THEN 'fast'
           WHEN duration_ms <   200 THEN 'normal'
           WHEN duration_ms <  1000 THEN 'slow'
           ELSE                        'very_slow'
         END AS latency_bucket,
         COUNT(*)         AS requests,
         AVG(duration_ms) AS avg_ms,
         MIN(duration_ms) AS min_ms,
         MAX(duration_ms) AS max_ms
FROM     requests
WHERE    received_at >= NOW() - INTERVAL '1 hour'
GROUP BY 1
ORDER BY MIN(duration_ms);

Why ORDER BY MIN(col) is a useful trick. Bucket labels like '$10-$100', '$1000+', '<$10' sort wrong alphabetically — the dollar sign and digits don’t produce the order you want. ORDER BY MIN(amount) orders by the smallest underlying value in each bucket, which is always the correct order. Alternatives: add a numeric sort_key column to the CASE and sort by that, or rely on ORDER BY 1 if your labels happen to sort right ('a-low', 'b-med', 'c-high').

Variants#

Equal-width buckets with arithmetic. When the boundaries are evenly spaced, integer division produces the bucket number directly.

-- Buckets of width 10: 0-9, 10-19, 20-29, ...
SELECT   (age / 10) * 10 AS age_decade,
         COUNT(*) AS users
FROM     users
GROUP BY 1
ORDER BY 1;

WIDTH_BUCKET() (Postgres, Oracle). Built-in for equal-width bucketing across a range. Returns 0 for values below the range and count + 1 for values above.

-- 10 equal-width buckets from 0 to 1000
SELECT   WIDTH_BUCKET(amount, 0, 1000, 10) AS bucket,
         COUNT(*) AS orders
FROM     orders
GROUP BY 1
ORDER BY 1;

NTILE(n) for equal-count buckets. Splits the ordered rows into n groups of as-equal-as-possible size — the quantile approach.

-- Quartile each customer by lifetime spend
SELECT   customer_id,
         total_spend,
         NTILE(4) OVER (ORDER BY total_spend) AS quartile
FROM     customer_spend;

PERCENTILE_CONT and PERCENTILE_DISC. Compute exact quantile boundaries, then bucket against them in a second pass.

-- Postgres / Oracle / SQL Server 2012+
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_ms) AS p50
FROM requests;

In BigQuery: APPROX_QUANTILES(duration_ms, 100) returns 101 boundaries. In MySQL: no built-in, simulate with ROW_NUMBER + count or a UDF.

Bucket by date arithmetic. Time buckets are group bucket with DATE_TRUNC (Postgres, BigQuery) or DATE_FORMAT (MySQL).

-- Postgres
SELECT   DATE_TRUNC('week', created_at) AS week, COUNT(*) AS signups
FROM     users
GROUP BY 1
ORDER BY 1;

-- MySQL
SELECT   DATE_FORMAT(created_at, '%Y-%u') AS week_iso, COUNT(*) AS signups
FROM     users
GROUP BY 1
ORDER BY 1;

Multi-dimensional buckets. Bucket two columns at once for a 2D histogram.

SELECT   age_band, plan,
         COUNT(*) AS users
FROM     (
  SELECT  CASE WHEN age < 25 THEN '18-24'
               WHEN age < 35 THEN '25-34'
               WHEN age < 50 THEN '35-49'
               ELSE               '50+'
          END AS age_band,
          plan
  FROM    users
) sub
GROUP BY age_band, plan
ORDER BY age_band, plan;

Common pitfalls#

Forgetting the catch-all ELSE. A CASE expression without ELSE defaults to ELSE NULL. Any row that doesn’t match a WHEN becomes part of the NULL bucket, which is easy to miss when you eyeball the result.

-- BUG: ages >= 65 quietly become NULL, hidden in the result
CASE WHEN age < 18 THEN 'minor'
     WHEN age < 65 THEN 'adult'
END

Always end with ELSE 'unknown' (or ELSE 'senior', matching your bands). When debugging an unexpected NULL bucket, look for a missing ELSE.

Off-by-one on boundaries. WHEN age <= 17 THEN 'minor' vs WHEN age < 18 THEN 'minor' mean the same thing only when age is an integer. For continuous types (float, decimal), age <= 17 excludes 17.5 from “minor”. Stick to strict-less < with the exclusive upper bound; it composes more cleanly when boundaries are adjusted.

Half-open intervals (preferred):

WHEN x <  10  THEN 'A'
WHEN x <  20  THEN 'B'
WHEN x <  30  THEN 'C'
ELSE              'D'

Each bucket is [lower, upper). No gaps, no overlaps, regardless of whether x is integer or float.

Closed-closed intervals (error-prone):

WHEN x BETWEEN  0 AND 10 THEN 'A'
WHEN x BETWEEN 10 AND 20 THEN 'B'  -- overlaps at 10
WHEN x BETWEEN 20 AND 30 THEN 'C'

BETWEEN is inclusive on both sides. The WHEN clauses evaluate top-to-bottom, so x = 10 becomes 'A', not 'B' — but the intent is unclear and the bug surfaces on the first refactor.

Sort order on bucket labels. As covered above, '$10-$100' and '$1000+' sort wrong alphabetically. Use ORDER BY MIN(col) or add a sort_key.

NULL in the bucketed column. A row with age IS NULL matches no WHEN predicate (every comparison with NULL is NULL, not TRUE), so it falls through to ELSE. If ELSE is 'senior', your “senior” bucket secretly includes everyone with missing age. Add an explicit WHEN age IS NULL THEN 'unknown' at the top.

Bucket boundary drift across queries. If two team members write two reports with <25 / 25-34 / 35+ vs <25 / 25-35 / 35+, the “25-34” and “25-35” buckets disagree on row age = 35. Encode the boundaries once in a CTE or a view and reference it everywhere.

WITH bucketed AS (
  SELECT *,
         CASE WHEN age < 25 THEN '18-24'
              WHEN age < 35 THEN '25-34'
              WHEN age < 50 THEN '35-49'
              ELSE              '50+'
         END AS age_band
  FROM users
)
SELECT age_band, COUNT(*) FROM bucketed GROUP BY age_band;

Performance: indexes don’t help bucketed grouping. A B-tree on age cannot satisfy GROUP BY age_band because the index is on the raw value, not the derived label. On large tables, materialise the bucket label as a generated column and index it, or denormalise it into the table.

Practice problem#

You have an orders table:

CREATE TABLE orders (
  id          BIGSERIAL PRIMARY KEY,
  user_id     BIGINT  NOT NULL,
  amount      NUMERIC NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL
);

Bucket order totals into four tiers — under $10,$ 10 to $100,$ 100 to $1000, and$ 1000 or more — and return the count of orders and total revenue per tier, ordered from cheapest tier to most expensive.

Solution

SELECT   CASE
           WHEN amount <    10 THEN '<$10'
           WHEN amount <   100 THEN '$10-$100'
           WHEN amount <  1000 THEN '$100-$1000'
           ELSE                     '$1000+'
         END AS tier,
         COUNT(*)  AS orders,
         SUM(amount) AS revenue
FROM     orders
GROUP BY 1
ORDER BY MIN(amount);

Half-open intervals so the boundaries are unambiguous. ORDER BY MIN(amount) orders the buckets by their smallest underlying value, since the labels themselves do not sort correctly alphabetically.

Tally Count — the same skeleton with a bare grouping column instead of a CASE.
Rolling Totals — for moving averages and cumulative sums over time buckets.
Aggregates, GROUP BY, HAVING — the building block underneath.
Window functions — for NTILE, PERCENTILE_CONT, and equal-count bucketing.
SELECT and the WHERE clause — for the pre-bucket filtering that often pairs with this pattern.