2

Given the following table:

CREATE TABLE cnts(
  user_id INT,
  month_d DATE,
  cnt INT
)

I want to query cumulative counts for the last 6 months of each (user_id, month_d) pair. I can do it with the following JOIN:

SELECT
  S1.month_d AS "month_d",
  S1.user_id AS "user_id",
  SUM(S2.cnt) AS "last_6_months_cnt"
FROM cnts S1
LEFT JOIN cnts S2 ON S1.user_id = S2.user_id
                 AND (S2.month_d BETWEEN (S1.month_d - INTERVAL '5 MONTH') AND S1.month_d)
GROUP BY 1, 2
ORDER BY 2, 1;

But I wonder if this can be solved with window-functions?

Sample data:

INSERT INTO cnts(user_id, month_d, cnt) VALUES 
(1, '2013-01-01', 2),
(1, '2013-04-01', 2),
(1, '2013-07-01', 2),
(1, '2013-10-01', 2),

(2, '2013-01-01', 2),
(2, '2013-04-01', 2),
(2, '2013-07-01', 2),
(2, '2013-10-01', 2)
;

Expected results (from the join above):

  month_d   | user_id | last_6_months_cnt 
------------+---------+-------------------
 2013-01-01 |       1 |                 2
 2013-04-01 |       1 |                 4
 2013-07-01 |       1 |                 4
 2013-10-01 |       1 |                 4
 2013-01-01 |       2 |                 2
 2013-04-01 |       2 |                 4
 2013-07-01 |       2 |                 4
 2013-10-01 |       2 |                 4
huy
  • 4,782
  • 6
  • 36
  • 42
  • I think you left off some of the sample data, since the only user_id that appears is user_id 1. That sample data cannot possibly produce the result table shown. – Craig Ringer Mar 26 '13 at 08:15
  • Is there *always* a row for each `(user_id, month_d)` pair within the time range of interest? Or do you assume that missing rows have a count of zero? (If there's always a row then there are more efficient ways to do it). – Craig Ringer Mar 26 '13 at 08:21
  • @CraigRinger Sorry! The second batch of insert should have user_id = 2. Updated! – huy Mar 26 '13 at 08:22
  • @CraigRinger There's no guarantee. But I'd love to know the solution when there's always rows for each pair. – huy Mar 26 '13 at 08:23
  • Unfortunately the cleanest window function based solutions to this problem would require support for `RANGE` definitions in the window frame definition, and unfortunately PostgreSQL only supports `ROWS` at this point. It's still possible but it'd likely require a join on `generate_series` and I won't be surprised if it's no better. I'll still have a play, time permitting. – Craig Ringer Mar 26 '13 at 08:25

1 Answers1

4

PostgreSQL 12 and newer

UPDATE: PostgreSQL 12 and newer now support RANGE windows.

The correct way would be to use a window over RANGE (INTERVAL '6' MONTH) PRECEDING:

demo=> SELECT month_d, user_id, 
              SUM(cnt) OVER (PARTITION BY user_id ORDER BY month_d RANGE INTERVAL '6' MONTH PRECEDING)
       FROM cnts ORDER BY 2,1;

  month_d   | user_id | sum 
------------+---------+-----
 2013-01-01 |       1 |   2
 2013-04-01 |       1 |   4
 2013-07-01 |       1 |   6
 2013-10-01 |       1 |   6
 2013-01-01 |       2 |   2
 2013-04-01 |       2 |   4
 2013-07-01 |       2 |   6
 2013-10-01 |       2 |   6
(8 rows)

PostgreSQL 11 and older

On PostgreSQL 11 or older RANGE windows were not yet supported so the query will fail:

regress=> SELECT month_d, user_id, 
          SUM(cnt) OVER (PARTITION BY user_id ORDER BY month_d RANGE INTERVAL '6' MONTH PRECEDING) 
          FROM cnts ORDER BY 2,1;
ERROR:  RANGE PRECEDING is only supported with UNBOUNDED
LINE 1: ...(cnt) OVER (PARTITION BY user_id ORDER BY month_d RANGE INTE...

Without that you're going to be back with a join over generate_series, and doing that over multiple user ids cumbersome. I suspect your self-join approach is significantly preferable to attempting to do this with a ROWS based window over sum. You'd have to cross join a generate_series of the entire date range with the set of all distinct uids, then left outer join that against the cnts table, process that with sum over a window, and then filter out rows with null counts. Needless to say, this is a more tortured way of doing things than a simple self-join.


For your sample data the following query will produce the same result you have shown above:

-- This query is totally wrong and only works because of overly simple sample data
SELECT 
  month_d, user_id, 
  SUM(cnt) OVER (PARTITION BY user_id ORDER BY month_d ROWS 1 PRECEDING)
FROM cnts
ORDER BY 2,1;

However, it's totally wrong. I'm showing it primarily to illustrate that the sample data isn't good enough for solid testing, since the results match essentially by sheer luck. None of your samples have more than two samples within a six month range. Sample data is great, but you need to think about the corner cases, just like when you're writing unit tests. You should have uids that don't start and stop at the same dates, with different counts, etc.

Craig Ringer
  • 307,061
  • 76
  • 688
  • 778
  • Thanks a lot for the detailed attempt Craig. And you're right on the overly simple sample. Appreciate the advice! I assume if I actually have data for every `(user_id, month_d)` I can use `ROWS 5 PRECEDING` to solve this? – huy Mar 26 '13 at 08:57
  • @huy Yes, but missing values will result in silently wrong results. – Craig Ringer Mar 26 '13 at 09:03
  • 1
    @huy BTW, is this question is really the underlying problem "how do I do this faster", try posting that as a new question with a better sample data set and some `EXPLAIN ANALYZE` results from your real queries plus the other info mentioned on [tag:postgresql-performance] tag wiki. – Craig Ringer Mar 26 '13 at 11:52
  • 1
    As an FYI since this was the top Google Search for me when I searched, PostgreSQL (I have 11) does now support bounded `RANGE` windows. So the solution posted above will work. – Vincent Aug 26 '20 at 16:54
  • FYI, I believe this will work even on PostgreSQL 11, did you try it on 11 and fail? – Vincent Sep 01 '20 at 16:52