0

Given a table containing 1 line per day dt per product_id and the turnover generated this day, how to compute efficiently a column turnover_7day containing the turnover generated by this product over the last 7 days?

I found a simple query working as expected, but it's very slow and I'm trying to run the query for millions of products over the course of several years.

SQL Fiddle
(even if the Fiddle is Postgresql, IRL I'm trying to do this on Snowflake ; I doubt there are features available in Snowflake that can completely change a potential answer to this post)

The dataset:

TABLE turnover_per_day:
| product_id | product_name |         dt | turnover |
|------------|--------------|------------|----------|
|          1 |          PS5 | 2021-10-22 |       85 |
|          1 |          PS5 | 2021-10-27 |      100 |
|          1 |          PS5 | 2021-11-01 |      110 |
|          1 |          PS5 | 2021-11-05 |      150 |
|          2 |         XBOX | 2021-11-02 |       10 |
|          2 |         XBOX | 2021-11-03 |       15 |
|          2 |         XBOX | 2021-11-04 |       13 |
|          2 |         XBOX | 2021-11-05 |       11 |

Method 1: subquery in the SELECT statement (producing the expected result, very inefficient):

I'm using here a subquery on the source table in order to recompute the turnover. It seems really inefficient but at least it's easy to understand.

Query:

SELECT
    t1.product_id
    ,t1.product_name
    ,t1.turnover
    ,t1.dt
    ,(
        SELECT SUM(turnover) FROM turnover_per_day t2
        WHERE (t2.dt BETWEEN t1.dt - interval '6 day' AND t1.dt) and t1.product_id=t2.product_id
    ) as turnover_7day
FROM turnover_per_day as t1
order by product_id, t1.dt

Result (as intended):

| product_id | product_name | turnover |         dt | turnover_7day |
|------------|--------------|----------|------------|---------------|
|          1 |          PS5 |       85 | 2021-10-22 |            85 |
|          1 |          PS5 |      100 | 2021-10-27 |           185 |
|          1 |          PS5 |      110 | 2021-11-01 |           210 |
|          1 |          PS5 |      150 | 2021-11-05 |           260 |
|          2 |         XBOX |       10 | 2021-11-02 |            10 |
|          2 |         XBOX |       15 | 2021-11-03 |            25 |
|          2 |         XBOX |       13 | 2021-11-04 |            38 |
|          2 |         XBOX |       11 | 2021-11-05 |            49 |

Method 2: trying to reproduce this answer (but failing)

Here I'm trying to use window function in order to speed up the computation. I tried to add PARTITION BY product_id on the answer linked above but it doesn't work out as it should. My thinking is because of the LEFT JOIN values of the column product_id are NULL and when taking 6 previous ROW, it 'deletes' the NULL row hence summing over more than 6 days.

Query:

with days as ( -- generate a calendar without gap 
  SELECT date_trunc('day', d)::date as day
  FROM generate_series(CURRENT_DATE-15, CURRENT_DATE, '1 day'::interval) d
 )
select
    days.day
    ,t1.product_id
    ,t1.product_name
    ,t1.turnover
    ,t1.dt
    ,SUM(t1.turnover) OVER (PARTITION BY t1.product_id ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS turnover_7day

FROM days
LEFT JOIN turnover_per_day as t1
ON days.day = t1.dt
--where t1.product_id is not null
order by product_id, dt

Result (wrong):

|        day | product_id | product_name | turnover |         dt | turnover_7day |
|------------|------------|--------------|----------|------------|---------------|
| 2021-10-22 |          1 |          PS5 |       85 | 2021-10-22 |            85 |
| 2021-10-27 |          1 |          PS5 |      100 | 2021-10-27 |           185 |
| 2021-11-01 |          1 |          PS5 |      110 | 2021-11-01 |           295 |
| 2021-11-05 |          1 |          PS5 |      150 | 2021-11-05 |           445 |
| 2021-11-02 |          2 |         XBOX |       10 | 2021-11-02 |            10 |
| 2021-11-03 |          2 |         XBOX |       15 | 2021-11-03 |            25 |
| 2021-11-04 |          2 |         XBOX |       13 | 2021-11-04 |            38 |
| 2021-11-05 |          2 |         XBOX |       11 | 2021-11-05 |            49 |
| 2021-10-31 |     (null) |       (null) |   (null) |     (null) |        (null) |
| 2021-10-29 |     (null) |       (null) |   (null) |     (null) |        (null) |
| 2021-10-23 |     (null) |       (null) |   (null) |     (null) |        (null) |
| 2021-10-24 |     (null) |       (null) |   (null) |     (null) |        (null) |
| 2021-10-25 |     (null) |       (null) |   (null) |     (null) |        (null) |
| 2021-10-26 |     (null) |       (null) |   (null) |     (null) |        (null) |
| 2021-10-28 |     (null) |       (null) |   (null) |     (null) |        (null) |
| 2021-10-21 |     (null) |       (null) |   (null) |     (null) |        (null) |
| 2021-10-30 |     (null) |       (null) |   (null) |     (null) |        (null) |

My questions are:

  1. How to modify the Method 2 in order to make it work?
  2. Are there any other less data/computation-intensive way to calculate this rolling sum (group by product_id)?
politinsa
  • 3,480
  • 1
  • 11
  • 36

1 Answers1

0

Instead of using ROWS PRECEDING, which is counted in number of rows, you need to use RANGE PRECEDING, which is counted in units of the ORDER BY column.

select
    t1.product_id
    ,t1.product_name
    ,t1.turnover
    ,t1.dt
    ,SUM(t1.turnover) OVER (PARTITION BY t1.product_id ORDER BY dt RANGE BETWEEN '6 days' PRECEDING AND CURRENT ROW) AS turnover_7day
FROM turnover_per_day as t1
order by product_id, dt;

The ROW PRECEDING method could be made to work if you created the dense series of days for each product, which you are not doing in your code. But I doubt it would perform well.

jjanes
  • 37,812
  • 5
  • 27
  • 34
  • 1
    Neither Snowflake nor PostgreSQL have RANGE BETWEEN implemented for rolling window functions https://docs.snowflake.com/en/sql-reference/functions-analytic.html#window-frame-usage-notes – politinsa Nov 06 '21 at 00:03
  • I tested it in PostgreSQL before posting and it works there. Implemented in v11, it looks like. I wouldn't know about snowflake. Your first attempt also gives reasonable performance with the right index (again, in PostgreSQL--I still wouldn't know about snowflake) – jjanes Nov 06 '21 at 01:13