Given a table containing 1 line per day dt
per product_id
and the turnover
generated this day, how to compute efficiently a column turnover_7day
containing the turnover generated by this product over the last 7 days?
I found a simple query working as expected, but it's very slow and I'm trying to run the query for millions of products over the course of several years.
SQL Fiddle
(even if the Fiddle is Postgresql, IRL I'm trying to do this on Snowflake ; I doubt there are features available in Snowflake that can completely change a potential answer to this post)
The dataset:
TABLE turnover_per_day:
| product_id | product_name | dt | turnover |
|------------|--------------|------------|----------|
| 1 | PS5 | 2021-10-22 | 85 |
| 1 | PS5 | 2021-10-27 | 100 |
| 1 | PS5 | 2021-11-01 | 110 |
| 1 | PS5 | 2021-11-05 | 150 |
| 2 | XBOX | 2021-11-02 | 10 |
| 2 | XBOX | 2021-11-03 | 15 |
| 2 | XBOX | 2021-11-04 | 13 |
| 2 | XBOX | 2021-11-05 | 11 |
Method 1: subquery in the SELECT statement (producing the expected result, very inefficient):
I'm using here a subquery on the source table in order to recompute the turnover. It seems really inefficient but at least it's easy to understand.
Query:
SELECT
t1.product_id
,t1.product_name
,t1.turnover
,t1.dt
,(
SELECT SUM(turnover) FROM turnover_per_day t2
WHERE (t2.dt BETWEEN t1.dt - interval '6 day' AND t1.dt) and t1.product_id=t2.product_id
) as turnover_7day
FROM turnover_per_day as t1
order by product_id, t1.dt
Result (as intended):
| product_id | product_name | turnover | dt | turnover_7day |
|------------|--------------|----------|------------|---------------|
| 1 | PS5 | 85 | 2021-10-22 | 85 |
| 1 | PS5 | 100 | 2021-10-27 | 185 |
| 1 | PS5 | 110 | 2021-11-01 | 210 |
| 1 | PS5 | 150 | 2021-11-05 | 260 |
| 2 | XBOX | 10 | 2021-11-02 | 10 |
| 2 | XBOX | 15 | 2021-11-03 | 25 |
| 2 | XBOX | 13 | 2021-11-04 | 38 |
| 2 | XBOX | 11 | 2021-11-05 | 49 |
Method 2: trying to reproduce this answer (but failing)
Here I'm trying to use window function in order to speed up the computation.
I tried to add PARTITION BY product_id
on the answer linked above but it doesn't work out as it should.
My thinking is because of the LEFT JOIN
values of the column product_id
are NULL
and when taking 6 previous ROW, it 'deletes' the NULL row hence summing over more than 6 days.
Query:
with days as ( -- generate a calendar without gap
SELECT date_trunc('day', d)::date as day
FROM generate_series(CURRENT_DATE-15, CURRENT_DATE, '1 day'::interval) d
)
select
days.day
,t1.product_id
,t1.product_name
,t1.turnover
,t1.dt
,SUM(t1.turnover) OVER (PARTITION BY t1.product_id ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS turnover_7day
FROM days
LEFT JOIN turnover_per_day as t1
ON days.day = t1.dt
--where t1.product_id is not null
order by product_id, dt
Result (wrong):
| day | product_id | product_name | turnover | dt | turnover_7day |
|------------|------------|--------------|----------|------------|---------------|
| 2021-10-22 | 1 | PS5 | 85 | 2021-10-22 | 85 |
| 2021-10-27 | 1 | PS5 | 100 | 2021-10-27 | 185 |
| 2021-11-01 | 1 | PS5 | 110 | 2021-11-01 | 295 |
| 2021-11-05 | 1 | PS5 | 150 | 2021-11-05 | 445 |
| 2021-11-02 | 2 | XBOX | 10 | 2021-11-02 | 10 |
| 2021-11-03 | 2 | XBOX | 15 | 2021-11-03 | 25 |
| 2021-11-04 | 2 | XBOX | 13 | 2021-11-04 | 38 |
| 2021-11-05 | 2 | XBOX | 11 | 2021-11-05 | 49 |
| 2021-10-31 | (null) | (null) | (null) | (null) | (null) |
| 2021-10-29 | (null) | (null) | (null) | (null) | (null) |
| 2021-10-23 | (null) | (null) | (null) | (null) | (null) |
| 2021-10-24 | (null) | (null) | (null) | (null) | (null) |
| 2021-10-25 | (null) | (null) | (null) | (null) | (null) |
| 2021-10-26 | (null) | (null) | (null) | (null) | (null) |
| 2021-10-28 | (null) | (null) | (null) | (null) | (null) |
| 2021-10-21 | (null) | (null) | (null) | (null) | (null) |
| 2021-10-30 | (null) | (null) | (null) | (null) | (null) |
My questions are:
- How to modify the Method 2 in order to make it work?
- Are there any other less data/computation-intensive way to calculate this rolling sum (group by product_id)?