Redshift aggregate grouped data by date range

Question

I have following table that contains quantities of items per day.

ID   Date        Item   Count
-----------------------------
1    2022-01-01  Milk   10
2    2022-01-11  Milk   20
3    2022-01-12  Milk   10
4    2022-01-15  Milk   12
5    2022-01-16  Milk   10
6    2022-01-02  Bread  20
7    2022-01-03  Bread  22
8    2022-01-05  Bread  24
9    2022-01-08  Bread  20
10   2022-01-12  Bread  10

I want to aggregate (sum, avg, ...) the quantity per item for the last 7 days (or 14, 28 days). The expected outcome would look like this table.

ID   Date        Item   Count  Sum_7d
-------------------------------------
1    2022-01-01  Milk   10     10
2    2022-01-11  Milk   20     20
3    2022-01-12  Milk   10     30
4    2022-01-15  Milk   12     42
5    2022-01-16  Milk   10     52
6    2022-01-02  Bread  20     20
7    2022-01-03  Bread  22     42
8    2022-01-05  Bread  24     66
9    2022-01-08  Bread  10     56
10   2022-01-12  Bread  10     20

My first approach was using Redshift window functions like this

SELECT *, SUM(Count) OVER (PARTITION BY Item
                           ORDER BY Date
                           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS Sum_7d
FROM my_table

but it does not give the expected results because there are missing dates and I could not figure out how to put a condition on the time range.

My fallback solution is a cross product, but that's not desirable because it is inefficient for large data.

SELECT l.Date, l.Item, l.Count, sum(r.Count) as Sum_7d
FROM my_table l,
     my_table r
WHERE l.Date - r.Date < 7
  AND l.Date - r.Date >= 0
  AND l.Item = r.Item
GROUP BY 1, 2, 3

Is there any efficient and concise way to do such an aggregation on date ranges in Redshift?

score 0 · Answer 1 · edited Nov 22 '22 at 07:58

This is a missing data problem and a common way to "fill in the blanks" is with a cross join. You correctly point out that this can get very expensive because the cross joining (usually) massively expands the data being worked upon AND because Redshift isn't great at creating data. But you do have to fill in the missing data. The best way I have found is to create the (near) minimum data set that will complete the data then UNION this data to the original table. The code below performs this path.

There is a way to do this w/o adding rows but the SQL is large, inflexible, error prone and just plain ugly. You could create new columns (date and count) based on LAG(6), LAG(5), LAG(4) ... and compare the date of each and use the count if the date is truly in range. If you want to sum a different date look-back you need to add columns and things get uglier. Also this will only be faster that the code below for certain circumstances (very few repeats of item). It just replaces making new data in rows for making new data in columns. So don't go this way unless absolutely necessary.

Now to what I think will work for you. You need a dummy row for every date and item combination that doesn't already exist. This is the minimal set of new data that will make you window function work. In reality I make all the combinations of data and item and merge these with the existing - a slight compromise from the ideal.

First let's set up your data. I changed some names as using reserved words for column names is not ideal.

create table test (ID int, dt date, Item varchar(16), Cnt int);

insert into test values
(1, '2022-01-01', 'Milk', 10),
(2, '2022-01-11', 'Milk', 20),
(3, '2022-01-12', 'Milk', 10),
(4, '2022-01-15', 'Milk', 12),
(5, '2022-01-16', 'Milk', 10),
(6, '2022-01-02', 'Bread', 20),
(7, '2022-01-03', 'Bread', 22),
(8, '2022-01-05', 'Bread', 24),
(9, '2022-01-08', 'Bread', 20),
(10, '2022-01-12', 'Bread', 10);

The SQL for generating what you want is:

with recursive dates(dt) as
( select min(dt) as dt
  from test
  union all
  select dt + 1
  from dates d
  where d.dt <= current_date 
  )
select * 
from (
  SELECT *, SUM(Cnt) OVER (PARTITION BY Item
                           ORDER BY Dt
                           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS Sum_7d
  FROM (
    select min(id) as id, dt, item, sum(cnt) as cnt 
    from (
      select * 
      from test
      union all 
      select NULL as id, dt, item, NULL as cnt
      from ( select distinct item from test) as items
      cross join dates
      ) as all_item_dates
    group by dt, item
  ) as grouped
) as windowed
where id is not null
order by id, dt;

Quickly here what this does.

A recursive CTE creates the date range in question (from min date in table until today).
These dates are cross joined with the distinct list of items resulting in every date for every unique item.
This is UNIONed to the table so all data exists.
GROUP By is used to merge real data rows with dummy rows for the same item and date.
Your window function is run.
A surrounding SELECT has a WHERE clause to remove any dummy rows.

As you will note this does use a cross-join but on a much reduced set of data (just the unique item list). As long as this distinct list of items is much shorter than the size of the table (very likely) then this will perform much faster than other techniques. Also if this is the kind of data you have you might find interest in this post I wrote - http://wad-design.s3-website-us-east-1.amazonaws.com/sql_limits_wp_2.html

Redshift aggregate grouped data by date range

1 Answers1