
I have data with a non-uniform sampling distribution. I want to aggregate the data on a rolling/sliding basis (over the past 60 minutes).

[screenshot of the sample data: CITY, TIMESTAMP, VALUE columns]

In order to achieve an hourly average (partitioned by city), I used the following code, which worked:

SELECT *,
       AVG(VALUE) OVER (PARTITION BY CITY, DATE_AND_HOUR ORDER BY TIMESTAMP) AS HOURLY_AVG
FROM
(
    SELECT *,
           DATE_TRUNC('HOUR', TIMESTAMP) AS DATE_AND_HOUR
    FROM SAMPLE_DATA
)

However, my desired output is as follows:

[screenshot of the desired output: each row with its rolling 60-minute average per city]

I know Snowflake doesn't support RANGE in window frames, and I can't specify a fixed ROWS BETWEEN frame in a window function because my sampling distribution is non-uniform.
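For reference, this is the kind of interval-based frame I'm after; it works in engines like PostgreSQL (11+) but not in Snowflake (a sketch against my SAMPLE_DATA table):

-- PostgreSQL sketch only: Snowflake rejects RANGE with an interval offset
SELECT *,
       AVG(VALUE) OVER (PARTITION BY CITY ORDER BY TIMESTAMP
                        RANGE BETWEEN INTERVAL '1 hour' PRECEDING
                          AND CURRENT ROW) AS ROLLING_AVG
FROM SAMPLE_DATA;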

I read some potential solutions on this page, but they don't work in Snowflake: sum last n days quantity using sql window function. Essentially, it's an analogous problem.

YoungboyVBA
  • please try to avoid posting pictures of data; instead, format the data (or don't even, others can help) so people can cut'n'paste into the DB. Typing out pictures is slow. – Simeon Pilgrim Jun 23 '22 at 23:43

2 Answers


You can solve this with a self-join:

with data as (
    select *
    from temp_fh_wikipedia.public.wikipedia_2020
    where title in ('San_Francisco', 'Los_Angeles')
    and wiki='en'
    and datehour > '2020-10-13'
)

-- self-join: each row of a picks up every b row for the same title
-- whose datehour falls within the previous 5 hours (inclusive)
select a.title, a.datehour, a.views, avg(b.views) avg_previous_5h
from data a
join (
    select *
    from data
) b
on a.title=b.title
and b.datehour between timestampadd(hour, -5, a.datehour) and a.datehour

group by 1, 2, 3
order by 1, 2
limit 100

Just change 'hour' to 'minute' if you want the last x minutes.
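For example, against the question's SAMPLE_DATA table (assuming columns CITY, TIMESTAMP and VALUE, as pictured), a rolling 60-minute version would look roughly like:

-- sketch: same self-join pattern, over the past 60 minutes instead of 5 hours
select a.city, a.timestamp, a.value, avg(b.value) as avg_previous_60m
from sample_data a
join sample_data b
  on a.city = b.city
  and b.timestamp between timestampadd(minute, -60, a.timestamp) and a.timestamp
group by 1, 2, 3
order by 1, 2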


Felipe Hoffa
  • Thanks for this. Just one question: what's the avg_previous_5h in this example you've given? I can't see it defined anywhere in the code. And avg(b.views) would need to be a window function partitioned by title, right? – YoungboyVBA Jun 23 '22 at 20:14
  • In my case, the timestamps aren't at regular intervals. Not sure how I can write equivalent code for the column you've written as 'avg(b.views) avg_previous_5h', since Snowflake doesn't support RANGE. – YoungboyVBA Jun 23 '22 at 20:36
  • @YoungboyVBA avg_previous_5h is the average across the values of the last 5 hours. This is because the self-join only includes values that are within the last five hours (see the last clause of the JOIN ON). It has a bug: if there are no other rows inside the data window, the JOIN will drop the left-hand side and you will have missing data, thus it really should be a LEFT JOIN. The average does not need a window, as it is run across the GROUP BY, thus covers all values of the interval (except in the case of only 1 row, as noted). – Simeon Pilgrim Jun 23 '22 at 23:38
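A minimal sketch of the LEFT JOIN fix the comment describes, reusing the answer's data CTE:

-- LEFT JOIN so the left-hand row survives even if nothing else matches the window
select a.title, a.datehour, a.views, avg(b.views) as avg_previous_5h
from data a
left join data b
    on a.title = b.title
    and b.datehour between timestampadd(hour, -5, a.datehour) and a.datehour
group by 1, 2, 3
order by 1, 2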

Firstly, what you show as "average" in your example is the "sum", and your first "Shanghai" result includes a "Beijing" result.

You have two options: build a fixed-size window dataset (build partials for each minute) and then use a fixed-size window frame over that, OR self-join and just aggregate those rows (as Felipe has shown).

If you have very dense data, you might find the former more performant; if you have sparse data, the latter approach should be faster, and is definitely faster to code.

So the simple first:

with data(city, timestamp, value) as (
    select column1, try_to_timestamp(column2, 'yyyy/mm/dd hh:mi'), column3 from values 
    ('beijing', '2022/05/25 10:33', 22),
    ('beijing', '2022/05/25 10:37', 20),
    ('beijing', '2022/05/25 11:36', 29),
    ('beijing', '2022/05/26 11:36', 28),
    ('beijing', '2022/05/26 10:00', 21),
    ('shanghai', '2022/05/26 11:00', 33),
    ('shanghai', '2022/05/26 11:46', 35),
    ('shanghai', '2022/05/26 12:40', 37)
)
select a.*
    ,avg(b.value) as p60_avg
    ,count(b.value)-1 as p60_count  -- minus 1 excludes the current row from the count
    ,sum(b.value) as p60_sum
from data as a
left join data as b
    on a.city = b.city and b.timestamp between dateadd(hour, -1, a.timestamp) and a.timestamp
group by 1,2,3
order by 1,2

gives:

CITY      TIMESTAMP                  VALUE  P60_AVG  P60_COUNT  P60_SUM
beijing   2022-05-25 10:33:00.000    22     22       0          22
beijing   2022-05-25 10:37:00.000    20     21       1          42
beijing   2022-05-25 11:36:00.000    29     24.5     1          49
beijing   2022-05-26 10:00:00.000    21     21       0          21
beijing   2022-05-26 11:36:00.000    28     28       0          28
shanghai  2022-05-26 11:00:00.000    33     33       0          33
shanghai  2022-05-26 11:46:00.000    35     34       1          68
shanghai  2022-05-26 12:40:00.000    37     36       1          72

The dense version:

with data(city, timestamp, value) as (
    select column1, try_to_timestamp(column2, 'yyyy/mm/dd hh:mi'), column3 from values 
    ('beijing', '2022/05/25 10:33', 22),
    ('beijing', '2022/05/25 10:37', 20),
    ('beijing', '2022/05/25 11:36', 29),
    ('beijing', '2022/05/26 11:36', 28),
    ('beijing', '2022/05/26 10:00', 21),
    ('shanghai', '2022/05/26 11:00', 33),
    ('shanghai', '2022/05/26 11:46', 35),
    ('shanghai', '2022/05/26 12:40', 37)
), filled_time as (
    -- build a per-city "spine" with one row per minute, from each city's
    -- first timestamp up to its last (the generator caps it at 10,000 minutes)
    select city,
        dateadd(minute, row_number() over(partition by city order by null)-1, min_t) as timestamp
    from (
        select
            city, min(timestamp) as min_t, max(timestamp) as max_t
        from data
        group by 1
    ), table(generator(ROWCOUNT => 10000))
    qualify timestamp <= max_t
)
select
    ft.city
    ,ft.timestamp
    ,avg(d.value) over (partition by ft.city order by ft.timestamp ROWS BETWEEN 60 PRECEDING AND current row) as p60_avg  -- partition by city so frames never mix cities
from filled_time as ft
left join data as d
    on ft.city = d.city and ft.timestamp = d.timestamp
order by 1,2;

gives:

CITY     TIMESTAMP                P60_AVG
beijing  2022-05-25 10:33:00.000  22
beijing  2022-05-25 10:34:00.000  22
beijing  2022-05-25 10:35:00.000  22
beijing  2022-05-25 10:36:00.000  22
beijing  2022-05-25 10:37:00.000  21
beijing  2022-05-25 10:38:00.000  21
beijing  2022-05-25 10:39:00.000  21
beijing  2022-05-25 10:40:00.000  21
beijing  2022-05-25 10:41:00.000  21
beijing  2022-05-25 10:42:00.000  21
beijing  2022-05-25 10:43:00.000  21
beijing  2022-05-25 10:44:00.000  21
beijing  2022-05-25 10:45:00.000  21
beijing  2022-05-25 10:46:00.000  21
snip...

And those "extra" rows could be dumped with a QUALIFY:

select
    ft.city
    ,ft.timestamp
    ,avg(d.value) over (partition by ft.city order by ft.timestamp ROWS BETWEEN 60 PRECEDING AND current row) as p60_avg
    --,count(d.value)-1 as p60_count
    --,sum(d.value) as p60_sum
from filled_time as ft
left join data as d
    on ft.city = d.city and ft.timestamp = d.timestamp
qualify d.value is not null
order by 1,2;
Simeon Pilgrim