Use condition in partition by window in SnowFlake

Question

Some what related: Filtering a Query based on a Date and Window function in Snowflake

I need to create a query that count the number of occurance of an id in a -+ 90 days window, similar to this but as a window function, is that possible?

WITH fake_data(id, DATE) as (
    SELECT * FROM VALUES
    -- this id has visted once
    (1, '2022-04-14'::date),
    -- this id has visited 3 times
    (3, '2022-01-13'::date),
    (3, '2022-03-13'::date),
    (3, '2022-05-13'::date),
    -- this id is a huge vistor
    (5, '2022-01-01'::date),
    (5, '2022-02-01'::date),
    (5, '2022-05-01'::date),
    (5, '2022-06-01'::date),
    (5, '2022-08-01'::date)
)
select * from (
    select 
    count_if("change" between -90 and 90) over (partition by ID, t1.DATE) "c",
    *
    from fake_data as t1
    left outer join lateral (
        select t1.DATE - t.DATE "change", t.DATE "t_DATE" 
        from fake_data AS t
        where t1.id = t.id and t1.DATE - t.DATE between -90 and 9
     ) as t2
    order by ID, t1.DATE, "change"
)
where "change" = 0;

Result (change and t_DATE are just included for reference):

c	ID	DATE	t_DATE
1	1	2022-04-14	2022-04-14
2	3	2022-01-13	2022-01-13
3	3	2022-03-13	2022-03-13
2	3	2022-05-13	2022-04-13
2	5	2022-01-01	2022-01-01
3	5	2022-02-01	2022-02-01
3	5	2022-05-01	2022-05-01
3	5	2022-06-01	2022-06-01
2	5	2022-08-01	2022-08-01

This is what I like to do but it doesn't seem like the current row's DATE is available (or I can use alias):

select
  count_if(DATE - d between -90 and 90) over (partition by id, DATE  as d) as "c",
  id,
  date
from fake_data;

You want `RANGE BETWEEN`, which would allow you to specify a sliding window that includes only rows where the date is within +/- 90 days. Snowflake doesn't support it, so you need to revert to alternative methods. https://stackoverflow.com/questions/64337440/snowflake-sql-range-between-interval — MatBailie, Aug 25 '22 at 22:20

Simeon Pilgrim · Accepted Answer · 2022-08-25T22:09:24.423

hmm,

even though it's not what you want, your SQL can be written:

select 
    t1.DATE - t.DATE as change
    ,count_if(abs(t1.DATE - t.DATE) <= 90) over (partition by t1.ID, t1.DATE) as c
    ,t1.*
    ,t.date as t_date
from fake_data as t1
left join fake_data as t
    on t1.id = t.id and abs(t1.DATE - t.DATE) <= 90
qualify change = 0
order by t1.ID, t1.DATE, change

But given the join is the same as your count_if, that can also be written:

select 
    t1.DATE - t.DATE as change
    ,count(*) over (partition by t1.ID, t1.DATE) as c
    ,t1.*
    ,t.date as t_date
from fake_data as t1
left join fake_data as t
    on t1.id = t.id and abs(t1.DATE - t.DATE) <= 90
qualify change = 0
order by t1.ID, t1.DATE, change

But given Window functions do not have a "this row" as you wish for the time range, you could work around that by using a Javascript UDTF and for each row build a in memory set, and do the count via that, and then emit that in the final, and then join to that.

At which point you might as well explode the data and do equi-joins in raw SQL, which for +90,-90 days, is likely to still be rather fast for massive data, as compared to self joins

Thus for massive data, this should perform much better:

WITH fake_data(id, DATE) as (
    SELECT * FROM VALUES
    -- this id has visted once
    (1, '2022-04-14'::date),
    -- this id has visited 3 times
    (3, '2022-01-13'::date),
    (3, '2022-03-13'::date),
    (3, '2022-05-13'::date),
    -- this id is a huge vistor
    (5, '2022-01-01'::date),
    (5, '2022-02-01'::date),
    (5, '2022-05-01'::date),
    (5, '2022-06-01'::date),
    (5, '2022-08-01'::date)
), range as (
    select row_number() over (order by null)-91 as rn
    from table(generator(ROWCOUNT => 181))
), exploded as (
    select
        id, 
        dateadd('day', e.rn, d.date) as t_date
    from fake_data as d
    cross join range as e
)
select
    f.*
    ,count(t_date) as c
from fake_data as f
join exploded as e
    on f.id = e.id and f.date = t_date
group by f.id, f.date
order by f.id, f.DATE
;

ID	DATE	C
1	2022-04-14	1
3	2022-01-13	2
3	2022-03-13	3
3	2022-05-13	2
5	2022-01-01	2
5	2022-02-01	3
5	2022-05-01	3
5	2022-06-01	3
5	2022-08-01	2

Could be done with window functions using `RANGE BETWEEN`, but Snowflake doesn't implement it. So, the alternatives, such as in this answer, are necessary. — MatBailie, Aug 25 '22 at 22:24
The WINDOW FRAME ranges does support absolute ranges, thus if you had dense data, you could use ROW PREVIOUS 90 AND ROW AFTER 90. — Simeon Pilgrim, Aug 25 '22 at 22:48
This is part of an rather large `view` so I can't use `group by`. This could however work: `select DISTINCT f.* ,count(c) over (partition by f.id, f.DATE) as c from fake_data as f join (select id, dateadd('day', rn, d.date) as c from fake_data as d cross join (select row_number() over (order by null)-91 as rn from table(generator(ROWCOUNT => 181))) ) as e on f.id = e.id and f.date = c order by f.id, f.DATE ` — Fredrik Erlandsson, Aug 26 '22 at 06:30
Distinct is a group by. Why do you believe you cannot use a group by? — Simeon Pilgrim, Aug 26 '22 at 19:39

Use condition in partition by window in SnowFlake

1 Answers1