2

I am trying count distinct id's within a day which is straight forward but how to count them for 7 day's back from the current date in the window, 30 day's back in the current window.

I am sliding back for 7/30 rows from the current row but I think it won't be accurate it could more/less than 7/30 days because I think it just slides by the number of rows so do you have any Idea of handling this?

--mockup data SnowSQL
with test  (date,id,a,b,c,d) as(
select $1 as date, $2 as id, $3 as a, $4 as b, $5 as c, $6 as d
from values 
('2019-07-15','x_1','ps','e','US','NA'),
('2019-07-15','x_2','ps','e','US','NA'),
('2019-07-15','x_2','ps','e','CA','NA'),
('2019-07-16','x_2','ps','e','CA','NA'),
('2019-07-16','x_3','c','xb','CH','AS'),
('2019-07-17','x_4','ps','e','US','NA'),
('2019-07-17','x_5','c','ps4','CH','AS'),
('2019-07-17','x_6','c','ps4','CH','AS'),
('2019-07-17','x_7','c','ns','CH','AS'),
('2019-07-18','x_7','c','ns','CH','AS'),
('2019-07-18','x_7','c','ns','CH','AS'),
('2019-07-18','x_7','c','ns','CH','AS'),
('2019-07-18','x_7','c','ns','CH','AS'),
('2019-07-19','x_8','c','ps','CH','AS'),
('2019-07-20','x_8','c','ps','CH','AS'),
('2019-07-21','x_8','c','ps','CH','AS'),
('2019-07-22','x_8','c','ps','CH','AS'),
('2019-07-29','x_8','c','ps','CH','AS'),
('2019-08-01','x_8','c','ps','CH','AS'),
('2019-08-02','x_9','c','ps','CH','AS'),
('2019-08-03','y_1','c','ps','CH','AS'),
('2019-08-04','y_8','c','ps','CH','AS'),
('2019-08-05','z_8','c','ps','CH','AS'),
('2019-08-06','a_8','c','ps','CH','AS'),
('2019-08-07','b_8','c','ps','CH','AS'),
('2019-08-07','c_8','c','ns','CH','AS'),
('2019-08-07','e_8','ps','e','US','NA'),
('2019-08-08','f_8','c','xb','CH','AS'),
('2019-08-08','f_8','c','xb','CH','AS'),
('2019-08-08','f_8','c','xb','CH','AS'),
('2019-08-09','f_8','c','xb','CH','AS'),
('2019-08-09','f_8','c','xb','CH','AS'),
('2019-08-09','f_8','c','xb','CH','AS'),
('2019-08-10','gx_8','c','ps','CH','AS'),
('2019-08-11','v_8','c','ps','CH','AS') )
--subquery
select 1 as part,
(
  select sum(x)
  from (
select count(distinct id) as x
from test
where
 date = '2019-07-16'
group by a,b,c,d) ) as dau,
(
  select sum(x)
  from (
select count(distinct id) as x
from test
where
 date between '2019-07-16'::date-interval '7 days' and '2019-07-16'
//    <= '2019-07-15'
//and date >= '2019-07-15'::date-interval '7 days'
group by a,b,c,d)) as w,
(select sum(x)
  from (
select count(distinct id) as x
from test
where
 date <= '2019-07-16'
and date >= '2019-07-16'::date-interval '30 days'
group by a,b,c,d)) as m
union
--window function
select 2,sum(dau),sum(w),sum(m)
from(
select 
date,
a,
b,
c,
d,
count(distinct id) over (Partition by date,a,b,c,d Order by date)as dau,
count(distinct id) over (Partition by a,b,c,d Order by date rows between 7 preceding  and current row) as w,
count(distinct id) over (Partition by a,b,c,d Order by date rows between 30 preceding  and current row) as m
from test
group by
date,
a,
b,
c,
d,
id)
where date='2019-07-16'
group by date
;

1st Part is doing the counts within the time frame by the subquery

2nd Part is doing the counts using the sliding function

expected result: both should get the same count for the same time frame.

  • Might look into using range instead of rows for the frame. – Shawn Aug 20 '19 at 18:07
  • (1) I don't fully understand what your query has to do with your question. (2) Some sample data *and* desired results would help. (3) Snowflake doesn't support `range` with sliding window frames, so you will probably need another (more expensive) mechanism. – Gordon Linoff Aug 20 '19 at 18:14
  • @GordonLinoff I am just counting distinct id's within a day, past 7/30 days 1) The above SQL is doing 3 things i) Generating sample data ii) Generating the desired result using subquery(1st part of the union) iii) Trying to generate the desired result using a window function – Ravi Satya Yenugula Aug 20 '19 at 18:41
  • @GordonLinoff 2) The SQL will generate sample data. This is specific to SnowSQL so use SnowSQL client to run the SQL. The desired result is the 1st part. So in the result the row with the value as 1 for the field "part" is the expected result 3) Can you give some more insights on how the code looks like for another mechanism if so it could help me to give a try. – Ravi Satya Yenugula Aug 20 '19 at 18:41
  • @Shawn, `Sliding window frame unsupported for function COUNT` – Ravi Satya Yenugula Aug 20 '19 at 18:46
  • As @GordonLinoff pointed out, you cannot use a sliding window to represent a date range in Snowflake unless you have exactly one row per date. Otherwise it can only represent a specific count of rows. Why can't you use your subquery formulation that works? – Stuart Ozer Aug 27 '19 at 02:21

3 Answers3

0

If I understand the question correctly, you can use the range arguments in the window function like this (I dumped your table values into a temp table for simplicity):

select distinct [date], id
      from #test
order by [date] desc

select [date], 
        count(*) over (order by date desc rows between current row and 7 following ) count_7_day,
        count(*) over (order by date desc rows between current row and 30 following ) count_30_day
from (select distinct [date], id
      from #test )x
order by [date] desc

hopefully this helps?

ben
  • 199
  • 4
0

with the date in a table

create table  test as(
select $1 as date, $2 as id, $3 as a, $4 as b, $5 as c, $6 as d
from values 
('2019-07-15','x_1','ps','e','US','NA'),
('2019-07-15','x_2','ps','e','US','NA'),
('2019-07-15','x_2','ps','e','CA','NA'),
('2019-07-16','x_2','ps','e','CA','NA'),
('2019-07-16','x_3','c','xb','CH','AS'),
('2019-07-17','x_4','ps','e','US','NA'),
('2019-07-17','x_5','c','ps4','CH','AS'),
('2019-07-17','x_6','c','ps4','CH','AS'),
('2019-07-17','x_7','c','ns','CH','AS'),
('2019-07-18','x_7','c','ns','CH','AS'),
('2019-07-18','x_7','c','ns','CH','AS'),
('2019-07-18','x_7','c','ns','CH','AS'),
('2019-07-18','x_7','c','ns','CH','AS'),
('2019-07-19','x_8','c','ps','CH','AS'),
('2019-07-20','x_8','c','ps','CH','AS'),
('2019-07-21','x_8','c','ps','CH','AS'),
('2019-07-22','x_8','c','ps','CH','AS'),
('2019-07-29','x_8','c','ps','CH','AS'),
('2019-08-01','x_8','c','ps','CH','AS'),
('2019-08-02','x_9','c','ps','CH','AS'),
('2019-08-03','y_1','c','ps','CH','AS'),
('2019-08-04','y_8','c','ps','CH','AS'),
('2019-08-05','z_8','c','ps','CH','AS'),
('2019-08-06','a_8','c','ps','CH','AS'),
('2019-08-07','b_8','c','ps','CH','AS'),
('2019-08-07','c_8','c','ns','CH','AS'),
('2019-08-07','e_8','ps','e','US','NA'),
('2019-08-08','f_8','c','xb','CH','AS'),
('2019-08-08','f_8','c','xb','CH','AS'),
('2019-08-08','f_8','c','xb','CH','AS'),
('2019-08-09','f_8','c','xb','CH','AS'),
('2019-08-09','f_8','c','xb','CH','AS'),
('2019-08-09','f_8','c','xb','CH','AS'),
('2019-08-10','gx_8','c','ps','CH','AS'),
('2019-08-11','v_8','c','ps','CH','AS') );

the way I have done this in the past where I wanted to graph the different day as rows, thus had the results as different rows, to allow by colour by the days column I solve this like this:

with distinct_data as (
    select distinct date, id
    from test
), days as ( -- set of days we want to range over
    select column1 as days from (values (1),(7),(30))
), windows_ends as ( -- last 60 days of date
    select dateadd('day',-SEQ8(),current_date) as win_date_end
    from table(generator(rowcount => 60))
), windows as (
    select d.days
        ,w.win_date_end
        ,dateadd('day',-d.days, w.win_date_end) as win_date_start
    from days as d
    join windows_ends as w
)
select w.days
    ,w.win_date_start
    ,w.win_date_end
    ,count(distinct(d.id)) as c_id
from distinct_data as d
join windows as w on d.date > w.win_date_Start and d.date <= w.win_date_end
group by 1,2,3
order by 3,1;

for my results I dropped rows that win_date_start was before the start of my data so I would not get 30 days that started 2 days after the data set started.

The above allow a dynamic amount of days (which I found really useful) but in the context of having three fixed results this I just added a pivot, as such..

with distinct_data as (
    select distinct date, id
    from test
), days as ( -- set of days we want to range over
    select column1 as days from (values (1),(7),(30))
), windows_ends as ( -- last 60 days of date
    select dateadd('day',-SEQ8(),current_date) as win_date_end
    from table(generator(rowcount => 60))
), windows as (
    select d.days
        ,w.win_date_end
        ,dateadd('day',-d.days, w.win_date_end) as win_date_start
    from days as d
    join windows_ends as w
), manyrows as (
  select w.days
      ,w.win_date_end
      ,count(distinct(d.id)) as c_id
  from distinct_data as d
  join windows as w on d.date > w.win_date_Start and d.date <= w.win_date_end
  group by 1,2
)
select win_date_end as date, "1" as d1, "7" as d7, "30" as d30
from manyrows
   pivot(sum(c_id) for days in (1,7,30)) as p
order by win_date_end;

but you can see how the results have the trailing effect for the last dates... which I would be inclined to filter out..

also this need the 60 from now filter to be changed

I feel this could be done more efficiently given your fixed 30 day window, like so

with distinct_data as (
    select distinct date
        ,id
    from test
), distinct_dates as (
    select distinct date
        ,dateadd('day',-30,date) as d30_s
    from distinct_data
), mixed_data as (
select d.date
    ,b.id
    ,datediff('days',b.date, d.date) as d_diff
    ,iff(d_diff < 1, id, NULL) as d1_id
    ,iff(d_diff < 7, id, NULL) as d7_id
    ,iff(d_diff < 30, id, NULL) as d30_id
from distinct_dates as d
join distinct_data as b on b.date >= d.d30_s and b.date <= d.date
)
select date
    ,count(distinct(d1_id)) as d1
    ,count(distinct(d7_id)) as d7
    ,count(distinct(d30_id)) as d30
from mixed_data
group by 1 order by 1;

but this only gives windowed sums for days present so first two methods gives a 7 day and 30 day for the 2019-7-23 where-as the later does not have that day.. so it depends on how you need want the data.

Simeon Pilgrim
  • 22,906
  • 3
  • 32
  • 45
0

If we use the current row we will not get the excepted output unless there is only 1-row per your case so we should use slide by a range if that function is available. If not join your data with date dimension and populate the data for your sliding window and use a case statement to get your windows(if you have multiple windows). Now as we have the required data use regular window functions on the case and apply count on it.