2

My question is similar to redshift: count distinct customers over window partition, but I have a rolling window partition.

My query looks like this, but COUNT with DISTINCT is not supported as a window function in Redshift:

select p_date, seconds_read, 
count(distinct customer_id) over (order by p_date rows between unbounded preceding and current row) as total_cumulative_customer
from table_x

My goal is to calculate the total number of unique customers up to each date (hence the rolling window).

I also tried the dense_rank() approach, but it simply fails because I cannot use a window function like this:

select p_date, max(total_cumulative_customer) over ()
from (select p_date, seconds_read, 
      dense_rank() over (order by customer_id rows between unbounded preceding and current row) as total_cumulative_customer -- WILL FAIL HERE
      from table_x) t

Any workaround or different approach would be helpful!

EDIT:

INPUT DATA sample

+------+----------+--------------+
| Cust |  p_date  | seconds_read |
+------+----------+--------------+
|    1 | 1-Jan-20 |           10 |
|    2 | 1-Jan-20 |           20 |
|    4 | 1-Jan-20 |           30 |
|    5 | 1-Jan-20 |           40 |
|    6 | 5-Jan-20 |           50 |
|    3 | 5-Jan-20 |           60 |
|    2 | 5-Jan-20 |           70 |
|    1 | 5-Jan-20 |           80 |
|    1 | 5-Jan-20 |           90 |
|    1 | 7-Jan-20 |          100 |
|    3 | 7-Jan-20 |          110 |
|    4 | 7-Jan-20 |          120 |
|    7 | 7-Jan-20 |          130 |
+------+----------+--------------+

Expected Output

+----------+--------------------------+------------------+--------------------------------------------+
|  p_date  | total_distinct_cum_cust  | sum_seconds_read |                  Comment                   |
+----------+--------------------------+------------------+--------------------------------------------+
| 1-Jan-20 |                        4 |              100 | total distinct cust = 4 i.e. 1,2,4,5       |
| 5-Jan-20 |                        6 |              450 | total distinct cust = 6 i.e. 1,2,3,4,5,6   |
| 7-Jan-20 |                        7 |              910 | total distinct cust = 7 i.e. 1,2,3,4,5,6,7 |
+----------+--------------------------+------------------+--------------------------------------------+
Pirate X

3 Answers

5

For this operation:

select p_date, seconds_read, 
       count(distinct customer_id) over (order by p_date rows between unbounded preceding and current row) as total_cumulative_customer
from table_x;

You can do pretty much what you want with two levels of aggregation:

select min_p_date,
       sum(count(*)) over (order by min_p_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(p_date) as min_p_date
      from table_x
      group by customer_id
     ) c
group by min_p_date;

Summing the seconds read as well is a bit tricky, but you can use the same idea:

select p_date,
       sum(sum(seconds_read)) over (order by p_date rows between unbounded preceding and current row) as seconds_read,
       sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by p_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, p_date, seconds_read,
             row_number() over (partition by customer_id order by p_date) as seqnum
      from table_x
     ) c
group by p_date;
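
As a quick sanity check, inlining the question's sample rows in a CTE (the CTE is only scaffolding for this example, assuming p_date is a DATE column) and running the last query should reproduce the expected output:

with table_x (customer_id, p_date, seconds_read) as (
      -- the sample rows posted in the question
      select 1, date '2020-01-01', 10  union all
      select 2, date '2020-01-01', 20  union all
      select 4, date '2020-01-01', 30  union all
      select 5, date '2020-01-01', 40  union all
      select 6, date '2020-01-05', 50  union all
      select 3, date '2020-01-05', 60  union all
      select 2, date '2020-01-05', 70  union all
      select 1, date '2020-01-05', 80  union all
      select 1, date '2020-01-05', 90  union all
      select 1, date '2020-01-07', 100 union all
      select 3, date '2020-01-07', 110 union all
      select 4, date '2020-01-07', 120 union all
      select 7, date '2020-01-07', 130
)
select p_date,
       sum(sum(seconds_read)) over (order by p_date rows between unbounded preceding and current row) as seconds_read,
       sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by p_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, p_date, seconds_read,
             row_number() over (partition by customer_id order by p_date) as seqnum
      from table_x
     ) c
group by p_date;
-- expected: 2020-01-01 -> 100 / 4, 2020-01-05 -> 450 / 6, 2020-01-07 -> 910 / 7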
Gordon Linoff
  • In the case of a rolling window where the same user appears in multiple windows, will this consider the user only once across all windows because of min(p_date)? – nescobar Jun 09 '21 at 14:47
  • @nescobar . . . The user is only considered on the minimum date. – Gordon Linoff Jun 09 '21 at 14:57
1

One workaround uses a subquery:

select p_date, seconds_read, 
    (
        select count(distinct t1.customer_id) 
        from table_x t1
        where t1.p_date <= t.p_date
    ) as total_cumulative_customer
from table_x t
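
If you also want one row per date with the cumulative seconds_read (as in the expected output), the same correlated-subquery pattern can drive both columns from a deduplicated list of dates — a sketch along the same lines (untested):

select d.p_date,
    (
        select count(distinct t1.customer_id)
        from table_x t1
        where t1.p_date <= d.p_date
    ) as total_distinct_cum_cust,
    (
        select sum(t2.seconds_read)
        from table_x t2
        where t2.p_date <= d.p_date
    ) as sum_seconds_read
from (select distinct p_date from table_x) d   -- one row per date
order by d.p_date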
GMB
  • My current table has 100B rows. Do you think this approach would be as good as the `count(distinct)` one? (I'm not questioning your skills, I genuinely don't know which one performs better) – Pirate X Oct 12 '20 at 23:51
  • @PirateX You don't have much choice here as count distinct cannot be used as a window function. – Tim Biegeleisen Oct 13 '20 at 00:04
1

I'd like to add that you can also accomplish this with an explicit self join, which is, in my opinion, more straightforward and readable than the subquery approaches described in the other answers.

select 
  t1.p_date, 
  sum(t2.seconds_read) as sum_seconds_read, 
  count(distinct t2.customer_id) as distinct_cum_cust_totals
from
  table_x t1
join
  table_x t2
on
  t2.p_date <= t1.p_date
group by
  t1.p_date

Most query planners will reduce a correlated subquery like the ones in the solutions above to an efficient join like this, so either solution is usually fine. For the general case, though, I believe this is the better approach, since some engines (like BigQuery) won't allow correlated subqueries and will force you to define the join explicitly in your query.
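
One caveat: when several rows share the same p_date (as in the sample data), joining the raw table to itself counts each t2 row once per matching t1 row, which inflates sum(t2.seconds_read), even though count(distinct ...) is unaffected. Deduplicating the driving side first avoids that — a sketch (untested):

select 
  d.p_date, 
  sum(t2.seconds_read) as sum_seconds_read, 
  count(distinct t2.customer_id) as distinct_cum_cust_totals
from
  (select distinct p_date from table_x) d  -- one driving row per date
join
  table_x t2
on
  t2.p_date <= d.p_date
group by
  d.p_date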

  • I was doing rolling, distinct string aggregations using psql, and this approach allowed me to replace 3 CTEs with 1 and also drop all the `nest` and `unnest` functions, which helped a lot for readability – Scott Worland May 02 '22 at 14:30