2

My question is similar to redshift: count distinct customers over window partition, but I have a rolling window partition.

My query looks like this, but COUNT with DISTINCT is not supported as a window function in Redshift:

select p_date, seconds_read, 
count(distinct customer_id) over (order by p_date rows between unbounded preceding and current row) as total_cumulative_customer
from table_x

My goal is to calculate the total number of unique customers up to each date (hence the rolling window).

I also tried the dense_rank() approach, but it simply fails because I cannot use a window function like this:

select p_date, max(total_cumulative_customer) over ()
from (select p_date, seconds_read, 
      dense_rank() over (order by customer_id rows between unbounded preceding and current row) as total_cumulative_customer -- WILL FAIL HERE
      from table_x) t

Any workaround or different approach would be helpful!

EDIT:

INPUT DATA sample

+------+----------+--------------+
| Cust |  p_date  | seconds_read |
+------+----------+--------------+
|    1 | 1-Jan-20 |           10 |
|    2 | 1-Jan-20 |           20 |
|    4 | 1-Jan-20 |           30 |
|    5 | 1-Jan-20 |           40 |
|    6 | 5-Jan-20 |           50 |
|    3 | 5-Jan-20 |           60 |
|    2 | 5-Jan-20 |           70 |
|    1 | 5-Jan-20 |           80 |
|    1 | 5-Jan-20 |           90 |
|    1 | 7-Jan-20 |          100 |
|    3 | 7-Jan-20 |          110 |
|    4 | 7-Jan-20 |          120 |
|    7 | 7-Jan-20 |          130 |
+------+----------+--------------+

Expected Output

+----------+--------------------------+------------------+--------------------------------------------+
|  p_date  | total_distinct_cum_cust  | sum_seconds_read |                  Comment                   |
+----------+--------------------------+------------------+--------------------------------------------+
| 1-Jan-20 |                        4 |              100 | total distinct cust = 4 i.e. 1,2,4,5       |
| 5-Jan-20 |                        6 |              450 | total distinct cust = 6 i.e. 1,2,3,4,5,6   |
| 7-Jan-20 |                        7 |              910 | total distinct cust = 7 i.e. 1,2,3,4,5,6,7 |
+----------+--------------------------+------------------+--------------------------------------------+
Pirate X

3 Answers

5

For this operation:

select p_date, seconds_read, 
       count(distinct customer_id) over (order by p_date rows between unbounded preceding and current row) as total_cumulative_customer
from table_x;

You can do pretty much what you want with two levels of aggregation:

select min_p_date,
       sum(count(*)) over (order by min_p_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(p_date) as min_p_date
      from table_x
      group by customer_id
     ) c
group by min_p_date;

Summing the seconds read as well is a bit tricky, but you can use the same idea:

select p_date,
       sum(sum(seconds_read)) over (order by p_date rows between unbounded preceding and current row) as seconds_read,
       sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by p_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, p_date, seconds_read,
             row_number() over (partition by customer_id order by p_date) as seqnum
      from table_x
     ) c
group by p_date;
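
As a quick sanity check, inlining the question's sample rows in a CTE (the CTE is only scaffolding for this example, assuming p_date is a DATE column) and running the last query should reproduce the expected output:

with table_x (customer_id, p_date, seconds_read) as (
      -- the sample rows posted in the question
      select 1, date '2020-01-01', 10  union all
      select 2, date '2020-01-01', 20  union all
      select 4, date '2020-01-01', 30  union all
      select 5, date '2020-01-01', 40  union all
      select 6, date '2020-01-05', 50  union all
      select 3, date '2020-01-05', 60  union all
      select 2, date '2020-01-05', 70  union all
      select 1, date '2020-01-05', 80  union all
      select 1, date '2020-01-05', 90  union all
      select 1, date '2020-01-07', 100 union all
      select 3, date '2020-01-07', 110 union all
      select 4, date '2020-01-07', 120 union all
      select 7, date '2020-01-07', 130
)
select p_date,
       sum(sum(seconds_read)) over (order by p_date rows between unbounded preceding and current row) as seconds_read,
       sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by p_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, p_date, seconds_read,
             row_number() over (partition by customer_id order by p_date) as seqnum
      from table_x
     ) c
group by p_date;
-- expected: 2020-01-01 -> 100 / 4, 2020-01-05 -> 450 / 6, 2020-01-07 -> 910 / 7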
Gordon Linoff
  • In the case of a rolling window where the same user appears in multiple windows, will this consider the user only once across all windows because of min(p_date)? – nescobar Jun 09 '21 at 14:47
  • @nescobar . . . The user is only considered on the minimum date. – Gordon Linoff Jun 09 '21 at 14:57
1

One workaround uses a subquery:

select p_date, seconds_read, 
    (
        select count(distinct t1.customer_id) 
        from table_x t1
        where t1.p_date <= t.p_date
    ) as total_cumulative_customer
from table_x t
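
If you also want one row per date with the cumulative seconds_read (as in the expected output), the same correlated-subquery pattern can drive both columns from a deduplicated list of dates — a sketch along the same lines (untested):

select d.p_date,
    (
        select count(distinct t1.customer_id)
        from table_x t1
        where t1.p_date <= d.p_date
    ) as total_distinct_cum_cust,
    (
        select sum(t2.seconds_read)
        from table_x t2
        where t2.p_date <= d.p_date
    ) as sum_seconds_read
from (select distinct p_date from table_x) d   -- one row per date
order by d.p_date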
GMB
  • My current table has 100B rows. Do you think this approach would be as good as the `count(distinct)` one? (I'm not questioning your skills, I genuinely don't know which one performs better) – Pirate X Oct 12 '20 at 23:51
  • @PirateX You don't have much choice here as count distinct cannot be used as a window function. – Tim Biegeleisen Oct 13 '20 at 00:04
1

I'd like to add that you can also accomplish this with an explicit self join, which is, in my opinion, more straightforward and readable than the subquery approaches described in the other answers.

select 
  t1.p_date, 
  sum(t2.seconds_read) as sum_seconds_read, 
  count(distinct t2.customer_id) as distinct_cum_cust_totals
from
  table_x t1
join
  table_x t2
on
  t2.p_date <= t1.p_date
group by
  t1.p_date

Most query planners will reduce a correlated subquery like the ones in the solutions above to an efficient join like this, so either solution is usually fine. For the general case, though, I believe this is the better approach, since some engines (like BigQuery) won't allow correlated subqueries and will force you to define the join explicitly in your query.
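
One caveat: when several rows share the same p_date (as in the sample data), joining the raw table to itself counts each t2 row once per matching t1 row, which inflates sum(t2.seconds_read), even though count(distinct ...) is unaffected. Deduplicating the driving side first avoids that — a sketch (untested):

select 
  d.p_date, 
  sum(t2.seconds_read) as sum_seconds_read, 
  count(distinct t2.customer_id) as distinct_cum_cust_totals
from
  (select distinct p_date from table_x) d  -- one driving row per date
join
  table_x t2
on
  t2.p_date <= d.p_date
group by
  d.p_date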

  • I was doing rolling, distinct string aggregations using psql, and this approach allowed me to replace 3 CTEs with 1 and also drop all the `nest` and `unnest` functions, which helped a lot for readability – Scott Worland May 02 '22 at 14:30