2

I'm trying to get the number of distinct users for each event at a daily level while maintainig a running sum for every hour. I'm using Athena/Presto as the query engine.

I tried the following query:

SELECT
    eventname,
    date(from_unixtime(time_bucket)) AS date,
    (time_bucket % 86400)/3600 as hour,
    count,
    SUM(count) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum_count
FROM (
    SELECT 
        eventname,
        CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
        COUNT(DISTINCT moengageuserid) as count
    FROM clickstream.moengage
    WHERE date = '2020-08-20'
    AND eventname IN ('e1', 'e2', 'e3', 'e4')
    GROUP BY 1,2
    ORDER BY 1,2
);

But on seeing the results I realized that taking SUM of COUNT DISTINCT is not correct as it's not additive.

So, I tried the below query

SELECT
    eventname,
    date(from_unixtime(time_bucket)) AS date,
    (time_bucket % 86400)/3600 as hour,
    SUM(COUNT(DISTINCT moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum
FROM (
    SELECT
        eventname,
        CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
        moengageuserid
    FROM clickstream.moengage
    WHERE date = '2020-08-20'
    AND eventname IN ('e1', 'e2', 'e3', 'e4')
);

But this query fails with the following error:

SYNTAX_ERROR: line 5:99: ORDER BY expression '"time_bucket"' must be an aggregate expression or appear in GROUP BY clause
leftjoin
  • 36,950
  • 8
  • 57
  • 116
nish
  • 6,952
  • 18
  • 74
  • 128
  • 1
    Are you using MySQL or SQLServer? Please flag only one DB. Furthermore, it would help if you could add the output you got and the output you would expect. Last but not least, some sample data would be great. – Tyron78 Feb 02 '21 at 07:05

2 Answers2

1

Count the first time a user appears for the running distinct count:

SELECT eventname, date(from_unixtime(time_bucket)) AS date,
       (time_bucket % 86400)/3600 as hour,
       COUNT(DISTINCT moengageuserid) as hour_cont,
       SUM(CASE WHEN seqnunm = 1 THEN 1 ELSE 0 END) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY time_bucket) AS running_distinct_count
FROM (SELECT eventname,
             CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
             moengageuserid as hour_count,
             ROW_NUMBER() OVER (PARTITION BY eventname, moengageuserid ORDER BY eventtimestamp) as seqnum
      FROM clickstream.moengage
      WHERE date = '2020-08-20' AND
            eventname IN ('e1', 'e2', 'e3', 'e4')
    ) m
GROUP BY 1, 2, 3
ORDER BY 1, 2;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

To calculate running distinct count you can collect user IDs into set (distinct array) and get the size:

cardinality(set_agg(moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum

This is analytic function and will assign the same value to the whole partition (eventname, date), you can aggregate records in upper subquery using max(), etc.

leftjoin
  • 36,950
  • 8
  • 57
  • 116