I'm trying to get the number of distinct users for each event at a daily level while maintainig a running sum for every hour. I'm using Athena/Presto as the query engine.
I tried the following query:
SELECT
eventname,
date(from_unixtime(time_bucket)) AS date,
(time_bucket % 86400)/3600 as hour,
count,
SUM(count) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum_count
FROM (
SELECT
eventname,
CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
COUNT(DISTINCT moengageuserid) as count
FROM clickstream.moengage
WHERE date = '2020-08-20'
AND eventname IN ('e1', 'e2', 'e3', 'e4')
GROUP BY 1,2
ORDER BY 1,2
);
But on seeing the results I realized that taking SUM of COUNT DISTINCT is not correct as it's not additive.
So, I tried the below query
SELECT
eventname,
date(from_unixtime(time_bucket)) AS date,
(time_bucket % 86400)/3600 as hour,
SUM(COUNT(DISTINCT moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum
FROM (
SELECT
eventname,
CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
moengageuserid
FROM clickstream.moengage
WHERE date = '2020-08-20'
AND eventname IN ('e1', 'e2', 'e3', 'e4')
);
But this query fails with the following error:
SYNTAX_ERROR: line 5:99: ORDER BY expression '"time_bucket"' must be an aggregate expression or appear in GROUP BY clause