1

I have some event data that looks like this:

| time                    | id | status   | value |
|-------------------------|----|----------|-------|
| 2020-08-26T21:29:01.000 | 2  | started  | 8     |
| 2020-08-26T21:29:01.000 | 3  | started  | 4     |
| 2020-08-26T21:29:02.000 | 2  | finished | 8     |
| 2020-08-26T21:29:03.000 | 4  | started  | 12    |
| 2020-08-26T21:29:04.000 | 5  | started  | 2     |
| 2020-08-26T21:29:05.000 | 6  | started  | 24    |
| 2020-08-26T21:29:06.000 | 4  | finished | 12    |
| 2020-08-26T21:29:06.000 | 3  | finished | 4     |
| 2020-08-26T21:29:07.000 | 1  | finished | 1     |
| 2020-08-26T21:29:10.000 | 7  | started  | 4     |

Note that the event data was started recording after things had started and there are events that haven't yet finished.

I'm trying to then get a running count of the rows and running sum of the values per second.

As soon as I think running count I think of window queries but I'm struggling to work out how I get from this data to the output I expect.

I'd ideally then expect to get the following results:

| time                    | count | sum_values |
|-------------------------|-------|------------|
| 2020-08-26T21:29:00.000 | 1     | 1          |
| 2020-08-26T21:29:01.000 | 3     | 13         |
| 2020-08-26T21:29:02.000 | 2     | 5          |
| 2020-08-26T21:29:03.000 | 3     | 17         |
| 2020-08-26T21:29:04.000 | 4     | 19         |
| 2020-08-26T21:29:05.000 | 5     | 43         |
| 2020-08-26T21:29:06.000 | 3     | 29         |
| 2020-08-26T21:29:07.000 | 2     | 28         |
| 2020-08-26T21:29:08.000 | 2     | 28         |
| 2020-08-26T21:29:09.000 | 2     | 28         |
| 2020-08-26T21:29:10.000 | 3     | 32         |
| 2020-08-26T21:29:11.000 | 3     | 32         |

I'd also be happy enough with an answer that didn't take into account the 1 id record that was running before events began recording which would then have the following results:

| time                    | count | sum_values |
|-------------------------|-------|------------|
| 2020-08-26T21:29:00.000 | 0     | 0          |
| 2020-08-26T21:29:01.000 | 2     | 12         |
| 2020-08-26T21:29:02.000 | 1     | 4          |
| 2020-08-26T21:29:03.000 | 2     | 16         |
| 2020-08-26T21:29:04.000 | 3     | 18         |
| 2020-08-26T21:29:05.000 | 4     | 42         |
| 2020-08-26T21:29:06.000 | 2     | 28         |
| 2020-08-26T21:29:07.000 | 2     | 28         |
| 2020-08-26T21:29:08.000 | 2     | 28         |
| 2020-08-26T21:29:09.000 | 2     | 28         |
| 2020-08-26T21:29:10.000 | 3     | 32         |
| 2020-08-26T21:29:11.000 | 3     | 32         |

As Athena/Presto doesn't support full joins I was able to get the start and stop times for each id with the following query (also on SQL Fiddle):

WITH started AS (
  SELECT *
  FROM foo
  WHERE status = 'started'
), finished AS (
  SELECT *
  FROM foo
  WHERE status = 'finished'
)
SELECT started.time AS started_time, finished.time AS finished_time, started.id, started.value
FROM started LEFT JOIN finished ON started.id = finished.id
ydaetskcoR
  • 53,225
  • 8
  • 158
  • 177
  • Are there any gaps (seconds) in the time data that you want in the result set ? If so you will also need to join to a 'tally' table' – Richard Sep 26 '20 at 14:02
  • Yes, there will be gaps. I can edit the question to make this clearer too. – ydaetskcoR Sep 26 '20 at 14:12
  • The timestamps are also millisecond precise but I'm going to `date_trunc` them all anyway so you can consider it to just be second precise. – ydaetskcoR Sep 26 '20 at 14:25

1 Answers1

1

I think you want a cumulative conditional sum:

select time,
       sum(sum(case when status = 'started' then 1
                    when status = 'finished' then -1
               end)
          ) over (order by time) as running_count,
       sum(sum(case when status = 'started' then value
                    when status = 'finished' then - value
                end)
          ) over (order by time) as running_value
from foo
group by time

The sum()s need to be nested because one is needed for the window function and the other is for the aggregation.

ydaetskcoR
  • 53,225
  • 8
  • 158
  • 177
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • At a glance this looks like it should run to me but when I run this query in Athena I get the following error: `Error running query: SYNTAX_ERROR: line 2:8: '"sum"((CASE WHEN ("status" = 'started') THEN 1 WHEN ("status" = 'finished') THEN -1 END)) OVER (ORDER BY "time" ASC)' must be an aggregate expression or appear in GROUP BY clause` – ydaetskcoR Sep 26 '20 at 14:23
  • @ydaetskcoR . . . Fixed. The `sum()`s need to be nested, one for the aggregation and one for the window function. – Gordon Linoff Sep 26 '20 at 15:19