Get running count and sums of value of all rows that are started but not yet finished for each second

Question

I have some event data that looks like this:

| time                    | id | status   | value |
|-------------------------|----|----------|-------|
| 2020-08-26T21:29:01.000 | 2  | started  | 8     |
| 2020-08-26T21:29:01.000 | 3  | started  | 4     |
| 2020-08-26T21:29:02.000 | 2  | finished | 8     |
| 2020-08-26T21:29:03.000 | 4  | started  | 12    |
| 2020-08-26T21:29:04.000 | 5  | started  | 2     |
| 2020-08-26T21:29:05.000 | 6  | started  | 24    |
| 2020-08-26T21:29:06.000 | 4  | finished | 12    |
| 2020-08-26T21:29:06.000 | 3  | finished | 4     |
| 2020-08-26T21:29:07.000 | 1  | finished | 1     |
| 2020-08-26T21:29:10.000 | 7  | started  | 4     |

Note that the event data was started recording after things had started and there are events that haven't yet finished.

I'm trying to then get a running count of the rows and running sum of the values per second.

As soon as I think running count I think of window queries but I'm struggling to work out how I get from this data to the output I expect.

I'd ideally then expect to get the following results:

| time                    | count | sum_values |
|-------------------------|-------|------------|
| 2020-08-26T21:29:00.000 | 1     | 1          |
| 2020-08-26T21:29:01.000 | 3     | 13         |
| 2020-08-26T21:29:02.000 | 2     | 5          |
| 2020-08-26T21:29:03.000 | 3     | 17         |
| 2020-08-26T21:29:04.000 | 4     | 19         |
| 2020-08-26T21:29:05.000 | 5     | 43         |
| 2020-08-26T21:29:06.000 | 3     | 29         |
| 2020-08-26T21:29:07.000 | 2     | 28         |
| 2020-08-26T21:29:08.000 | 2     | 28         |
| 2020-08-26T21:29:09.000 | 2     | 28         |
| 2020-08-26T21:29:10.000 | 3     | 32         |
| 2020-08-26T21:29:11.000 | 3     | 32         |

I'd also be happy enough with an answer that didn't take into account the 1 id record that was running before events began recording which would then have the following results:

| time                    | count | sum_values |
|-------------------------|-------|------------|
| 2020-08-26T21:29:00.000 | 0     | 0          |
| 2020-08-26T21:29:01.000 | 2     | 12         |
| 2020-08-26T21:29:02.000 | 1     | 4          |
| 2020-08-26T21:29:03.000 | 2     | 16         |
| 2020-08-26T21:29:04.000 | 3     | 18         |
| 2020-08-26T21:29:05.000 | 4     | 42         |
| 2020-08-26T21:29:06.000 | 2     | 28         |
| 2020-08-26T21:29:07.000 | 2     | 28         |
| 2020-08-26T21:29:08.000 | 2     | 28         |
| 2020-08-26T21:29:09.000 | 2     | 28         |
| 2020-08-26T21:29:10.000 | 3     | 32         |
| 2020-08-26T21:29:11.000 | 3     | 32         |

As Athena/Presto doesn't support full joins I was able to get the start and stop times for each id with the following query (also on SQL Fiddle):

WITH started AS (
  SELECT *
  FROM foo
  WHERE status = 'started'
), finished AS (
  SELECT *
  FROM foo
  WHERE status = 'finished'
)
SELECT started.time AS started_time, finished.time AS finished_time, started.id, started.value
FROM started LEFT JOIN finished ON started.id = finished.id

Are there any gaps (seconds) in the time data that you want in the result set ? If so you will also need to join to a 'tally' table' — Richard, Sep 26 '20 at 14:02
Yes, there will be gaps. I can edit the question to make this clearer too. — ydaetskcoR, Sep 26 '20 at 14:12
The timestamps are also millisecond precise but I'm going to `date_trunc` them all anyway so you can consider it to just be second precise. — ydaetskcoR, Sep 26 '20 at 14:25

score 1 · Accepted Answer · edited Sep 27 '20 at 08:52

1

I think you want a cumulative conditional sum:

select time,
       sum(sum(case when status = 'started' then 1
                    when status = 'finished' then -1
               end)
          ) over (order by time) as running_count,
       sum(sum(case when status = 'started' then value
                    when status = 'finished' then - value
                end)
          ) over (order by time) as running_value
from foo
group by time

The sum()s need to be nested because one is needed for the window function and the other is for the aggregation.

edited Sep 27 '20 at 08:52

ydaetskcoR

53,225
8
158
177

answered Sep 26 '20 at 13:50

Gordon Linoff

1,242,037
58
646
786

At a glance this looks like it should run to me but when I run this query in Athena I get the following error: `Error running query: SYNTAX_ERROR: line 2:8: '"sum"((CASE WHEN ("status" = 'started') THEN 1 WHEN ("status" = 'finished') THEN -1 END)) OVER (ORDER BY "time" ASC)' must be an aggregate expression or appear in GROUP BY clause` – ydaetskcoR Sep 26 '20 at 14:23
@ydaetskcoR . . . Fixed. The `sum()`s need to be nested, one for the aggregation and one for the window function. – Gordon Linoff Sep 26 '20 at 15:19

Get running count and sums of value of all rows that are started but not yet finished for each second

1 Answers1