
Description

System: Postgres 13 + TimescaleDB

I have a time series of messages containing error codes generated by devices at intervals of 300 seconds. This time series should be aggregated so that consecutive error codes (spanning several contiguous intervals) from the same device are grouped and their intervals combined into one.

Source Format

| ts | code | device |
| --- | --- | --- |
| 2023-03-01 12:00:00 | 4 | A |
| 2023-03-01 12:00:00 | 3 | B |
| 2023-03-01 12:05:00 | 3 | B |
| 2023-03-01 12:10:00 | 3 | B |
| 2023-03-01 12:10:00 | 4 | B |
| 2023-03-01 14:00:00 | 3 | A |
| 2023-03-01 14:30:00 | 4 | A |

Target Format

| valid_interval | code | device |
| --- | --- | --- |
| (2023-03-01 12:00:00, 2023-03-01 12:04:59) | 4 | A |
| (2023-03-01 12:00:00, 2023-03-01 12:14:59) | 3 | B |
| (2023-03-01 12:10:00, 2023-03-01 12:14:59) | 4 | B |
| (2023-03-01 14:00:00, 2023-03-01 14:04:59) | 3 | A |
| (2023-03-01 14:30:00, 2023-03-01 14:34:59) | 4 | A |

Progress

I have tried using [LAG()/LEAD()](https://www.postgresql.org/docs/13/functions-window.html) and PARTITION BY (code, device), but I cannot get it to work with the condition to aggregate only consecutive rows:

    SELECT ts,
           device,
           code,
           LEAD(ts) OVER (PARTITION BY device, code ORDER BY ts) as next_ts
    FROM   source_format;

DB-Fiddle

Schema (PostgreSQL v13)

    CREATE TABLE timeseries (
      ts timestamptz,
      code bigint,
      device varchar
    );
    
    INSERT INTO timeseries VALUES ('2023-03-01 12:00:00', 1, 'A');
    INSERT INTO timeseries VALUES ('2023-03-01 12:05:00', 1, 'A');
    INSERT INTO timeseries VALUES ('2023-03-01 12:10:00', 1, 'A');
    INSERT INTO timeseries VALUES ('2023-03-01 12:10:00', 2, 'A');
    INSERT INTO timeseries VALUES ('2023-03-01 12:25:00', 1, 'A');
    INSERT INTO timeseries VALUES ('2023-03-01 12:30:00', 1, 'A');
    
    INSERT INTO timeseries VALUES ('2023-03-01 12:00:00', 1, 'B');
    INSERT INTO timeseries VALUES ('2023-03-01 12:20:00', 1, 'B');
    INSERT INTO timeseries VALUES ('2023-03-01 12:20:00', 3, 'B');
    INSERT INTO timeseries VALUES ('2023-03-01 12:25:00', 3, 'B');

Query #1 (getting the time difference to the next message of the same device and code)

    SELECT ts,
           device,
           code,
           LEAD(ts) OVER (PARTITION BY device, code ORDER BY ts) - ts as diff_to_next_ts
    FROM   timeseries;
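
For reference, here is the result of Query #1 hand-computed from the fiddle data (row order may vary without an explicit ORDER BY; time zones omitted). A diff_to_next_ts greater than 5 minutes, or a NULL, marks the last message of a run:

    ts                   device  code  diff_to_next_ts
    2023-03-01 12:00:00  A       1     00:05:00
    2023-03-01 12:05:00  A       1     00:05:00
    2023-03-01 12:10:00  A       1     00:15:00
    2023-03-01 12:25:00  A       1     00:05:00
    2023-03-01 12:30:00  A       1     (null)
    2023-03-01 12:10:00  A       2     (null)
    2023-03-01 12:00:00  B       1     00:20:00
    2023-03-01 12:20:00  B       1     (null)
    2023-03-01 12:20:00  B       3     00:05:00
    2023-03-01 12:25:00  B       3     (null)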

Desired Result

| valid_interval | device | code |
| --- | --- | --- |
| (2023-03-01T12:00:00.000Z, 2023-03-01T12:10:00.000Z) | A | 1 |
| (2023-03-01T12:25:00.000Z, 2023-03-01T12:30:00.000Z) | A | 1 |
| (2023-03-01T12:10:00.000Z, 2023-03-01T12:10:00.000Z) | A | 2 |
| (2023-03-01T12:00:00.000Z, 2023-03-01T12:00:00.000Z) | B | 1 |
| (2023-03-01T12:20:00.000Z, 2023-03-01T12:20:00.000Z) | B | 1 |
| (2023-03-01T12:20:00.000Z, 2023-03-01T12:25:00.000Z) | B | 3 |

View on DB Fiddle

How can I go about including the condition and then "merging" the start and end of consecutive code messages into a single row with an interval? Is there a more fitting method? Would a PL/pgSQL function be more appropriate?

dev-
  • Hi, could you provide a fiddle with your data and your desired result - it's not quite clear to me **exactly** what you want - thanks! – Vérace Apr 03 '23 at 14:40
  • @Vérace I added the fiddle and the desired output. In the desired result you can see that consecutive rows of the same device and code (i.e. diff_to_next_ts <= 5 minutes) have been combined into one row. The valid_interval of this row then spans the first and last timestamp of these consecutive rows. – dev- Apr 03 '23 at 15:10
  • So, you want 1,2 for A and 1,3 for B? Could you put your desired result as text into the fiddle (using -- as a comment at the beginning of the line)? – Vérace Apr 03 '23 at 15:15
  • The desired result is already in exactly the required format. The rows cannot be combined further, as they are separate errors; e.g. device A has code 1 from 12:00 to 12:10, then again from 12:25. – dev- Apr 03 '23 at 16:22
  • Please see [here](https://www.db-fiddle.com/f/7CiNgd7iD2KoiocyMn19kb/4)! Add further explanation as necessary! – Vérace Apr 03 '23 at 16:32
  • @Vérace done, I hope it helps; otherwise I will come back later and try rewording it completely. – dev- Apr 03 '23 at 17:02

1 Answer


The following query produces the specified results (the time zone in the output may differ, since the inserted times carry no time zone while the column's type is timestamptz):

    -- labeled_ends: flag the messages that begin or end a run of
    -- consecutive 5-minute intervals per (device, code).
    WITH labeled_ends AS (
      SELECT
        (lag(ts.ts) OVER (PARTITION BY device, code ORDER BY ts.ts)
          = ts.ts - interval '5' minute) IS NOT TRUE AS begins_period,
        ts.ts,
        (lead(ts.ts) OVER (PARTITION BY device, code ORDER BY ts.ts)
          = ts.ts + interval '5' minute) IS NOT TRUE AS ends_period,
        ts.device,
        ts.code
      FROM timeseries ts
    ),
    -- periods: keep only the boundary rows and pair each run's first
    -- message with the time of its last message.
    periods AS (
      SELECT
        labeled_ends.ts,
        CASE WHEN labeled_ends.ends_period
          THEN labeled_ends.ts
          ELSE lead(labeled_ends.ts) OVER (PARTITION BY labeled_ends.device,
                 labeled_ends.code ORDER BY labeled_ends.ts)
        END AS period_end,
        labeled_ends.device,
        labeled_ends.code,
        labeled_ends.begins_period
      FROM labeled_ends
      WHERE labeled_ends.begins_period
         OR labeled_ends.ends_period
    )
    -- One row per run, as a closed timestamp range.
    SELECT
      tstzrange(periods.ts, periods.period_end, '[]') AS valid_interval,
      periods.device,
      periods.code
    FROM periods
    WHERE periods.begins_period
    ORDER BY
      periods.device,
      periods.code,
      periods.ts;

The query's first CTE, labeled_ends, determines whether each message in the time series begins or ends a period during which a device reports a specific code. The second CTE, periods, pairs each boundary row with the time of the last message in its contiguous series for each device and code. The final SELECT returns the range for each row that begins a contiguous series.
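
As an illustration, here is a hand-worked trace for device A, code 1 from the fiddle data (times shown without zone):

    -- labeled_ends for device A, code 1:
    --   ts        begins_period  ends_period
    --   12:00:00  true           false
    --   12:05:00  false          false
    --   12:10:00  false          true    (next message is 15 minutes away)
    --   12:25:00  true           false   (previous message is 15 minutes away)
    --   12:30:00  false          true    (no following message)
    --
    -- periods keeps the four boundary rows; its LEAD() is evaluated after the
    -- WHERE filter, so it pairs 12:00:00 with 12:10:00 and 12:25:00 with
    -- 12:30:00, yielding the two ranges for (A, 1).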

Consider changing the closed range tstzrange(periods.ts, periods.period_end, '[]') AS valid_interval to the half-open range tstzrange(periods.ts, periods.period_end + interval '5' minute, '[)') AS valid_interval. Doing so makes it easier to use range operators to determine overlaps and adjacency independently of the granularity of the time series. Renaming the column to valid_period might also be more descriptive, since "interval" commonly denotes an unanchored length of time.
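
As a sketch of that suggestion (assuming the 5-minute granularity from the question), only the final SELECT changes:

    -- Half-open variant: each range extends one granularity step past the
    -- last message, so back-to-back runs line up exactly.
    SELECT
      tstzrange(periods.ts, periods.period_end + interval '5' minute, '[)') AS valid_period,
      periods.device,
      periods.code
    FROM periods
    WHERE periods.begins_period
    ORDER BY periods.device, periods.code, periods.ts;

With half-open ranges, built-in operators such as && (overlaps) and -|- (adjacent) behave as expected:

    -- Two back-to-back half-open ranges are adjacent but do not overlap:
    SELECT r1 -|- r2 AS adjacent,   -- true
           r1 && r2  AS overlaps    -- false
    FROM (SELECT tstzrange('2023-03-01 12:00', '2023-03-01 12:15', '[)') AS r1,
                 tstzrange('2023-03-01 12:15', '2023-03-01 12:30', '[)') AS r2) t;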

JohnH
  • Looks very good, thanks. I will test more thoroughly tomorrow but I think your solution does the trick. Also nice explanation and tips, very helpful! – dev- Apr 04 '23 at 21:14