
I have a PostgreSQL (9.4) table that contains timestamp ranges and user IDs, and I need to collapse any overlapping ranges (with the same user ID) into a single record.

I've tried a complicated set of CTEs to accomplish this, but there are some edge cases in our real table (40,000+ rows) that complicate matters. I've come to the conclusion that I probably need a recursive CTE, but I haven't had any luck writing one.

Here's some code to create a test table and populate it with data. This isn't the exact layout of our table, but it's close enough for an example.

CREATE TABLE public.test
(
  id serial,
  sessionrange tstzrange,
  fk_user_id integer
);

insert into test (sessionrange, fk_user_id)
values 
('[2016-01-14 11:57:01-05,2016-01-14 12:06:59-05]', 1)
,('[2016-01-14 12:06:53-05,2016-01-14 12:17:28-05]', 1)
,('[2016-01-14 12:17:24-05,2016-01-14 12:21:56-05]', 1)
,('[2016-01-14 18:18:00-05,2016-01-14 18:42:09-05]', 2)
,('[2016-01-14 18:18:08-05,2016-01-14 18:18:15-05]', 1)
,('[2016-01-14 18:38:12-05,2016-01-14 18:48:20-05]', 1)
,('[2016-01-14 18:18:16-05,2016-01-14 18:18:26-05]', 1)
,('[2016-01-14 18:18:24-05,2016-01-14 18:18:31-05]', 1)
,('[2016-01-14 18:18:12-05,2016-01-14 18:18:20-05]', 3)
,('[2016-01-14 19:32:12-05,2016-01-14 23:18:20-05]', 3)
,('[2016-01-14 18:18:16-05,2016-01-14 18:18:26-05]', 4)
,('[2016-01-14 18:18:24-05,2016-01-14 18:18:31-05]', 2);

I have found that I can do this to get the sessions sorted by the time they started:

select * from test order by fk_user_id, sessionrange

Using window functions, I can determine whether an individual record overlaps with the previous one:

SELECT *, sessionrange && lag(sessionrange) OVER (PARTITION BY fk_user_id ORDER BY sessionrange)
FROM test
ORDER BY fk_user_id, sessionrange

But this only detects whether the single previous record overlaps the current one (see the record where id = 6). I need to detect overlaps all the way back to the beginning of the partition.
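
One idea I've been toying with (untested sketch): replace lag() with a running maximum of all earlier upper bounds, so each row is compared against everything before it in the partition, not just the previous row. Since these ranges have inclusive bounds, a row overlaps some earlier row exactly when its lower bound does not exceed that running maximum:

SELECT *,
       lower(sessionrange) <= max(upper(sessionrange)) OVER (
           PARTITION BY fk_user_id
           ORDER BY sessionrange
           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
       ) AS overlaps_any_earlier   -- NULL for the first row of each partition
FROM test
ORDER BY fk_user_id, sessionrange;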

After that, I'd need to group all overlapping records together, to find the start of the earliest session and the end of the latest one.
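
Putting those two pieces together, something like the classic gaps-and-islands pattern might work: flag each row that doesn't overlap anything earlier, turn a running sum of those flags into a group number, and aggregate each group to its earliest start and latest end. A rough, untested sketch (the flagged / numbered / new_group / grp names are just placeholders I made up):

WITH flagged AS (
    SELECT *,
           CASE
               WHEN lower(sessionrange) <= max(upper(sessionrange)) OVER (
                        PARTITION BY fk_user_id
                        ORDER BY sessionrange
                        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
               THEN 0
               ELSE 1   -- no overlap with anything earlier: a new group starts here
           END AS new_group
    FROM test
),
numbered AS (
    SELECT *,
           sum(new_group) OVER (PARTITION BY fk_user_id
                                ORDER BY sessionrange
                                ROWS UNBOUNDED PRECEDING) AS grp
    FROM flagged
)
SELECT fk_user_id,
       tstzrange(min(lower(sessionrange)), max(upper(sessionrange)), '[]') AS sessionrange
FROM numbered
GROUP BY fk_user_id, grp
ORDER BY fk_user_id, 2;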

I'm sure there's a way to do this that I'm overlooking. How can I collapse these overlapping records?

Ben Wyatt

1 Answer

It is relatively easy to merge overlapping ranges as elements of an array. For simplicity, the following function returns a set of tstzrange:

create or replace function merge_ranges(tstzrange[])
returns setof tstzrange language plpgsql as $$
declare
    t tstzrange;
    r tstzrange;   -- range currently being accumulated
begin
    -- note: assumes the array elements are sorted by range start
    foreach t in array $1 loop
        if r && t then
            r:= r + t;               -- overlap: extend the accumulated range
        else
            if r notnull then
                return next r;       -- gap: emit the finished range
            end if;
            r:= t;                   -- start accumulating a new range
        end if;
    end loop;
    if r notnull then
        return next r;               -- emit the last accumulated range
    end if;
end $$;
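
A quick standalone check (the literal values here are made up for illustration); note that the function assumes the array is already sorted by range start, since it only compares each element against the range being accumulated:

select merge_ranges(array[
    '[2016-01-14 11:00:00-05,2016-01-14 12:00:00-05]',
    '[2016-01-14 11:30:00-05,2016-01-14 12:30:00-05]',
    '[2016-01-14 13:00:00-05,2016-01-14 14:00:00-05]'
]::tstzrange[]);
-- expected: two rows, the first two ranges merged into one, the third unchanged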

Just aggregate the ranges for each user (sorted, since the function relies on the array order) and use the function:

select fk_user_id, merge_ranges(array_agg(sessionrange order by sessionrange))
from test 
group by 1
order by 1, 2;

 fk_user_id |                    merge_ranges                     
------------+-----------------------------------------------------
          1 | ["2016-01-14 17:57:01+01","2016-01-14 18:21:56+01"]
          1 | ["2016-01-15 00:18:08+01","2016-01-15 00:18:15+01"]
          1 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:31+01"]
          1 | ["2016-01-15 00:38:12+01","2016-01-15 00:48:20+01"]
          2 | ["2016-01-15 00:18:00+01","2016-01-15 00:42:09+01"]
          3 | ["2016-01-15 00:18:12+01","2016-01-15 00:18:20+01"]
          3 | ["2016-01-15 01:32:12+01","2016-01-15 05:18:20+01"]
          4 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:26+01"]
(8 rows)    

Alternatively, the algorithm can be applied to the entire table in a single function loop. I'm not sure, but for a large dataset this method should be faster.

create or replace function merge_ranges_in_test()
returns setof test language plpgsql as $$
declare
    curr test;
    prev test;   -- row whose range is currently being accumulated
begin
    for curr in
        select *
        from test
        order by fk_user_id, sessionrange
    loop
        -- a new user always starts a new group
        if prev notnull and prev.fk_user_id <> curr.fk_user_id then
            return next prev;
            prev:= null;
        end if;
        if prev.sessionrange && curr.sessionrange then
            -- overlap: extend the accumulated range
            prev.sessionrange:= prev.sessionrange + curr.sessionrange;
        else
            -- gap (or first row): emit the finished row, start a new one
            if prev notnull then
                return next prev;
            end if;
            prev:= curr;
        end if;
    end loop;
    if prev notnull then
        return next prev;   -- emit the last accumulated row
    end if;
end $$;

Results:

select *
from merge_ranges_in_test();

 id |                    sessionrange                     | fk_user_id 
----+-----------------------------------------------------+------------
  1 | ["2016-01-14 17:57:01+01","2016-01-14 18:21:56+01"] |          1
  5 | ["2016-01-15 00:18:08+01","2016-01-15 00:18:15+01"] |          1
  7 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:31+01"] |          1
  6 | ["2016-01-15 00:38:12+01","2016-01-15 00:48:20+01"] |          1
  4 | ["2016-01-15 00:18:00+01","2016-01-15 00:42:09+01"] |          2
  9 | ["2016-01-15 00:18:12+01","2016-01-15 00:18:20+01"] |          3
 10 | ["2016-01-15 01:32:12+01","2016-01-15 05:18:20+01"] |          3
 11 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:26+01"] |          4
(8 rows)

The problem is very interesting. I tried to find a recursive solution, but it seems the procedural approach is the most natural and efficient.


I have finally found a recursive solution. The query deletes overlapping rows and inserts their compacted equivalents:

with recursive cte (user_id, ids, range) as (
    -- seed: every pair of overlapping rows belonging to the same user
    select t1.fk_user_id, array[t1.id, t2.id], t1.sessionrange + t2.sessionrange
    from test t1
    join test t2
        on t1.fk_user_id = t2.fk_user_id 
        and t1.id < t2.id
        and t1.sessionrange && t2.sessionrange
union all
    -- recursion: extend each chain with further overlapping rows (higher id)
    select user_id, ids || t.id, range + sessionrange
    from cte
    join test t
        on user_id = t.fk_user_id 
        and ids[cardinality(ids)] < t.id
        and range && t.sessionrange
    ),
list as (
    -- for every id involved in an overlap, keep the widest merged range
    select distinct on(id) id, range, user_id
    from cte, unnest(ids) id
    order by id, upper(range) - lower(range) desc
    ),
deleted as (
    -- drop all rows that take part in an overlap
    delete from test
    where id in (select id from list)
    )
-- and insert one compacted row per merged range
insert into test
select distinct on (range) id, range, user_id
from list
order by range, id;
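
Since the statement deletes and re-inserts rows, it can be convenient to run it inside an explicit transaction first, so the outcome can be inspected before committing (just a convenience, not part of the solution):

begin;

-- ... the recursive delete/insert statement from above ...

select * from test order by fk_user_id, sessionrange;

rollback;   -- or commit, once the result looks right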

Results:

select *
from test
order by 3, 2;

 id |                    sessionrange                     | fk_user_id 
----+-----------------------------------------------------+------------
  1 | ["2016-01-14 17:57:01+01","2016-01-14 18:21:56+01"] |          1
  5 | ["2016-01-15 00:18:08+01","2016-01-15 00:18:15+01"] |          1
  7 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:31+01"] |          1
  6 | ["2016-01-15 00:38:12+01","2016-01-15 00:48:20+01"] |          1
  4 | ["2016-01-15 00:18:00+01","2016-01-15 00:42:09+01"] |          2
  9 | ["2016-01-15 00:18:12+01","2016-01-15 00:18:20+01"] |          3
 10 | ["2016-01-15 01:32:12+01","2016-01-15 05:18:20+01"] |          3
 11 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:26+01"] |          4
(8 rows)
klin
  • I ended up going with the first solution, since it didn't need to be adapted at all to fit my real schema. It's really easy to work with and appears to be correct. I need to do some additional testing to be sure, but I think I'll be coming back here later today to accept your answer. Thank you! – Ben Wyatt Jul 15 '16 at 14:11
  • Managed to do some testing, and it does appear that this combines everything the way I wanted. Thanks! – Ben Wyatt Jul 15 '16 at 14:23
  • Your question turned out to be a challenge for me. I couldn't stand that I couldn't do it without a function ;) – klin Jul 15 '16 at 14:28