PostgreSQL efficiently find last decendant in linear list

Question

I currently try to retrieve the last decendet efficiently from a linked list like structure.

Essentially there's a table with a data series, with certain criteria I split it up to get a list like this

current_id | next_id

for example

1  | 2
2  | 3
3  | 4
4  | NULL
42 | 43
43 | 45
45 | NULL
etc...

would result in lists like

1 -> 2 -> 3 -> 4

and

42 -> 43 -> 45

Now I want to get the first and the last id from each of those lists.

This is what I have right now:

WITH RECURSIVE contract(ruid, rdid, rstart_ts, rend_ts) AS ( -- recursive Query to traverse the "linked list" of continuous timestamps
    SELECT start_ts, end_ts FROM track_caps tc
    UNION
    SELECT c.rstart_ts, tc.end_ts AS end_ts0 FROM contract c INNER JOIN track_caps tc ON (tc.start_ts = c.rend_ts AND c.rend_ts IS NOT NULL AND tc.end_ts IS NOT NULL)
),
fcontract AS ( --final step, after traversing the "linked list", pick the largest timestamp found as the end_ts and the smallest as the start_ts
    SELECT DISTINCT ON(start_ts, end_ts) min(rstart_ts) AS start_ts, rend_ts AS end_ts
    FROM (
        SELECT rstart_ts, max(rend_ts) AS rend_ts FROM contract
        GROUP BY rstart_ts
    ) sq
    GROUP BY end_ts
)
SELECT * FROM fcontract
ORDER BY start_ts

In this case I just used timestamps which work fine for the given data.

Basically I just use a recursive query that walks through all the nodes until it reaches the end, as suggested by many other posts on StackOverflow and other sites. The next query removes all the sub-steps and returns what I want, like in the first list example: 1 | 4

Just for illustration, the produced result set by the recursive query looks like this:

As nicely as it works, it's quite a memory hog however which is absolutely unsurprising when looking at the results of EXPLAIN ANALYZE. For a dataset of roughly 42,600 rows, the recursive query produces a whopping 849,542,346 rows. Now it was actually supposed to process around 2,000,000 rows but with that solution right now it seems very unfeasible.

Did I just improperly use recursive queries? Is there a way to reduce the amount of data it produces?(like removing the sub-steps?) Or are there better single-query solutions to this problem?

Maybe I'm missing something but isn't that simply `where next_id is not null`? — , Mar 06 '17 at 22:04
@a_horse_with_no_name but how would I get the very first id of the list that belongs to the last id selected with the where clause? — rfreytag, Mar 06 '17 at 22:08

score 2 · Accepted Answer · answered Mar 06 '17 at 22:25

The main problem is that your recursive query doesn't properly filter the root nodes which is caused by the the model you have. So the non-recursive part already selects the entire table and then Postgres needs to recurse for each and every row of the table.

To make that more efficient only select the root nodes in the non-recursive part of your query. This can be done using:

select t1.current_id, t1.next_id, t1.current_id as root_id
from track_caps t1
where not exists (select * 
                  from track_caps t2
                  where t2.next_id = t1.current_id)

Now that is still not very efficient (compared to the "usual" where parent_id is null design), but at least makes sure the recursion doesn't need to process more rows then necessary.

To find the root node of each tree, just select that as an extra column in the non-recursive part of the query and carry it over to each row in the recursive part.

So you wind up with something like this:

with recursive contract as (
  select t1.current_id, t1.next_id, t1.current_id as root_id
  from track_caps t1
  where not exists (select * 
                    from track_caps t2
                    where t2.next_id = t1.current_id)
  union 
  select c.current_id, c.next_id, p.root_id
  from track_caps c
    join contract p on c.current_id = p.next_id
  and c.next_id is not null
)
select *
from contract
order by current_id;

Online example: http://rextester.com/DOABC98823

not bad, it halfed the the amount of rows. But you mentioned the other "design". Wouldn't I face pretty much the same issue there? There I would just easily get the first node instead and had to search for the last one. Here I have the opposite. Getting the last node is easy but I have to crawl through the list for the first node in a sense. — rfreytag, Mar 06 '17 at 22:59

PostgreSQL efficiently find last decendant in linear list

1 Answers1