How to search table recursively by given path for a tree-like structure in postgresql-10

Question

I have a table like this:

+-----------------------------------+
| id | client_id | main_id |  name  |
|-----------------------------------|
| 1  | 1         | NULL    | hello  |
| 2  | 1         | 1       | hello2 |
| 3  | 1         | 2       | hello3 |
| 4  | 2         | NULL    | hello  |
| 5  | 2         | 4       | hello2 |
| 6  | 2         | 5       | hello3 |
+-----------------------------------+

What I want to get id:3 by giving /hello/hello2/hello3 and client_id because /hello3 belongs to hello2 and hello2 belongs to hello. When I give a full path I want to return last path ID.

This is my table schema:

CREATE TABLE "public"."paths" (
  "id" serial8,
  "client_id" int8,
  "main_id" int8,
  "name" varchar(255) NOT NULL,
  FOREIGN KEY ("main_id") REFERENCES "public"."paths" ("id")
)
;
-- INDEX ON CLIENT ID's.
CREATE INDEX "cid" ON "public"."paths" USING btree (
  "client_id"
);

So far, With recursive I tried this:

WITH RECURSIVE full_paths AS
(SELECT id, name, main_id, CAST(name As varchar(1000)) As fname
FROM paths
WHERE client_id = 1
UNION ALL
SELECT x.id, x.name, x.main_id, CAST(y.fname || '/' || x.name As varchar(1000)) As fname
FROM paths As x
    INNER JOIN full_paths AS y ON (x.main_id = y.id)
)
SELECT id, fname FROM full_paths WHERE fname = '/home/home2/home3';

But I have a million of records in my table and this slow downs the request by querying whole table.

Also see below for EXPLAIN:

CTE Scan on full_paths  (cost=4383987797.32..7008489047.29 rows=583222500 width=40) (actual time=1254.573..1675.192 rows=1 loops=1)
  Filter: (fname = '/home/home2/home3'::text)
  Rows Removed by Filter: 482943
  Buffers: shared hit=23754, temp read=8510 written=13548
  CTE full_paths
    ->  Recursive Union  (cost=0.00..4383987797.32 rows=116644499999 width=61) (actual time=0.015..1476.644 rows=482944 loops=1)
          Buffers: shared hit=23754, temp read=8510 written=10261
          ->  Seq Scan on paths  (cost=0.00..13955.49 rows=482999 width=42) (actual time=0.013..127.433 rows=482943 loops=1)
                Filter: (client_id = 24)
                Rows Removed by Filter: 3
                Buffers: shared hit=7918
          ->  Merge Join  (cost=966864.46..205108384.18 rows=11664401700 width=61) (actual time=600.989..600.990 rows=0 loops=2)
                Merge Cond: (x.main_id = y.id)
                Buffers: shared hit=15836, temp read=8510 written=6974
                ->  Sort  (cost=69904.11..71111.60 rows=482999 width=29) (actual time=276.900..360.597 rows=482946 loops=2)
                      Sort Key: x.main_id
                      Sort Method: external sort  Disk: 19848kB
                      Buffers: shared hit=15836, temp read=4962 written=4962
                      ->  Seq Scan on paths x  (cost=0.00..12747.99 rows=482999 width=29) (actual time=0.010..106.355 rows=482946 loops=2)
                            Buffers: shared hit=15836
                ->  Materialize  (cost=896960.36..921110.31 rows=4829990 width=40) (actual time=192.873..192.876 rows=3 loops=2)
                      Buffers: temp read=3548 written=2012
                      ->  Sort  (cost=896960.36..909035.33 rows=4829990 width=40) (actual time=191.121..191.122 rows=3 loops=2)
                            Sort Key: y.id
                            Sort Method: quicksort  Memory: 25kB
                            Buffers: temp read=3548 written=2012
                            ->  WorkTable Scan on full_paths y  (cost=0.00..96599.80 rows=4829990 width=40) (actual time=0.012..44.830 rows=241472 loops=2)
                                  Buffers: temp read=3289 written=1
Planning time: 0.261 ms
Execution time: 1685.199 ms

How can I write a proper and effective fast query? Do I need to write functions (which I do not know how I'll be happy if you provide a sample function)?

Is the `name` attribute unique? If so, you could just parse the query string for the last node before even running the query. — Zack, Oct 11 '18 at 16:41
@Zack, no, not unique. I've added one column and more rows to illustrate exact need. — Dennis, Oct 11 '18 at 16:42
Well, you've structured your query to make a gigantic table of all possible full paths, then to look for "/home/home2/home3". You're guaranteed to get faster if you, say, build a table of every path that starts with "home". — jmelesky, Oct 11 '18 at 16:45
@jmelesky, those each name can be renamed by client. That's why I separately added and connected via foreign key. If you believe this is wrong could you in light me so I can properly build my structure? :) — Dennis, Oct 11 '18 at 16:49
Not sure if this is viable for you, but what about representing the hierarchy path as a separate attribute, using [materialized path](https://bojanz.wordpress.com/2014/04/25/storing-hierarchical-data-materialized-path/)? — Zack, Oct 11 '18 at 16:50
I've looked it. It looks promising. I will try to illustrate a sample table and data for this scenario. But I'm still open for any help on above schema. — Dennis, Oct 11 '18 at 17:02
1.6 seconds doesn't seem so bad. However, you should create and index on `(main_id, id)` to support that query. You also might want to increase work_mem to get rid of the sorting done on disk — , Oct 11 '18 at 17:29
@a_horse_with_no_name, done as you said. Timing degraded from 1.6 to 1. But klin's solution is much faster. 0.2s. — Dennis, Oct 11 '18 at 17:38

klin · Accepted Answer · 2018-10-11T17:56:07.830

3

You should filter visited rows by appropriate parts (names) of the wanted path. Add an auxiliary query (pattern) to convert the input path to an array and use elements of the array to shed off unnecessary rows.

with recursive pattern(pattern) as (
    select string_to_array('hello/hello2/hello3', '/') -- input
),
full_paths as (
    select id, main_id, name, 1 as idx
    from paths
    cross join pattern
    where client_id = 1 and name = pattern[1]
union all
    select x.id, x.main_id, x.name, idx+ 1
    from paths as x
    cross join pattern
    inner join full_paths as y 
        on x.main_id = y.id 
        and x.name = pattern[idx+ 1]
)
select id, name
from full_paths
cross join pattern
where idx = cardinality(pattern)

Working example in rextester.

edited Oct 11 '18 at 17:56

answered Oct 11 '18 at 17:10

klin

112,967
15
204
232

thank you! I tried this solution and execution time degraded from 2 seconds to 0.2s. – Dennis Oct 11 '18 at 17:26
ah one more thing. If last folder not found then one previous (base/main) folder returns. How can I make sure exact path is returned otherwise NULL. – Dennis Oct 11 '18 at 17:40
1

The simplest way is to check whether the result contains all parts of pattern. See the updated answer. – klin Oct 11 '18 at 17:56
hey @klin i discovered that where conditions not working. See: https://rextester.com/UYZISZ95893. isok is true but false value returns. Any idea? – Dennis Nov 14 '18 at 23:18
1

Use the condition also in the recursive part - [rextester.](https://rextester.com/URKOZ53775) – klin Nov 15 '18 at 07:21
Thank you! Btw, If I do this, is it also correct? Instead of where i put them in "ON" See https://rextester.com/ALGA69594 – Dennis Nov 15 '18 at 14:23
1

Yes, it's ok. The optimizer will generate the same plans in both cases (in this particular query, it's not always true). – klin Nov 15 '18 at 16:02

How to search table recursively by given path for a tree-like structure in postgresql-10

1 Answers1