
Using PostgreSQL 14, I'm working on a table workplaces which models a tree of workplaces. Besides attributes like country_code, each workplace identifies its parent workplace via a parent_id foreign key (which is NULL in case there is no parent):

CREATE TABLE workplaces(
  id SERIAL PRIMARY KEY,
  parent_id INTEGER REFERENCES workplaces(id),
  ..
)

I'd now like to identify groups of workplaces in this table by picking a couple of interesting rows and then identifying all descendants (i.e. children, grandchildren, etc.) of those rows. My thinking was to build a table which maps workplace IDs to their 'root ancestor'. A recursive CTE does the job:

WITH RECURSIVE

"ancestors" AS (
  SELECT
    id AS id,
    id AS ancestor_or_self_id
  FROM
    workplaces
  WHERE country_code = 'DE' AND type = 'clinic' AND duplicate_of IS NULL

  UNION ALL (

  SELECT
    w.id AS id,
    a.ancestor_or_self_id AS ancestor_or_self_id

  FROM workplaces AS w
  JOIN ancestors AS a ON w.parent_id = a.id
  )
)
SELECT COUNT(*) FROM ancestors;
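For anyone wanting to reproduce the shape of this query without a PostgreSQL instance, here is a minimal sketch of the same recursive CTE against an in-memory SQLite database (the tiny four-row tree and its column values are made up purely for illustration; the real table has 295k rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE workplaces(
  id INTEGER PRIMARY KEY,
  parent_id INTEGER REFERENCES workplaces(id),
  country_code TEXT,
  type TEXT,
  duplicate_of INTEGER
);
INSERT INTO workplaces VALUES
  (1, NULL, 'DE', 'clinic', NULL),  -- root, matched by the base step
  (2, 1,    'DE', 'ward',   NULL),  -- child of 1
  (3, 2,    'DE', 'ward',   NULL),  -- grandchild of 1
  (4, NULL, 'FR', 'clinic', NULL);  -- unrelated root, not matched
""")

rows = conn.execute("""
WITH RECURSIVE ancestors AS (
  SELECT id, id AS ancestor_or_self_id
  FROM workplaces
  WHERE country_code = 'DE' AND type = 'clinic' AND duplicate_of IS NULL

  UNION ALL

  SELECT w.id, a.ancestor_or_self_id
  FROM workplaces AS w
  JOIN ancestors AS a ON w.parent_id = a.id
)
SELECT id, ancestor_or_self_id FROM ancestors ORDER BY id
""").fetchall()

print(rows)  # [(1, 1), (2, 1), (3, 1)]
```

Each descendant ends up paired with the root it was reached from, which is exactly the ID-to-root-ancestor mapping described above.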

What surprised me is how long this takes (about a second on my laptop). A few numbers to put this into perspective:

  • The workplaces table has 295k rows
  • The base step of the recursion yields 3758 rows.
  • The entire query produces 35431 rows.
  • The recursion is at most 4 levels deep at this point.

This is the query plan:

                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=789384.97..789384.98 rows=1 width=8) (actual time=795.210..795.246 rows=1 loops=1)
   CTE ancestors
     ->  Recursive Union  (cost=1000.00..723164.45 rows=2943134 width=16) (actual time=41.765..788.647 rows=35431 loops=1)
           ->  Gather  (cost=1000.00..23501.17 rows=3794 width=16) (actual time=41.764..54.481 rows=3758 loops=1)
                 Workers Planned: 2
                 Workers Launched: 2
                 ->  Parallel Seq Scan on workplaces  (cost=0.00..22121.77 rows=1581 width=16) (actual time=37.064..49.939 rows=1253 loops=3)
                       Filter: ((duplicate_of IS NULL) AND ((country_code)::text = 'DE'::text) AND ((type)::text = 'clinic'::text))
                       Rows Removed by Filter: 97135
           ->  Merge Join  (cost=58737.30..64080.06 rows=293934 width=16) (actual time=112.727..119.909 rows=5279 loops=6)
                 Merge Cond: (a.id = w.parent_id)
                 ->  Sort  (cost=3644.41..3739.26 rows=37940 width=16) (actual time=1.360..1.613 rows=5905 loops=6)
                       Sort Key: a.id
                       Sort Method: quicksort  Memory: 25kB
                       ->  WorkTable Scan on ancestors a  (cost=0.00..758.80 rows=37940 width=16) (actual time=0.001..0.344 rows=5905 loops=6)
                 ->  Materialize  (cost=55092.89..56568.70 rows=295163 width=16) (actual time=109.127..115.263 rows=43088 loops=6)
                       ->  Sort  (cost=55092.89..55830.80 rows=295163 width=16) (actual time=109.126..111.831 rows=43088 loops=6)
                             Sort Key: w.parent_id
                             Sort Method: external merge  Disk: 5696kB
                             ->  Seq Scan on workplaces w  (cost=0.00..23228.63 rows=295163 width=16) (actual time=0.007..67.569 rows=295163 loops=6)
   ->  CTE Scan on ancestors  (cost=0.00..58862.68 rows=2943134 width=0) (actual time=41.769..793.929 rows=35431 loops=1)
 Planning Time: 0.377 ms
 Execution Time: 797.738 ms

My reading of this is that most of the time is spent in sorting rows as part of the recursive step (as part of materializing the CTE): it appears PostgreSQL decides to not do this in memory but rather does a merge sort on disk. Why is that though -- shouldn't sorting a table with just two integers and row counts like what I'm dealing with easily fit into memory?

There is just one relevant index on the workplaces table, on the id column (by virtue of it being the primary key). I also tried adding an index on the parent_id column, but that didn't seem to help with anything.

The work_mem setting is documented to affect sorting: it is set to just 4MB on this instance. Bumping it to 32MB helps quite a bit -- it avoids the external disk merge and instead makes PostgreSQL do a quicksort in memory.
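In case it's useful, this is how I changed the setting for a single session while experimenting (32MB was an arbitrary pick, not a tuned value):

```sql
-- Raise work_mem for the current session only:
SET work_mem = '32MB';

-- ... re-run the recursive query and EXPLAIN ANALYZE it ...

-- Revert to the server default afterwards:
RESET work_mem;
```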

Is there a way to accelerate the query without throwing more memory at it? It puzzles me that it's ever trying to sort all 295k workplace rows - my assumption is that it would merely need to sort each recursive step in order to perform a merge sort (and each step yields a much smaller number of rows than 295k).

Frerich Raabe
    how much work_mem did you give the database? SHOW work_mem; – Frank Heikens May 16 '23 at 06:42
  • @FrankHeikens Thanks for that hint: it was set to 4MB. Bumping it to 32MB makes things go quite a bit faster, replacing an external disk merge with an in-memory quicksort. I wonder though: is there a way to accelerate this without increasing the work_mem value? – Frerich Raabe May 16 '23 at 06:47
  • By the way, did you create any indexes? The database now has to read the entire table many times and that is rather slow. And when ready, it throws away most of the data – Frank Heikens May 16 '23 at 06:47
  • @FrankHeikens Good question! I now added some paragraphs at the end to elaborate on the `work_mem` setting and available indices (short story: there is just an index on `id`, adding an index on `parent_id` didn't seem to make a difference). Are these some columns you have in mind for indices? – Frerich Raabe May 16 '23 at 06:55
  • Recursive CTEs aren't simple by definition. To fix the sorting issue you should try to avoid the extra sort altogether, not try to run it in memory. To do that the relevant columns should be indexed: `id`, `parent_id` and possibly `country_code` and `type` too. – Panagiotis Kanavos May 16 '23 at 07:05
  • "my assumption is that it would merely need to sort each recursive step in order to perform a merge sort" It is a merge join, not a merge sort. How is it supposed to do a merge join if only one of the join partners is sorted? – jjanes May 16 '23 at 15:15
  • This is too many questions. How to make it faster, why does sort work the way it does, why does it sort in the first place, why does it sort in one way versus a different way. – jjanes May 16 '23 at 15:18
  • @jjanes I'm sorry for the confusion - the main question is the one I posed in the title: why does PostgreSQL resort to an external sort here. The other questions are really just me thinking out loud, explaining what I considered and tried. As for the sorting: the emphasis on what you quote is on sorting the result of each recursive step (i.e. a couple dozen of rows) instead of sorting the entire `workplaces` table, which is what the `Sort (cost=55092.89..55830.80 rows=295163 width=16)` estimate in the query plan suggests. – Frerich Raabe May 19 '23 at 07:35

1 Answer


Without access to the data, I just have to guess what could work for indexing.

CREATE INDEX idx_workplaces_country_type 
ON workplaces(country_code, type) -- maybe first type and then country_code
WHERE duplicate_of IS NULL;

And definitely this one:

CREATE INDEX idx_workplaces_parent_id 
ON workplaces(parent_id);

Please share the new query plan so we can see whether anything changed or improved.


You could also include the id in the index. This could enable an index-only scan:

CREATE INDEX idx_workplaces_parent_id_include_id 
ON workplaces(parent_id) INCLUDE(id);

random_page_cost should represent the cost of random I/O relative to sequential I/O. The default value of 4 used to work well for rotating disks, but for SSDs you should pick a lower value. Something close to 1, like 1.1, is a good starting point.
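A sketch of how to adjust it, first per session for testing and then persistently (the 1.1 here is a starting point, not a tuned number):

```sql
-- Try it for the current session first:
SET random_page_cost = 1.1;

-- If it helps, persist it (requires superuser; applies to new sessions
-- after the configuration is reloaded):
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
```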

Frank Heikens
  • This will fix the real issue - why is an expensive sort needed to begin with? – Panagiotis Kanavos May 16 '23 at 07:13
  • Thanks a lot for the feedback! Adding an index on country_code/type/duplicate_of helps with the base case -- but the base case is very fast anyway, i.e. the index doesn't have a noticeable impact on runtime. An index on `parent_id` I already tried -- but that doesn't seem to get picked up. Here's the new query plan after applying the above two indices and running `VACUUM ANALYZE`: https://gist.github.com/frerich/dd64b0e5346e562405d79da44c41a08f – Frerich Raabe May 16 '23 at 07:33
  • @FrerichRaabe: Could you please share the results from EXPLAIN(ANALYZE, VERBOSE, BUFFERS, SETTINGS) for this statement? And I'm interested in the setting for random_page_cost – Frank Heikens May 16 '23 at 08:09
  • @FrankHeikens Sure - thanks again for your perseverance! The query plan generated by `EXPLAIN(ANALYZE, VERBOSE, BUFFERS, SETTINGS)` is available at https://gist.github.com/frerich/7141ac1e03008c1f1fe1926c9eb1317e . This is with `work_mem` on `4MB`. Running `SHOW random_page_cost;` prints `4`. – Frerich Raabe May 16 '23 at 09:29
  • @FrerichRaabe: Could you change random_page_cost to something like 1.1? And try again. You could also try (just for a test) to turn off enable_mergejoin: set enable_mergejoin to off; There is a massive overestimate in the number of records and you're running a default PostgreSQL configuration that is not well suited for performance. – Frank Heikens May 16 '23 at 09:37
  • @FrankHeikens Wow, amazing tips there! Setting random_page_cost to 1.1 brought the runtime down from 790ms to 170ms. I can see that indeed, it's estimating a much lower number of records now. Setting enable_mergejoin=off brings the runtime down even further, to ~90ms - it now uses the index on the parent_id! You can find the query plans at https://gist.github.com/frerich/48baa81cda87044f26d59a1d3c06a26b I'm puzzled! Is this something you'd typically set for individual queries? – Frerich Raabe May 16 '23 at 09:56
  • @FrerichRaabe: No, only in rare exceptions would I recommend changing a configuration parameter for a single query. The configuration should however match your hardware and usage pattern. The default configuration will never match because nobody has default hardware and default usage (whatever that may be). A different setting for random_page_cost is always recommended, turning off enable_mergejoin only for tests and in rare exceptions. A better configuration and (custom) statistics should solve issues with merge joins. – Frank Heikens May 16 '23 at 10:39
  • @FrerichRaabe: Try the new index using INCLUDE as well, to enable an index-only scan. – Frank Heikens May 16 '23 at 10:46
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/253675/discussion-between-frerich-raabe-and-frank-heikens). – Frerich Raabe May 16 '23 at 11:50