I have the following simplified schema in Postgres 12, whereby a user has access to many cfiles and every cfile can be accessed by many users.
SELECT * FROM cfiles
 id | name                | property_values (jsonb)                                        |
----+---------------------+----------------------------------------------------------------+
  1 | sampleA_aligned.bam | {"Project IDs": ["Proj1"], "Sample Names": ["sampA"]}          |
  2 | sampleA_aligned.bai | {"Project IDs": ["Proj1"], "Sample Names": ["sampA"]}          |
  3 | sampBC_qc.pdf       | {"Project IDs": ["Proj2"], "Sample Names": ["sampB", "sampC"]} |
SELECT * FROM users
id | email |
----+--------------+
1 | usr1@svr.com |
2 | usr2@svr.com |
SELECT * FROM user_permissions
id | cfile_id | user_id |
----+----------+---------+
1 | 1 | 1 |
2 | 2 | 1 |
3 | 2 | 2 |
I am trying to speed up a query to produce this file access summary output below:
cfile_name | project_ids | sample_names | permissions |
---------------------+-------------+--------------------+----------------------------+
sampleA_aligned.bam | ["Proj1"] | ["sampA"] | usr1@svr.com |
sampleA_aligned.bai | ["Proj1"] | ["sampA"] | usr1@svr.com, usr2@svr.com |
sampBC_qc.pdf | ["Proj2"] | ["sampB", "sampC"] | |
We found that fetching jsonb (e.g. the cfiles property_values column) using a CTE pattern gave a significant performance gain, due to the deTOASTing of large jsonb data (see SO answer here), so I arrived at the query below:
WITH cf AS (
  SELECT cfiles.property_values AS prop_vals,
         string_agg(DISTINCT users.email, ', ') AS permissions,
         cfiles.name AS cfile_name
  FROM cfiles
  LEFT JOIN user_permissions ON user_permissions.cfile_id = cfiles.id
  LEFT JOIN users ON users.id = user_permissions.user_id
  -- I may need to filter here, e.g.:
  -- WHERE cfiles.name LIKE '%.bam'
  --   AND property_values ->> 'Sample Names' LIKE '%A%'
  --   AND users.email = 'usr1@svr.com' ...
  GROUP BY property_values, cfile_name
  ORDER BY permissions NULLS LAST -- I may need to order by the aggregated field
  LIMIT 20                        -- I may need to paginate
)
SELECT cfile_name,
       prop_vals ->> 'Project IDs'  AS project_ids,
       prop_vals ->> 'Sample Names' AS sample_names,
       permissions
FROM cf;
The problem is that with 46K cfiles, the GROUP BY above takes almost 20 seconds; query plan below:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on cf (cost=19356.80..19357.15 rows=20 width=119) (actual time=19958.754..19958.829 rows=20 loops=1)
Output: cf.cfile_name, cf.permissions, (cf.prop_vals ->> 'Project IDs'::text), (cf.prop_vals ->> 'Sample Names'::text)
-> Limit (cost=19356.80..19356.85 rows=20 width=967) (actual time=19958.738..19958.754 rows=20 loops=1)
Output: cfiles.property_values, (string_agg(DISTINCT (users.email)::text, ', '::text)), cfiles.name, cfiles.property_values
-> Sort (cost=19356.80..19392.77 rows=14388 width=967) (actual time=19958.734..19958.745 rows=20 loops=1)
Output: cfiles.property_values, (string_agg(DISTINCT (users.email)::text, ', '::text)), cfiles.name, cfiles.property_values
Sort Key: (string_agg(DISTINCT (users.email)::text, ', '::text))
Sort Method: top-N heapsort Memory: 35kB
-> GroupAggregate (cost=18351.37..18973.94 rows=14388 width=967) (actual time=13668.666..19928.374 rows=20283 loops=1)
Output: cfiles.property_values, string_agg(DISTINCT (users.email)::text, ', '::text), cfiles.name, cfiles.property_values
Group Key: cfiles.property_values, cfiles.name
-> Sort (cost=18351.37..18462.05 rows=44272 width=506) (actual time=13668.592..18318.012 rows=44257 loops=1)
Output: cfiles.name, cfiles.property_values, users.email
Sort Key: cfiles.property_values, cfiles.name
Sort Method: external merge Disk: 21664kB
-> Hash Left Join (cost=3.19..4795.38 rows=44272 width=506) (actual time=0.213..82.987 rows=44257 loops=1)
Output: cfiles.name, cfiles.property_values, users.email
Hash Cond: (cfiles.id = user_permissions.cfile_id)
-> Seq Scan on public.cfiles (cost=0.00..4570.70 rows=44272 width=487) (actual time=0.046..34.417 rows=44255 loops=1)
Output: cfiles.id, cfiles.tid, cfiles.uuid, cfiles.dataset_id, cfiles.path, cfiles.name, cfiles.checksum, cfiles.size, cfiles.last_modified, cfiles.content_type, cfiles.locked, cfiles.property_values, cfiles.created_at, cfiles.updated_at
Filter: (cfiles.tid = 5)
Rows Removed by Filter: 1561
-> Hash (cost=2.99..2.99 rows=16 width=35) (actual time=0.142..0.146 rows=16 loops=1)
Output: user_permissions.cfile_id, users.email
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Hash Left Join (cost=1.79..2.99 rows=16 width=35) (actual time=0.100..0.130 rows=19 loops=1)
Output: user_permissions.cfile_id, users.email
Inner Unique: true
Hash Cond: (user_permissions.user_id = users.id)
-> Seq Scan on public.user_permissions (cost=0.00..1.16 rows=16 width=16) (actual time=0.010..0.016 rows=19 loops=1)
Output: user_permissions.id, user_permissions.tid, user_permissions.user_id, user_permissions.dataset_id, user_permissions.cfile_id, user_permissions.read, user_permissions.share, user_permissions.write_meta, user_permissions.manage_files, user_permissions.delete_files, user_permissions.task_id, user_permissions.notified_at, user_permissions.first_downloaded_at, user_permissions.created_at, user_permissions.updated_at
-> Hash (cost=1.35..1.35 rows=35 width=35) (actual time=0.062..0.063 rows=35 loops=1)
Output: users.email, users.id
Buckets: 1024 Batches: 1 Memory Usage: 11kB
-> Seq Scan on public.users (cost=0.00..1.35 rows=35 width=35) (actual time=0.012..0.031 rows=35 loops=1)
Output: users.email, users.id
Planning Time: 1.473 ms
Execution Time: 19963.496 ms
If I take out the aggregation and grouping, the query executes in just 60 ms, e.g.:
...
-- string_agg(DISTINCT users.email, ', ') as permissions,
users.email as permissions,
...
-- GROUP BY property_values, cfile_name
...
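For clarity, the full non-aggregated variant (the fragments above spliced into the original query) looks like this:

```sql
-- Same CTE shape, but without the aggregate: one row per (cfile, user) pair.
WITH cf AS (
  SELECT cfiles.property_values AS prop_vals,
         users.email AS permissions,
         cfiles.name AS cfile_name
  FROM cfiles
  LEFT JOIN user_permissions ON user_permissions.cfile_id = cfiles.id
  LEFT JOIN users ON users.id = user_permissions.user_id
  ORDER BY permissions NULLS LAST
  LIMIT 20
)
SELECT cfile_name,
       prop_vals ->> 'Project IDs'  AS project_ids,
       prop_vals ->> 'Sample Names' AS sample_names,
       permissions
FROM cf;
```

Of course this no longer collapses multiple users into one row per file, which is the whole point of the aggregation.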
Is there any way I can restructure this query to keep the CTE pattern for the jsonb performance, but replace the GROUP BY or improve its performance?
I still need to be able to filter and sort on all fields (as noted in the comments).
I found this Stack Exchange answer about replacing GROUP BY with subqueries on SQL Server, and was hoping I might be able to use a LATERAL join in Postgres, but I couldn't get it to aggregate and group the rows.
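For reference, the shape I was attempting was roughly the following (an untested sketch: aggregate the emails in a correlated LATERAL subquery so the grouping never has to touch the wide jsonb column):

```sql
-- Sketch: per-file aggregation via LATERAL instead of a global GROUP BY.
-- string_agg over zero rows yields NULL, preserving the LEFT JOIN semantics.
SELECT cfiles.name                               AS cfile_name,
       cfiles.property_values ->> 'Project IDs'  AS project_ids,
       cfiles.property_values ->> 'Sample Names' AS sample_names,
       p.permissions
FROM cfiles
LEFT JOIN LATERAL (
    SELECT string_agg(DISTINCT users.email, ', ') AS permissions
    FROM user_permissions
    JOIN users ON users.id = user_permissions.user_id
    WHERE user_permissions.cfile_id = cfiles.id
) p ON true
ORDER BY p.permissions NULLS LAST
LIMIT 20;
```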
Updates:
- I tried SET work_mem TO '100MB'; that improved things by ~1 second.
- I tried adding indexes on cfiles((property_values ->> 'Project IDs')), cfiles((property_values ->> 'Sample Names')) and cfiles(name); that improved things by another ~1 second.
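For reference, the corresponding index definitions (expression indexes over the extracted jsonb keys, plus a plain index on name):

```sql
-- Expression indexes on the jsonb keys used in filters
CREATE INDEX ON cfiles ((property_values ->> 'Project IDs'));
CREATE INDEX ON cfiles ((property_values ->> 'Sample Names'));
-- Plain b-tree index on the file name
CREATE INDEX ON cfiles (name);
```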
The query below appears to be getting me what I want, and in just 70 ms; I'll update after I've tested it more.
WITH cf AS (
  SELECT cfiles.property_values AS prop_vals,
         users.email,
         cfiles.name AS cfile_name,
         cfiles.id AS cfile_id
  FROM cfiles
  LEFT JOIN user_permissions ON user_permissions.cfile_id = cfiles.id
  LEFT JOIN users ON users.id = user_permissions.user_id
  ORDER BY email NULLS LAST
  LIMIT 20
)
SELECT cf.cfile_name,
       cf.prop_vals ->> 'Project IDs'  AS project_ids,
       cf.prop_vals ->> 'Sample Names' AS sample_names,
       h.permissions
FROM (
  SELECT cfile_id,
         string_agg(DISTINCT user_email, ', ') AS permissions
  FROM (
    SELECT cf.cfile_id,
           users.email AS user_email
    FROM cf
    LEFT JOIN user_permissions ON user_permissions.cfile_id = cf.cfile_id
    LEFT JOIN users ON users.id = user_permissions.user_id
  ) g
  GROUP BY cfile_id
) h
JOIN cf ON cf.cfile_id = h.cfile_id;