I have the following simplified schema in Postgres 12, whereby a user has access to many cfiles and every cfile can be accessed by many users.
SELECT * FROM cfiles
 id | name                | property_values (jsonb)                                        |
----+---------------------+----------------------------------------------------------------+
  1 | sampleA_aligned.bam | {"Project IDs": ["Proj1"], "Sample Names": ["sampA"]}          |
  2 | sampleA_aligned.bai | {"Project IDs": ["Proj1"], "Sample Names": ["sampA"]}          |
  3 | sampBC_qc.pdf       | {"Project IDs": ["Proj2"], "Sample Names": ["sampB", "sampC"]} |
SELECT * FROM users
id | email |
----+--------------+
1 | usr1@svr.com |
2 | usr2@svr.com |
SELECT * FROM user_permissions
id | cfile_id | user_id |
----+----------+---------+
1 | 1 | 1 |
2 | 2 | 1 |
3 | 2 | 2 |
I am trying to speed up a query to produce this file access summary output below:
cfile_name | project_ids | sample_names | permissions |
---------------------+-------------+--------------------+----------------------------+
sampleA_aligned.bam | ["Proj1"] | ["sampA"] | usr1@svr.com |
sampleA_aligned.bai | ["Proj1"] | ["sampA"] | usr1@svr.com, usr2@svr.com |
sampBC_qc.pdf | ["Proj2"] | ["sampB", "sampC"] | |
We found that fetching jsonb (e.g. the cfiles property_values column) using a CTE pattern gave a significant performance gain, due to the deTOASTing of large jsonb data (see SO answer here), so I arrived at the query below:
WITH cf AS (
  SELECT cfiles.property_values AS prop_vals,
         string_agg(DISTINCT users.email, ', ') AS permissions,
         cfiles.name AS cfile_name
  FROM cfiles
  LEFT JOIN user_permissions ON user_permissions.cfile_id = cfiles.id
  LEFT JOIN users ON users.id = user_permissions.user_id
  -- I may need to filter here, e.g.:
  -- WHERE cfiles.name LIKE '%.bam'
  --   AND property_values ->> 'Sample Names' LIKE '%A%'
  --   AND users.email = 'usr1@svr.com' ...
  GROUP BY property_values, cfile_name
  ORDER BY permissions NULLS LAST -- I may need to order by the aggregated field
  LIMIT 20                        -- I may need to paginate
)
SELECT cfile_name,
       prop_vals ->> 'Project IDs'  AS project_ids,
       prop_vals ->> 'Sample Names' AS sample_names,
       permissions
FROM cf;
The problem is that with 46K cfiles, the GROUP BY above takes almost 20 seconds; query plan below:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on cf (cost=19356.80..19357.15 rows=20 width=119) (actual time=19958.754..19958.829 rows=20 loops=1)
Output: cf.cfile_name, cf.permissions, (cf.prop_vals ->> 'Project IDs'::text), (cf.prop_vals ->> 'Sample Names'::text)
-> Limit (cost=19356.80..19356.85 rows=20 width=967) (actual time=19958.738..19958.754 rows=20 loops=1)
Output: cfiles.property_values, (string_agg(DISTINCT (users.email)::text, ', '::text)), cfiles.name, cfiles.property_values
-> Sort (cost=19356.80..19392.77 rows=14388 width=967) (actual time=19958.734..19958.745 rows=20 loops=1)
Output: cfiles.property_values, (string_agg(DISTINCT (users.email)::text, ', '::text)), cfiles.name, cfiles.property_values
Sort Key: (string_agg(DISTINCT (users.email)::text, ', '::text))
Sort Method: top-N heapsort Memory: 35kB
-> GroupAggregate (cost=18351.37..18973.94 rows=14388 width=967) (actual time=13668.666..19928.374 rows=20283 loops=1)
Output: cfiles.property_values, string_agg(DISTINCT (users.email)::text, ', '::text), cfiles.name, cfiles.property_values
Group Key: cfiles.property_values, cfiles.name
-> Sort (cost=18351.37..18462.05 rows=44272 width=506) (actual time=13668.592..18318.012 rows=44257 loops=1)
Output: cfiles.name, cfiles.property_values, users.email
Sort Key: cfiles.property_values, cfiles.name
Sort Method: external merge Disk: 21664kB
-> Hash Left Join (cost=3.19..4795.38 rows=44272 width=506) (actual time=0.213..82.987 rows=44257 loops=1)
Output: cfiles.name, cfiles.property_values, users.email
Hash Cond: (cfiles.id = user_permissions.cfile_id)
-> Seq Scan on public.cfiles (cost=0.00..4570.70 rows=44272 width=487) (actual time=0.046..34.417 rows=44255 loops=1)
Output: cfiles.id, cfiles.tid, cfiles.uuid, cfiles.dataset_id, cfiles.path, cfiles.name, cfiles.checksum, cfiles.size, cfiles.last_modified, cfiles.content_type, cfiles.locked, cfiles.property_values, cfiles.created_at, cfiles.updated_at
Filter: (cfiles.tid = 5)
Rows Removed by Filter: 1561
-> Hash (cost=2.99..2.99 rows=16 width=35) (actual time=0.142..0.146 rows=16 loops=1)
Output: user_permissions.cfile_id, users.email
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Hash Left Join (cost=1.79..2.99 rows=16 width=35) (actual time=0.100..0.130 rows=19 loops=1)
Output: user_permissions.cfile_id, users.email
Inner Unique: true
Hash Cond: (user_permissions.user_id = users.id)
-> Seq Scan on public.user_permissions (cost=0.00..1.16 rows=16 width=16) (actual time=0.010..0.016 rows=19 loops=1)
Output: user_permissions.id, user_permissions.tid, user_permissions.user_id, user_permissions.dataset_id, user_permissions.cfile_id, user_permissions.read, user_permissions.share, user_permissions.write_meta, user_permissions.manage_files, user_permissions.delete_files, user_permissions.task_id, user_permissions.notified_at, user_permissions.first_downloaded_at, user_permissions.created_at, user_permissions.updated_at
-> Hash (cost=1.35..1.35 rows=35 width=35) (actual time=0.062..0.063 rows=35 loops=1)
Output: users.email, users.id
Buckets: 1024 Batches: 1 Memory Usage: 11kB
-> Seq Scan on public.users (cost=0.00..1.35 rows=35 width=35) (actual time=0.012..0.031 rows=35 loops=1)
Output: users.email, users.id
Planning Time: 1.473 ms
Execution Time: 19963.496 ms
If I take out the aggregation and grouping, the query executes in just 60 ms, e.g.:
...
-- string_agg(DISTINCT users.email, ', ') as permissions,
users.email as permissions,
...
-- GROUP BY property_values, cfile_name
...
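For clarity, the full non-aggregated variant (the fragments above spliced into the original query) looks like this:

```sql
-- Same CTE shape, but without the aggregate: one row per (cfile, user) pair.
WITH cf AS (
  SELECT cfiles.property_values AS prop_vals,
         users.email AS permissions,
         cfiles.name AS cfile_name
  FROM cfiles
  LEFT JOIN user_permissions ON user_permissions.cfile_id = cfiles.id
  LEFT JOIN users ON users.id = user_permissions.user_id
  ORDER BY permissions NULLS LAST
  LIMIT 20
)
SELECT cfile_name,
       prop_vals ->> 'Project IDs'  AS project_ids,
       prop_vals ->> 'Sample Names' AS sample_names,
       permissions
FROM cf;
```

Of course this no longer collapses multiple users into one row per file, which is the whole point of the aggregation.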
Is there any way I can restructure this query to keep the CTE pattern for the jsonb performance, but replace the GROUP BY or improve its performance?
I still need to be able to filter and sort on all fields (as noted in the comments).
I found this Stack Exchange answer about replacing GROUP BY with subqueries on SQL Server, and was hoping I might be able to use a LATERAL join in Postgres, but I couldn't get it to aggregate and group the rows.
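For reference, the shape I was attempting was roughly the following (an untested sketch: aggregate the emails in a correlated LATERAL subquery so the grouping never has to touch the wide jsonb column):

```sql
-- Sketch: per-file aggregation via LATERAL instead of a global GROUP BY.
-- string_agg over zero rows yields NULL, preserving the LEFT JOIN semantics.
SELECT cfiles.name                               AS cfile_name,
       cfiles.property_values ->> 'Project IDs'  AS project_ids,
       cfiles.property_values ->> 'Sample Names' AS sample_names,
       p.permissions
FROM cfiles
LEFT JOIN LATERAL (
    SELECT string_agg(DISTINCT users.email, ', ') AS permissions
    FROM user_permissions
    JOIN users ON users.id = user_permissions.user_id
    WHERE user_permissions.cfile_id = cfiles.id
) p ON true
ORDER BY p.permissions NULLS LAST
LIMIT 20;
```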
Updates:
- I tried SET work_mem TO '100MB'; that improved things by ~1 second.
- I tried adding indexes on cfiles((property_values ->> 'Project IDs')), cfiles((property_values ->> 'Sample Names')) and cfiles(name); that improved things by another ~1 second.
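For reference, the corresponding index definitions (expression indexes over the extracted jsonb keys, plus a plain index on name):

```sql
-- Expression indexes on the jsonb keys used in filters
CREATE INDEX ON cfiles ((property_values ->> 'Project IDs'));
CREATE INDEX ON cfiles ((property_values ->> 'Sample Names'));
-- Plain b-tree index on the file name
CREATE INDEX ON cfiles (name);
```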
The query below appears to be getting me what I want, and in just 70 ms; I'll update after I've tested it more.
WITH cf AS (
  SELECT cfiles.property_values AS prop_vals,
         users.email,
         cfiles.name AS cfile_name,
         cfiles.id AS cfile_id
  FROM cfiles
  LEFT JOIN user_permissions ON user_permissions.cfile_id = cfiles.id
  LEFT JOIN users ON users.id = user_permissions.user_id
  ORDER BY email NULLS LAST
  LIMIT 20
)
SELECT cf.cfile_name,
       cf.prop_vals ->> 'Project IDs'  AS project_ids,
       cf.prop_vals ->> 'Sample Names' AS sample_names,
       h.permissions
FROM (
  SELECT cfile_id,
         string_agg(DISTINCT user_email, ', ') AS permissions
  FROM (
    SELECT cf.cfile_id,
           users.email AS user_email
    FROM cf
    LEFT JOIN user_permissions ON user_permissions.cfile_id = cf.cfile_id
    LEFT JOIN users ON users.id = user_permissions.user_id
  ) g
  GROUP BY cfile_id
) h
JOIN cf ON cf.cfile_id = h.cfile_id;