While looking for improvements in the analytics queries of a small Postgres 10 data warehouse, I found a rather slow query whose possible improvement basically boils down to this subquery (a classic greatest-n-per-group problem):
SELECT s_postings.*
FROM dwh.s_postings
     JOIN (SELECT s_postings.id,
                  max(s_postings.load_dts) AS load_dts
           FROM dwh.s_postings
           GROUP BY s_postings.id) AS current_postings
       ON s_postings.id = current_postings.id
      AND s_postings.load_dts = current_postings.load_dts
With the following execution plan:
"Gather (cost=23808.51..38602.59 rows=66 width=376) (actual time=1385.927..1810.844 rows=170847 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Hash Join (cost=22808.51..37595.99 rows=28 width=376) (actual time=1199.647..1490.652 rows=56949 loops=3)"
" Hash Cond: (((s_postings.id)::text = (s_postings_1.id)::text) AND (s_postings.load_dts = (max(s_postings_1.load_dts))))"
" -> Parallel Seq Scan on s_postings (cost=0.00..14113.25 rows=128425 width=376) (actual time=0.016..73.604 rows=102723 loops=3)"
" -> Hash (cost=20513.00..20513.00 rows=153034 width=75) (actual time=1195.616..1195.616 rows=170847 loops=3)"
" Buckets: 262144 Batches: 1 Memory Usage: 20735kB"
" -> HashAggregate (cost=17452.32..18982.66 rows=153034 width=75) (actual time=836.694..1015.499 rows=170847 loops=3)"
" Group Key: s_postings_1.id"
" -> Seq Scan on s_postings s_postings_1 (cost=0.00..15911.21 rows=308221 width=75) (actual time=0.032..251.122 rows=308168 loops=3)"
"Planning time: 1.184 ms"
"Execution time: 1912.865 ms"
The row estimate is absolutely wrong! The weird thing for me is that if I change the join to a RIGHT JOIN:
SELECT s_postings.*
FROM dwh.s_postings
     RIGHT JOIN (SELECT s_postings.id,
                        max(s_postings.load_dts) AS load_dts
                 FROM dwh.s_postings
                 GROUP BY s_postings.id) AS current_postings
       ON s_postings.id = current_postings.id
      AND s_postings.load_dts = current_postings.load_dts
With the execution plan:
"Hash Right Join (cost=22829.85..40375.62 rows=153177 width=376) (actual time=814.097..1399.673 rows=170848 loops=1)"
" Hash Cond: (((s_postings.id)::text = (s_postings_1.id)::text) AND (s_postings.load_dts = (max(s_postings_1.load_dts))))"
" -> Seq Scan on s_postings (cost=0.00..15926.10 rows=308510 width=376) (actual time=0.011..144.584 rows=308419 loops=1)"
" -> Hash (cost=20532.19..20532.19 rows=153177 width=75) (actual time=812.587..812.587 rows=170848 loops=1)"
" Buckets: 262144 Batches: 1 Memory Usage: 20735kB"
" -> HashAggregate (cost=17468.65..19000.42 rows=153177 width=75) (actual time=553.633..683.850 rows=170848 loops=1)"
" Group Key: s_postings_1.id"
" -> Seq Scan on s_postings s_postings_1 (cost=0.00..15926.10 rows=308510 width=75) (actual time=0.011..157.000 rows=308419 loops=1)"
"Planning time: 0.402 ms"
"Execution time: 1469.808 ms"
The row estimate is way better!
I am aware that parallel sequential scans can, under some conditions, decrease performance, but they should not change the row estimate, should they?
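(Just to take parallelism out of the picture, the plan could be re-checked with parallel workers disabled for the session; a minimal sketch using the standard max_parallel_workers_per_gather setting:)

-- Sketch: disable parallel workers for this session and re-run the EXPLAIN,
-- to check whether the misestimate has anything to do with the parallel plan shape.
SET max_parallel_workers_per_gather = 0;

EXPLAIN ANALYZE
SELECT s_postings.*
FROM dwh.s_postings
     JOIN (SELECT id, max(load_dts) AS load_dts
           FROM dwh.s_postings
           GROUP BY id) AS current_postings
       ON s_postings.id = current_postings.id
      AND s_postings.load_dts = current_postings.load_dts;

RESET max_parallel_workers_per_gather;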
If I remember correctly, aggregate functions also block the proper use of indexes anyway, and I don't see any potential gains from additional multivariate statistics, e.g. for the tuple (id, load_dts). The database is VACUUM ANALYZEd.
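(For reference, this is roughly what such extended statistics would look like in Postgres 10, which only offers the ndistinct and dependencies kinds; the statistics name below is just an example:)

-- Sketch: extended statistics on (id, load_dts). Neither ndistinct nor
-- dependencies describes the "load_dts = max(load_dts) per id" relationship
-- that the join condition relies on, so I don't expect a better estimate.
CREATE STATISTICS s_postings_id_load_dts (ndistinct, dependencies)
    ON id, load_dts FROM dwh.s_postings;
ANALYZE dwh.s_postings;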
To me, the queries are logically the same.
Is there a way to help the query planner make better assumptions about the estimates, or to improve the query? Maybe somebody knows a reason why this difference exists?
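(The only generic knob I can think of is raising the per-column statistics target before re-analyzing, roughly as below with an arbitrary target of 1000, although it is not obvious that this addresses the estimate for the MAX() join condition:)

-- Sketch: larger per-column statistics targets give ANALYZE bigger samples
-- and longer MCV/histogram lists for the join columns.
ALTER TABLE dwh.s_postings ALTER COLUMN id SET STATISTICS 1000;
ALTER TABLE dwh.s_postings ALTER COLUMN load_dts SET STATISTICS 1000;
ANALYZE dwh.s_postings;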
Edit: Previously the join condition was ON s_postings.id::text = current_postings.id::text. I changed it to ON s_postings.id = current_postings.id so as not to confuse anybody; removing the cast does not change the query plan.
Edit2: As suggested below, there is a different solution to the greatest-n-per-group problem:
SELECT p.*
FROM (SELECT p.*,
             RANK() OVER (PARTITION BY p.id ORDER BY p.load_dts DESC) AS seqnum
      FROM dwh.s_postings p
     ) p
WHERE seqnum = 1;
A really nice solution but sadly the query planner also underestimates the row count:
"Subquery Scan on p (cost=44151.67..54199.31 rows=1546 width=384) (actual time=1742.902..2594.359 rows=171269 loops=1)"
" Filter: (p.seqnum = 1)"
" Rows Removed by Filter: 137803"
" -> WindowAgg (cost=44151.67..50334.83 rows=309158 width=384) (actual time=1742.899..2408.240 rows=309072 loops=1)"
" -> Sort (cost=44151.67..44924.57 rows=309158 width=376) (actual time=1742.887..1927.325 rows=309072 loops=1)"
" Sort Key: p_1.id, p_1.load_dts DESC"
" Sort Method: quicksort Memory: 172275kB"
" -> Seq Scan on s_postings p_1 (cost=0.00..15959.58 rows=309158 width=376) (actual time=0.007..221.240 rows=309072 loops=1)"
"Planning time: 0.149 ms"
"Execution time: 2666.645 ms"