While looking for improvements in the analytics queries of a small Postgres 10 data warehouse, I found a rather slow query whose possible improvement basically boils down to this subquery (a classic greatest-n-per-group problem):
SELECT s_postings.*
FROM dwh.s_postings
     JOIN (SELECT s_postings.id,
                  max(s_postings.load_dts) AS load_dts
           FROM dwh.s_postings
           GROUP BY s_postings.id) AS current_postings
       ON s_postings.id = current_postings.id
      AND s_postings.load_dts = current_postings.load_dts
With the following execution plan:
"Gather (cost=23808.51..38602.59 rows=66 width=376) (actual time=1385.927..1810.844 rows=170847 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Hash Join (cost=22808.51..37595.99 rows=28 width=376) (actual time=1199.647..1490.652 rows=56949 loops=3)"
" Hash Cond: (((s_postings.id)::text = (s_postings_1.id)::text) AND (s_postings.load_dts = (max(s_postings_1.load_dts))))"
" -> Parallel Seq Scan on s_postings (cost=0.00..14113.25 rows=128425 width=376) (actual time=0.016..73.604 rows=102723 loops=3)"
" -> Hash (cost=20513.00..20513.00 rows=153034 width=75) (actual time=1195.616..1195.616 rows=170847 loops=3)"
" Buckets: 262144 Batches: 1 Memory Usage: 20735kB"
" -> HashAggregate (cost=17452.32..18982.66 rows=153034 width=75) (actual time=836.694..1015.499 rows=170847 loops=3)"
" Group Key: s_postings_1.id"
" -> Seq Scan on s_postings s_postings_1 (cost=0.00..15911.21 rows=308221 width=75) (actual time=0.032..251.122 rows=308168 loops=3)"
"Planning time: 1.184 ms"
"Execution time: 1912.865 ms"
The row estimate is absolutely wrong! The weird thing for me is that if I change the join to a RIGHT JOIN:
SELECT s_postings.*
FROM dwh.s_postings
     RIGHT JOIN (SELECT s_postings.id,
                        max(s_postings.load_dts) AS load_dts
                 FROM dwh.s_postings
                 GROUP BY s_postings.id) AS current_postings
       ON s_postings.id = current_postings.id
      AND s_postings.load_dts = current_postings.load_dts
With the execution plan:
"Hash Right Join (cost=22829.85..40375.62 rows=153177 width=376) (actual time=814.097..1399.673 rows=170848 loops=1)"
" Hash Cond: (((s_postings.id)::text = (s_postings_1.id)::text) AND (s_postings.load_dts = (max(s_postings_1.load_dts))))"
" -> Seq Scan on s_postings (cost=0.00..15926.10 rows=308510 width=376) (actual time=0.011..144.584 rows=308419 loops=1)"
" -> Hash (cost=20532.19..20532.19 rows=153177 width=75) (actual time=812.587..812.587 rows=170848 loops=1)"
" Buckets: 262144 Batches: 1 Memory Usage: 20735kB"
" -> HashAggregate (cost=17468.65..19000.42 rows=153177 width=75) (actual time=553.633..683.850 rows=170848 loops=1)"
" Group Key: s_postings_1.id"
" -> Seq Scan on s_postings s_postings_1 (cost=0.00..15926.10 rows=308510 width=75) (actual time=0.011..157.000 rows=308419 loops=1)"
"Planning time: 0.402 ms"
"Execution time: 1469.808 ms"
The row estimate is way better!
I am aware that parallel sequential scans can, under some conditions, decrease performance, but they should not change the row estimate, should they?
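(Just to take parallelism out of the picture, the plan could be re-checked with parallel workers disabled for the session; a minimal sketch using the standard max_parallel_workers_per_gather setting:)

-- Sketch: disable parallel workers for this session and re-run the EXPLAIN,
-- to check whether the misestimate has anything to do with the parallel plan shape.
SET max_parallel_workers_per_gather = 0;

EXPLAIN ANALYZE
SELECT s_postings.*
FROM dwh.s_postings
     JOIN (SELECT id, max(load_dts) AS load_dts
           FROM dwh.s_postings
           GROUP BY id) AS current_postings
       ON s_postings.id = current_postings.id
      AND s_postings.load_dts = current_postings.load_dts;

RESET max_parallel_workers_per_gather;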
If I remember correctly, aggregate functions also block the proper use of indexes anyway, and I don't see any potential gains from additional multivariate statistics, e.g. for the tuple (id, load_dts). The database is VACUUM ANALYZEd.
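(For reference, this is roughly what such extended statistics would look like in Postgres 10, which only offers the ndistinct and dependencies kinds; the statistics name below is just an example:)

-- Sketch: extended statistics on (id, load_dts). Neither ndistinct nor
-- dependencies describes the "load_dts = max(load_dts) per id" relationship
-- that the join condition relies on, so I don't expect a better estimate.
CREATE STATISTICS s_postings_id_load_dts (ndistinct, dependencies)
    ON id, load_dts FROM dwh.s_postings;
ANALYZE dwh.s_postings;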
To me, the queries are logically the same.
Is there a way to help the query planner make better assumptions about the estimates, or to improve the query? Maybe somebody knows a reason why this difference exists?
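(The only generic knob I can think of is raising the per-column statistics target before re-analyzing, roughly as below with an arbitrary target of 1000, although it is not obvious that this addresses the estimate for the MAX() join condition:)

-- Sketch: larger per-column statistics targets give ANALYZE bigger samples
-- and longer MCV/histogram lists for the join columns.
ALTER TABLE dwh.s_postings ALTER COLUMN id SET STATISTICS 1000;
ALTER TABLE dwh.s_postings ALTER COLUMN load_dts SET STATISTICS 1000;
ANALYZE dwh.s_postings;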
Edit: Previously the join condition was ON s_postings.id::text = current_postings.id::text. I changed it to ON s_postings.id = current_postings.id so as not to confuse anybody; removing the cast does not change the query plan.
Edit2: As suggested below, there is a different solution to the greatest-n-per-group problem:
SELECT p.*
FROM (SELECT p.*,
             RANK() OVER (PARTITION BY p.id ORDER BY p.load_dts DESC) AS seqnum
      FROM dwh.s_postings p
     ) p
WHERE seqnum = 1;
A really nice solution but sadly the query planner also underestimates the row count:
"Subquery Scan on p (cost=44151.67..54199.31 rows=1546 width=384) (actual time=1742.902..2594.359 rows=171269 loops=1)"
" Filter: (p.seqnum = 1)"
" Rows Removed by Filter: 137803"
" -> WindowAgg (cost=44151.67..50334.83 rows=309158 width=384) (actual time=1742.899..2408.240 rows=309072 loops=1)"
" -> Sort (cost=44151.67..44924.57 rows=309158 width=376) (actual time=1742.887..1927.325 rows=309072 loops=1)"
" Sort Key: p_1.id, p_1.load_dts DESC"
" Sort Method: quicksort Memory: 172275kB"
" -> Seq Scan on s_postings p_1 (cost=0.00..15959.58 rows=309158 width=376) (actual time=0.007..221.240 rows=309072 loops=1)"
"Planning time: 0.149 ms"
"Execution time: 2666.645 ms"