
I have a biggish table in PostgreSQL 15.1 — maybe 50 million rows and growing. A column mmsi has about 30k distinct values, so 1000+ rows per mmsi.

My problem is that I have a query that I need to execute repeatedly during DB load, and for certain values of mmsi it takes hundreds of seconds instead of milliseconds. The model query is simply

SELECT max(to_timestamp) FROM track WHERE mmsi = :mmsi

The anomaly is visible in the EXPLAIN output. The bad case (which only happens for a small fraction of mmsi values):

trackdb=# EXPLAIN SELECT max(to_timestamp) FROM track WHERE mmsi = 354710000;
QUERY PLAN
----------
Result  (cost=413.16..413.17 rows=1 width=8)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.56..413.16 rows=1 width=8)
          ->  Index Scan Backward using ix_track_to_timestamp on track  (cost=0.56..3894939.14 rows=9440 width=8)
                Index Cond: (to_timestamp IS NOT NULL)
                Filter: (mmsi = 354710000)
(6 rows)

Good case (the vast majority):

trackdb=# EXPLAIN SELECT max(to_timestamp) FROM track WHERE mmsi = 354710001;
QUERY PLAN
----------
Aggregate  (cost=1637.99..1638.00 rows=1 width=8)
  ->  Index Scan using ix_track_mmsi on track  (cost=0.44..1635.28 rows=1082 width=8)
        Index Cond: (mmsi = 354710001)
(3 rows)

Now, I notice that the estimated number of rows is larger in the bad case. I cannot see anything in the PostgreSQL statistics (pg_stats.histogram_bounds) to explain this.
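
For reference, I was looking at the per-column statistics with something along these lines (the exact query may differ):

-- Per-column planner statistics for track.mmsi.
SELECT n_distinct, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'track' AND attname = 'mmsi';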

The problem seems to change when I ANALYZE the table, in that the specific values that trigger the problem become different. But in any case, since the query has to run during DB load, ANALYZE is not a solution.

I'm stumped. Does anyone have an idea what could be happening?

[Edit: To clarify, I know ways to work around it, for example by materializing the rows before applying max. But not understanding makes me very unhappy.]
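
One way to do that materialization, as a sketch (the MATERIALIZED keyword, available since PostgreSQL 12, forces the CTE to be evaluated on its own, so the planner cannot turn the max() into a backward scan of ix_track_to_timestamp):

-- Materialize the matching rows first, then aggregate over them.
WITH t AS MATERIALIZED (
    SELECT to_timestamp FROM track WHERE mmsi = :mmsi
)
SELECT max(to_timestamp) FROM t;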

Joachim Berdal Haga
    354710000 is a frequent value, and PostgreSQL thinks that it is frequent enough that it is worth scanning the table in backwards `to_timestamp` order until it finds a matching row. – Laurenz Albe Dec 05 '22 at 17:39
  • Ah, thank you Laurenz, I somehow missed that the statistics have a separate list of `most_common_vals`. Their frequencies are wildly overestimated, and the plan would be worse even if they were correct; but these are "understandable" problems. – Joachim Berdal Haga Dec 05 '22 at 17:48
  • You could try increasing the statistics target for that table to get better estimates (see the sketch after these comments). – a_horse_with_no_name Dec 05 '22 at 20:46
  • @a_horse_with_no_name — bizarrely, increasing the statistics target made the problem worse, because more values were stored as "common" and hence singled out for the bad plan. Which I guess is a sign that the plan is fragile, so adding a special-case index (like jjanes recommended) is necessary. – Joachim Berdal Haga Dec 06 '22 at 10:21
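
(The statistics-target suggestion from the comments amounts to something like this; the value 1000 is only an example, and the new target only takes effect at the next ANALYZE:)

-- Raise the per-column statistics target for mmsi, then re-analyze.
ALTER TABLE track ALTER COLUMN mmsi SET STATISTICS 1000;
ANALYZE track;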

1 Answer


As Laurenz has explained, the problem is that PostgreSQL thinks the approx. 10,000 rows where mmsi = 354710000 are scattered randomly over the values of to_timestamp. It therefore assumes that by scanning the index on to_timestamp backwards (from the highest value down), it can stop as soon as it finds the first row meeting mmsi = 354710000, and that this will happen quickly: with roughly 50 million rows and an estimated 9,440 matches spread uniformly, it expects to hit one after reading only about five thousand index entries. But if all the mmsi = 354710000 rows are at the wrong end of the index, it does not in fact happen quickly. There is nothing you can do about this in the stats, as there are no "handles" the planner can grab onto to better inform its thinking. Maybe some future extension to the custom statistics (CREATE STATISTICS) feature will do it.

> Edit: To clarify, I know ways to work around it, for example by materializing the rows before applying max.

A better workaround would probably be an index on (mmsi, to_timestamp). This would not only fix the cases where it currently chooses a very bad plan, it would also substantially improve the cases currently using a tolerable plan, by giving them an even better option. And you don't need to rewrite the query. And you can drop the existing index on just mmsi, as there is no reason to have both.
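
A sketch of that (the name of the new index is just a placeholder; CONCURRENTLY avoids blocking concurrent writes during the load, but cannot run inside a transaction block):

-- Composite index serving both the mmsi filter and max(to_timestamp).
CREATE INDEX CONCURRENTLY ix_track_mmsi_to_timestamp ON track (mmsi, to_timestamp);

-- The single-column index on mmsi is then redundant.
DROP INDEX CONCURRENTLY ix_track_mmsi;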

jjanes
  • Yes, that makes sense. The overall distribution of `to_timestamp` is reasonably uniform, but that is not the case during loading. Due to the loading order (monthly batches, ordered by `mmsi`), there will typically be a few million rows to chew through in that backwards scan. – Joachim Berdal Haga Dec 06 '22 at 10:08