
Background

I'm getting insanely slow queries when using small LIMIT clause values on a relatively simple query.

I've already read and re-read PostgreSQL query very slow with limit 1 (not that I couldn't have missed anything, but it's related, and folks shouldn't just direct me there without giving me a clue as to what I might be missing first).

I'm definitely encountering essentially the bug referred to there: for small values of LIMIT, this query takes around 7,409,626 µs vs. 53 µs.

Simply changing LIMIT from 1 to 1000 gives me the instantaneous speed; dropping it back to 10 or 1 gives me the OMGBBQ-what-is-wrong-here speed.

I tried applying the basic advice from the above linked SO: add an extra useless ORDER BY column to the query to trick the planner.

However, in my case, the planner still picks the slow plan even if I try to put the main query, without the limit, in a WITH clause!

The Query

select id 
from rounds 
where userid = (
  select id
  from users
  where integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2') 
order by created desc 
limit 1;

Some explain-analyze results:

Naive query with Limit = 1

explain analyze select id from rounds where userid = (select id from users where integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2') order by created desc, userid limit 1;
                 QUERY PLAN
------------------------------------------------
 Limit  (cost=3.07..47.03 rows=1 width=40) (actual time=7408.097..7408.099 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using users_integrationuserid_idx on users  (cost=0.41..2.63 rows=1 width=16) (actual time=0.013..0.014 rows=1 loops=1)
           Index Cond: (integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2'::text)
   ->  Index Scan using recent_rounds_idx on rounds  (cost=0.44..938182.73 rows=21339 width=40) (actual time=7408.096..7408.096 rows=1 loops=1)
         Filter: (userid = $0)
         Rows Removed by Filter: 23123821
 Planning Time: 0.133 ms
 Execution Time: 7408.114 ms
(9 rows)

vs. Limit = 1000 (arbitrary, just to see what happens)

explain analyze select id from rounds where userid = (select id from users where integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2') order by created desc, userid limit 1000;
                    QUERY PLAN
------------------------------------------------
 Limit  (cost=24163.47..24165.97 rows=1000 width=40) (actual time=0.048..0.049 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using users_integrationuserid_idx on users  (cost=0.41..2.63 rows=1 width=16) (actual time=0.018..0.019 rows=1 loops=1)
           Index Cond: (integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2'::text)
   ->  Sort  (cost=24160.84..24214.18 rows=21339 width=40) (actual time=0.047..0.048 rows=1 loops=1)
         Sort Key: rounds.created DESC
         Sort Method: quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on rounds  (cost=226.44..22990.84 rows=21339 width=40) (actual time=0.043..0.043 rows=1 loops=1)
               Recheck Cond: (userid = $0)
               Heap Blocks: exact=1
               ->  Bitmap Index Scan on rounds_userid_idx  (cost=0.00..221.10 rows=21339 width=0) (actual time=0.040..0.040 rows=1 loops=1)
                     Index Cond: (userid = $0)
 Planning Time: 0.108 ms
 Execution Time: 0.068 ms
(14 rows)

My basic questions are:

  1. Why is it so omg bad in the first place (in what universe would scanning all rows in the entire db beat scanning a smaller subset after applying the where clause)?
  2. How can I work around this?

AFAICT - I need the planner to reduce the original table to just the rows that match the WHERE clause - then apply the sort & limit to that.

But instead it applies the sort & limit to the entire table - in this case - around 23M items - and with unsurprisingly horrible results.

I've tried dozens of variations that create a subquery to extract the user's rounds first, and then apply the limit to that. But again, the planner sees through this and applies the limit to the original full 23M rows.

Additional Attempts / Information

This article indicates that the original answer (to my initial link, above) no longer works as of PostgreSQL 13, and suggests using a CTE instead.

Postgres SLOWER when a LIMIT is set: how to fix besides adding a dummy `ORDER BY`?

However, that was basically my first instinct - and all uses of a CTE have failed for me.

One attempt with a CTE: (VERY SLOW!)

with r as (
  select id, created 
  from rounds 
  where userid = (
    select id
    from users 
    where integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2')
) 
select r.id from r order by r.created desc limit 1;

Maybe moving the ORDER BY and LIMIT around at random helps? (NO!)

with r as (
   select id, created from rounds 
   where userid = (select id from users where integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2') 
   order by created desc
)
select r.id from r limit 1;

Solution (thanks to @jjanes for pointing me to it, and to @denis-de-bernardy for sharing it in the first place):

create index recent_rounds_idx on rounds (userid, created desc);

Mordachai
  • And just to be clear: changing that subquery to be a static value helps not at all. As you can see in the planner, it takes next to nothing to do that lookup - and simplifying it has no bearing on the planner for the actual query. It just always scans 23M rows. >:( – Mordachai Jun 26 '23 at 21:01
  • As the accepted answer to the linked question says, build a multicolumn index. In this case, `(userid, created)` – jjanes Jun 26 '23 at 22:06
  • @jjanes thank you! I had read the explanation, but it was not as upvoted as the one I tried - which originally - appeared to solve the problem! At any rate, I've edited that answer so you can see it more readily. [btw, I linked two different SO questions - and the accepted answer to the second one does not work either. It was not at all clear from your comment to which you were referring.] – Mordachai Jun 26 '23 at 23:26
  • Does the index on `rounds (userid, created)` exist? – The Impaler Jun 26 '23 at 23:37
  • @TheImpaler adding that was literally the solution to the problem. – Mordachai Jun 27 '23 at 15:33

2 Answers


The core problem is the bad estimate for the index scan: 21339 rows estimated, when there is actually only one. ANALYZE the table and see if that is enough to improve the estimate. If not, increase the statistics for the column:

ALTER TABLE rounds ALTER userid SET STATISTICS 500;
ANALYZE rounds;
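
To check whether the new statistics actually change the estimate, you can look at the plain EXPLAIN output for the same query (a sketch; no execution needed, just compare the estimated row count on the rounds scan against the real count of 1):

explain select id from rounds where userid = (select id from users where integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2') order by created desc limit 1;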

If you can afford the perfect index for the query, the problem should go away too:

CREATE INDEX ON rounds (userid, created);

Then you could drop the index `rounds_userid_idx`, because the new index can do everything that that index can (unless it is a unique index).
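
That cleanup would be something like this (a sketch, assuming rounds_userid_idx is an ordinary non-unique index and is not backing a constraint):

DROP INDEX rounds_userid_idx;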

Laurenz Albe
  • I'll have to try messing with statistics in the future. For now, supplying a dedicated index solved our issue - albeit - as you say - we don't need the old index with this one. Good point! – Mordachai Jun 27 '23 at 15:32

Why is it so omg bad in the first place (in what universe would scanning all rows in the entire db beat scanning a smaller subset after applying the where clause)?

The universe where it doesn't think that that will happen. By traversing the index for rows already in the desired order, it thinks it can stop after finding the first row meeting the required userid, which means it thinks it can stop after reading 1/21340 of the table. This is obviously very wrong, because there is only 1 row meeting that condition, not 21339 of them like it thinks.
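
As a rough worked example from the plan above (an illustration, not the planner's exact arithmetic): the slow scan read about 23,123,822 rows (23,123,821 removed by the filter plus the 1 returned). With 21,339 matches assumed to be spread evenly through the index order, the planner expects to hit the first one after reading only about 23,123,822 / 21,340 ≈ 1,084 rows, which is why the plan looks cheap on paper; in reality it had to walk essentially the whole index.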

So a side question: why is the estimate so very wrong? Since it is making a generic estimate here (at planning time, it doesn't know what the value of $0 will turn out to be), it guesses the number of qualifying rows will be the estimated number of rows in the table divided by the estimated n_distinct for userid (adjusted for null values, but I will ignore that for simplicity). So either this is correct, and there are some values of userid with an awful lot of rows (which just happen not to be the values you are looking for) that pull up the average.

Or it is incorrect and the estimate for n_distinct is way off from reality. We can't know which from afar. You could figure it out by looking at pg_stats.n_distinct to see the estimate, and comparing it to the true value of count(distinct userid) from rounds. If the estimate is way off, you can manually fix it with alter table rounds alter column userid set (n_distinct = <true value>). You need to ANALYZE the table before this will take effect. The problem of grossly wrong n_distinct is due to inadequate selection of the sample used for computing stats (every row is equally likely to be chosen for the sample, but the selection is not strictly independent from row to row), and increasing STATISTICS as Laurenz recommends is only weakly able to overcome this problem, while setting n_distinct tackles it directly.
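
A sketch of that check and override (the 1234567 below is just a placeholder for whatever the real distinct count turns out to be):

-- what the planner currently believes (negative values mean a fraction of the row count)
select n_distinct from pg_stats where tablename = 'rounds' and attname = 'userid';

-- the real number of distinct users (can be slow on ~23M rows)
select count(distinct userid) from rounds;

-- if the estimate is badly off, pin it and re-gather statistics
alter table rounds alter column userid set (n_distinct = 1234567);
analyze rounds;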

One attempt with a CTE: (VERY SLOW!)

Your attempt was very close. But modern PostgreSQL can "see through" the CTE and plan it the same way as it would plan the original query. To prevent that, you need to declare the CTE as being materialized.

with r as materialized (
  ....
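
Applied to the CTE attempt from the question, that would look something like this (the same query, with only the materialized keyword added):

with r as materialized (
  select id, created
  from rounds
  where userid = (
    select id
    from users
    where integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2')
)
select r.id from r order by r.created desc limit 1;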

This is just by way of explanation of course, because having the multicolumn index is a simple powerful solution. By giving you the best of both worlds (both selectivity on userid and ordering on created, at the same time) it removes the need to make fraught decisions about which one is more important, like the planner needs to do when you have the two single-column indexes.

jjanes
  • Thanks! I've only ever done relatively simple SQL over my career. Some interesting things with CTEs and `for update` locking tricks. But understanding at this level has never been an interest nor a need. D'oh! Clearly, it's becoming a need :D Thanks for taking the time to explain this much more deeply. It's piqued my curiosity to understand a bit more about how the statistics work and are used by the planner. :) – Mordachai Jun 28 '23 at 12:44