
I have a fairly small table of 26 million records.

CREATE TABLE t1
(
   cam        varchar(100)  NOT NULL,
   updatedat  timestamp,
   objid      varchar(40)   NOT NULL,
   image      varchar(100)  NOT NULL,
   reader     varchar(60)   NOT NULL,
   imgcap     timestamp     NOT NULL
);

ALTER TABLE t1
   ADD CONSTRAINT t1_pk
   PRIMARY KEY (reader, cam, image, objid, imgcap);

I have a simple query to iterate over the records within a time range.

SELECT * FROM t1
WHERE updatedat >= '2021-12-09 20:30:00' AND updatedat <= '2021-12-09 20:32:01'
ORDER BY reader ASC, imgcap ASC, objid ASC, cam ASC, image ASC
LIMIT 10000
OFFSET 0;

I added an index to support the query, with the comparison column as the leftmost field and the remaining columns to support the sort.

CREATE INDEX t1_idtmp ON t1 USING btree (updatedat, reader, imgcap, objid, cam, image);

However, the query takes more than 10 seconds to complete, and it takes the same time even when there are no rows in the range. Here is the EXPLAIN (ANALYZE) output:

  ->  Incremental Sort  (cost=8.28..3809579.24 rows=706729 width=223) (actual time=11034.114..11065.710 rows=10000 loops=1)
        Sort Key: reader, imgcap, objid, cam, image
        Presorted Key: reader, imgcap
        Full-sort Groups: 62  Sort Method: quicksort  Average Memory: 42kB  Peak Memory: 42kB
        Pre-sorted Groups: 62  Sort Methods: top-N heapsort, quicksort  Average Memory: 58kB  Peak Memory: 58kB
        ->  Index Scan using t1_idxevtim on t1  (cost=0.56..3784154.75 rows=706729 width=223) (actual time=11033.613..11036.823 rows=10129 loops=1)
              Filter: ((updatedat >= '2021-12-09 20:30:00'::timestamp without time zone) AND (updatedat <= '2021-12-09 20:32:01'::timestamp without time zone))
              Rows Removed by Filter: 25415461
Planning Time: 0.137 ms
Execution Time: 11066.791 ms

There are a couple more indexes on the table to support other use cases.

CREATE INDEX t1_idxua ON t1 USING btree (updatedat);
CREATE INDEX t1_idxevtim ON t1 USING btree (reader, imgcap);

I think PostgreSQL wants to avoid an expensive sort and assumes that the presorted key will be faster, but why does PostgreSQL not use the t1_idtmp index when both the search and the sort can be satisfied by it?

mohit
  • Index t1_idtmp might be too big (in size) compared to t1_idxevtim. Did you try an index on just updatedat and reader? And what about the statistics? The current index scan is removing about 25 million rows, that is a lot... It looks like something is wrong in the statistics, and you might want to change the settings or create some stats for this query (both suggestions are sketched right after these comments). – Frank Heikens Dec 12 '21 at 20:55
  • That is a remarkable claim, and one not supported by the execution plan you show. Can you show the execution plan for the empty range? Also, VACUUM ANALYZE the table first. – jjanes Dec 12 '21 at 21:33
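
A minimal sketch of the suggestions in these comments, for reference; the index name is hypothetical, and raising the per-column statistics target is only one reading of "change the settings":

-- Refresh planner statistics before re-testing, as suggested above.
VACUUM ANALYZE t1;

-- One reading of "change the settings": collect finer-grained statistics
-- for the filter column, then re-analyze.
ALTER TABLE t1 ALTER COLUMN updatedat SET STATISTICS 1000;
ANALYZE t1;

-- Hypothetical narrower index on just the filter column plus the leading
-- sort column (the name is illustrative).
CREATE INDEX t1_idx_upd_reader ON t1 USING btree (updatedat, reader);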

1 Answer


why does PostgreSQL not use the t1_idtmp index when both the search and the sort can be satisfied by it?

Because the sort can't be satisfied by it. A btree index on (updatedat, reader, imgcap, objid, cam, image) can only produce data ordered by reader, imgcap, objid, cam, image within ties of updatedat. So if your condition were for a specific value of updatedat, that would work. But since it is for a range of updatedat, it won't work, as the qualifying rows are not all tied with each other on updatedat.
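
To illustrate the difference, a minimal sketch using the table and index from the question (whether the planner actually picks these plans also depends on statistics and costs):

-- Equality on the leading index column: within a single updatedat value,
-- t1_idtmp returns rows already ordered by (reader, imgcap, objid, cam, image),
-- so the ORDER BY can be satisfied with no separate sort step.
EXPLAIN (ANALYZE)
SELECT * FROM t1
WHERE updatedat = '2021-12-09 20:30:00'
ORDER BY reader, imgcap, objid, cam, image
LIMIT 10000;

-- Range on the leading index column: index entries for different updatedat
-- values interleave, so the scan output is not globally ordered by
-- (reader, imgcap, objid, cam, image) and a sort is still needed.
EXPLAIN (ANALYZE)
SELECT * FROM t1
WHERE updatedat >= '2021-12-09 20:30:00'
  AND updatedat <= '2021-12-09 20:32:01'
ORDER BY reader, imgcap, objid, cam, image
LIMIT 10000;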

jjanes
  • Thanks for the explanation. I created another column by truncating updatedat to the minute level and used that for an equality comparison, which solved the issue (this approach is sketched after this thread). However, I also tried to use the initial index by passing a list of updatedat values at the seconds level, like updatedat IN ('2021-12-09 20:30:00', '2021-12-09 20:30:01', '2021-12-09 20:30:02' .. '2021-12-09 20:30:05'). Is there any reason for PostgreSQL not to use the index? Will PostgreSQL only use the index if there is only one comparison? – mohit Dec 13 '21 at 07:20
  • There is no good reason it can't use the index for that by doing several disjoint ordered scans and merging the results while preserving order; the planner just hasn't been taught how to do that yet. You can force its hand by writing the query in a rather odd way, as a UNION ALL with redundant ORDER BY clauses (a rough sketch follows below). See https://stackoverflow.com/questions/70329905/postgresql-query-with-where-order-by-and-limit-very-slow/70339186#70339186 – jjanes Dec 13 '21 at 20:57
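
For context, a minimal sketch of the minute-truncation approach mohit describes above; the column name updatedat_min, the index name, and the use of a generated column (PostgreSQL 12 or later) are all illustrative, since the comment does not say how the extra column is maintained:

-- Hypothetical minute-truncated column, kept in sync automatically
-- (generated columns require PostgreSQL 12+).
ALTER TABLE t1
   ADD COLUMN updatedat_min timestamp
   GENERATED ALWAYS AS (date_trunc('minute', updatedat)) STORED;

CREATE INDEX t1_idtmp_min ON t1 USING btree
   (updatedat_min, reader, imgcap, objid, cam, image);

-- Equality on the leading column lets the index satisfy both the filter
-- and the ORDER BY, as explained in the answer.
SELECT * FROM t1
WHERE updatedat_min = '2021-12-09 20:30:00'
ORDER BY reader, imgcap, objid, cam, image
LIMIT 10000;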
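
And a rough sketch of the UNION ALL rewrite jjanes points to (the per-second values are illustrative; the linked answer shows the full pattern, and whether the planner merges the pre-sorted branches without a fresh sort still depends on version and costs):

-- Each branch has an equality on the leading index column plus its own
-- (redundant-looking) ORDER BY, so each branch can be read from t1_idtmp
-- in order; the outer ORDER BY / LIMIT then combines the ordered branches.
(SELECT * FROM t1
 WHERE updatedat = '2021-12-09 20:30:00'
 ORDER BY reader, imgcap, objid, cam, image)
UNION ALL
(SELECT * FROM t1
 WHERE updatedat = '2021-12-09 20:30:01'
 ORDER BY reader, imgcap, objid, cam, image)
UNION ALL
(SELECT * FROM t1
 WHERE updatedat = '2021-12-09 20:30:02'
 ORDER BY reader, imgcap, objid, cam, image)
ORDER BY reader, imgcap, objid, cam, image
LIMIT 10000;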