
I have a table with a foreign key and a timestamp for when the row was most recently updated. Rows with the same foreign key value are updated at roughly the same time, plus or minus an hour. I have an index on (foreign_key, timestamp). This is on PostgreSQL 11.

When I make a query like:

select * from table where foreign_key = $1 and timestamp > $2 order by primary_key;

It will use my index in cases where the timestamp predicate is selective across the entire table. But if the timestamp is far enough in the past that the majority of rows match, it will scan the primary_key index instead, assuming that will be faster. This problem goes away if I remove the ORDER BY.

I've looked at PostgreSQL's CREATE STATISTICS, but it doesn't seem to help in cases where the correlation is over a range of values, like a timestamp plus or minus five minutes, rather than a specific value.
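For reference, the kind of statistics object I tried looks roughly like this (the statistics name is made up). In PostgreSQL 11, functional-dependency statistics are only applied to equality clauses, which is presumably why they don't help with a range condition on updated_at:

```sql
-- Multivariate statistics on the correlated column pair.
-- 'dependencies' only helps equality predicates, not ranges.
CREATE STATISTICS pha_mls_updated_stats (dependencies)
    ON mls_id, updated_at
    FROM property_home_attributes;

ANALYZE property_home_attributes;
```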

What are the best ways to work around this? I can remove the ORDER BY, but that complicates the business logic. I could partition the table on the foreign key, but that is also a pretty expensive change.
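For context, the partitioning option would look roughly like this (sketch only, with made-up table names; migrating the existing 16 million rows is the expensive part):

```sql
-- Sketch: a list-partitioned version of the table, keyed on the foreign key.
CREATE TABLE property_home_attributes_part (
    id                   integer NOT NULL,
    mls_id               integer NOT NULL,
    property_id          integer NOT NULL,
    formatted_attributes jsonb   NOT NULL,
    created_at           timestamp,
    updated_at           timestamp
) PARTITION BY LIST (mls_id);

-- One partition per mls_id; queries filtering on mls_id = 46
-- then only ever touch this partition.
CREATE TABLE property_home_attributes_mls_46
    PARTITION OF property_home_attributes_part FOR VALUES IN (46);
```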

Specifics:

                                            Table "public.property_home_attributes"
        Column        |            Type             | Collation | Nullable |                       Default
----------------------+-----------------------------+-----------+----------+------------------------------------------------------
 id                   | integer                     |           | not null | nextval('property_home_attributes_id_seq'::regclass)
 mls_id               | integer                     |           | not null |
 property_id          | integer                     |           | not null |
 formatted_attributes | jsonb                       |           | not null |
 created_at           | timestamp without time zone |           |          |
 updated_at           | timestamp without time zone |           |          |
Indexes:
    "property_home_attributes_pkey" PRIMARY KEY, btree (id)
    "index_property_home_attributes_on_property_id" UNIQUE, btree (property_id)
    "index_property_home_attributes_on_updated_at" btree (updated_at)
    "property_home_attributes_mls_id_updated_at_idx" btree (mls_id, updated_at)

The table has about 16 million rows.

psql=# EXPLAIN ANALYZE SELECT * FROM property_home_attributes WHERE mls_id = 46 AND (property_home_attributes.updated_at < '2019-10-30 16:52:06.326774') ORDER BY id ASC LIMIT 1000;
                                                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..10147.83 rows=1000 width=880) (actual time=1519.718..22310.674 rows=1000 loops=1)
   ->  Index Scan using property_home_attributes_pkey on property_home_attributes  (cost=0.56..6094202.57 rows=600576 width=880) (actual time=1519.716..22310.398 rows=1000 loops=1)
         Filter: ((updated_at < '2019-10-30 16:52:06.326774'::timestamp without time zone) AND (mls_id = 46))
         Rows Removed by Filter: 358834
 Planning Time: 0.110 ms
 Execution Time: 22310.842 ms
(6 rows)

and then without the ORDER BY:

psql=# EXPLAIN ANALYZE SELECT * FROM property_home_attributes WHERE mls_id = 46 AND (property_home_attributes.updated_at < '2019-10-30 16:52:06.326774')  LIMIT 1000;
                                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..1049.38 rows=1000 width=880) (actual time=0.053..162.081 rows=1000 loops=1)
   ->  Index Scan using foo on property_home_attributes  (cost=0.56..629893.60 rows=600576 width=880) (actual time=0.053..161.992 rows=1000 loops=1)
         Index Cond: ((mls_id = 46) AND (updated_at < '2019-10-30 16:52:06.326774'::timestamp without time zone))
 Planning Time: 0.100 ms
 Execution Time: 162.140 ms
(5 rows)
  • In my case I think I should just use an integer column for versioning that I increment when doing updates, rather than updating a timestamp, as this will simplify indexing and the cross column dependency will be much clearer for postgresql multivariate statistics. I'll leave this open for now though, as I'm curious if there is a more general solution to the problem I was having. – Doug Dec 07 '19 at 06:20
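The version-counter idea from that comment could be sketched like this (column and index names are hypothetical); an equality match on an integer version is something dependency statistics can model, unlike a timestamp range:

```sql
-- Sketch: bump an integer version on update instead of comparing timestamps.
ALTER TABLE property_home_attributes
    ADD COLUMN version integer NOT NULL DEFAULT 0;

CREATE INDEX pha_mls_id_version_idx
    ON property_home_attributes (mls_id, version);

-- Each update increments the row's version:
UPDATE property_home_attributes
   SET formatted_attributes = $1,
       version = version + 1
 WHERE property_id = $2;
```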

1 Answer


If you want to keep PostgreSQL from using an index scan on property_home_attributes_pkey to support the ORDER BY, you can simply use

ORDER BY primary_key + 0
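Applied to the query from the question, the trick looks like this: the expression id + 0 cannot be satisfied by the primary-key index's sort order, so the planner falls back to the (mls_id, updated_at) index and sorts the matching rows explicitly:

```sql
SELECT *
  FROM property_home_attributes
 WHERE mls_id = 46
   AND updated_at < '2019-10-30 16:52:06.326774'
 ORDER BY id + 0 ASC
 LIMIT 1000;
```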
Laurenz Albe