22

I have a table with a datetime field "updated_at". A lot of my queries filter on this field using range conditions, such as rows that have updated_at > a certain date.

I already added an index on updated_at, but most of my queries are still very slow, even when I limit the number of rows returned.

What else can I do to optimize queries that query on datetime fields?
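
In case it helps, here is a simplified sketch of the kind of schema and query I mean (the table, column, and index names here are made up, not my real schema):

create table events (
    id         serial primary key,
    payload    text,
    updated_at timestamp not null
);

create index events_updated_at_idx on events (updated_at);

-- typical slow query: open-ended range on updated_at with a limit
select *
from events
where updated_at > '2013-05-01'
limit 100;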

Henley
  • Can you post the explain plan, total number of rows and the exact value of "very slow"? – Jakub Kania May 19 '13 at 21:13
  • Please read http://stackoverflow.com/tags/postgresql-performance/info (and the linked SlowQueryQuestions wiki page) then update your question with suitable `explain analyze` results and report back. Since you're using a query generator you might need to use `auto_explain` or to log queries and re-execute them by hand. – Craig Ringer May 19 '13 at 23:43
  • Please post the schema and the type of query that is slow. The question as it is phrased cannot reasonably be answered... – Denis de Bernardy May 20 '13 at 05:45

4 Answers

11

Usually database optimizers won't choose to use indexes for open-ended ranges, such as updated_at > somedate.

But, in many cases, the datetime column won't exceed "now", so you can preserve the semantics of > somedate by converting the condition to a range using between, like this:

where updated_at between somedate and current_timestamp

A between predicate is much more likely to cause the optimizer to choose to use an index.
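
For example, here is a sketch of the rewrite (the table name is made up, since the question doesn't give one):

-- open-ended predicate
select * from events where updated_at > '2013-05-01';

-- equivalent closed range, as long as updated_at never exceeds now()
select * from events
where updated_at between '2013-05-01' and current_timestamp;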


Please post if this approach improved your query’s performance.

Bohemian
  • Is this really true of PostgreSQL? I would think that the optimiser would look at the range of values in the relevant column, via pg_statistics, and produce an estimated cardinality of the result set for the predicate. If the maximum value is less than or equal to current_timestamp then I wouldn't think that there would be much difference. Be interesting for Henley to test though -- the explain plan would reveal all. – David Aldridge May 19 '13 at 21:31
  • @DavidAldridge in my experience, `>` is just not optimized well. I'd also like Henley to post results. – Bohemian May 19 '13 at 22:04
  • Postgres **will** use an index for `>` if it's useful. No need for a `between`: see here for an example http://sqlfiddle.com/#!12/e3142/3 It all depends - as usual with an index - on whether or not the cost of using the index is lower than the alternative. – a_horse_with_no_name May 19 '13 at 22:12
  • `>` is optimized just fine in Postgres and uses BTree indexes when it's appropriate to do so based on the table stats. – Denis de Bernardy May 20 '13 at 05:40
  • @a_horse_with_no_name nice fiddle. I'm sold. Still I'll leave the answer until OP posts results – Bohemian May 20 '13 at 06:39
  • Using `between` massively improved performance for me on Redshift – Paul Odeon Nov 26 '20 at 10:04
  • Using `between` solved my issue too. – Weihang Jian Mar 29 '22 at 11:02
5

For any given query, the use of an index depends on the cost of using that index compared to the cost of a sequential scan.

Frequently developers think that because there is an index, a query should run faster, and that if a query runs slowly, an index is the solution. That is usually true when the query returns few tuples. But as the number of tuples in the result increases, the cost of using an index can increase as well.

You are using Postgres. Postgres does not maintain physical clustering around a given attribute. That means that Postgres, when confronted with a range query (of the type att > a and att < b), needs to estimate the number of tuples in the result (make sure you VACUUM and ANALYZE your database frequently so those statistics stay current) and compare the cost of using an index against the cost of a sequential scan. It will then decide which method to use.

You can inspect this decision by running

EXPLAIN ANALYZE <query>; 

in psql. It will tell you if it uses an index or not.

If you really, really want to use the index instead of a sequential scan (sometimes it is needed) and you really, really know what you are doing, you can change the cost of a sequential scan via the planner cost constants, or disable sequential scans in favor of any other method. See this page for the details:

http://www.postgresql.org/docs/9.1/static/runtime-config-query.html

Make sure you browse the correct version of the documentation.
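
As a sketch of how to experiment with this (the table name is an assumption; the settings are real Postgres parameters, and the values below are only examples, for the current session):

-- keep the planner statistics current before judging the plan
VACUUM ANALYZE events;

-- see which plan the planner picks and how long it actually takes
EXPLAIN ANALYZE SELECT * FROM events WHERE updated_at > '2013-05-01' LIMIT 100;

-- for experimentation in the current session only
SET enable_seqscan = off;       -- discourage sequential scans
SET random_page_cost = 2.0;     -- assume cheaper random I/O (the default is 4.0)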

--dmg

dmg
2

I had a similar case in a table with nearly 1M rows.

So I created a B-tree index on visited_at (a datetime field) and tried a query over all rows:

explain analyze select mes,count(usuario) as usuarios
from (
   SELECT distinct coalesce(usuario, ip) as usuario, (extract(year from visited_at), extract(month from visited_at))   AS mes
     FROM pageview 
     ) as usuarios
group by 1
order by 1

I got:

GroupAggregate (cost=445468.78..451913.54 rows=200 width=64) (actual time=31027.876..31609.754 rows=8 loops=1)
-> Sort (cost=445468.78..447616.37 rows=859035 width=64) (actual time=31013.501..31439.350 rows=358514 loops=1)
Sort Key: usuarios.mes
Sort Method: external merge Disk: 24000kB
-> Subquery Scan on usuarios (cost=247740.16..263906.75 rows=859035 width=64) (actual time=23121.403..28200.175 rows=358514 loops=1)
-> Unique (cost=247740.16..255316.40 rows=859035 width=48) (actual time=23121.400..28129.538 rows=358514 loops=1)
-> Sort (cost=247740.16..250265.57 rows=1010166 width=48) (actual time=23121.399..27559.241 rows=1010702 loops=1)
Sort Key: (COALESCE(pageview.usuario, (pageview.ip)::text)), (ROW(date_part('year'::text, pageview.visited_at), date_part('month'::text, pageview.visited_at)))
Sort Method: external merge Disk: 66944kB
-> Seq Scan on pageview (cost=0.00..84842.49 rows=1010166 width=48) (actual time=0.012..1909.324 rows=1010702 loops=1)
Total runtime: 31632.012 ms

That means no improvement over running the query before adding the index.

But then I restricted the rows to the last 31 days (current_date - 31):

explain analyze select mes,count(usuario) as usuarios
from (
   SELECT distinct coalesce(usuario, ip) as usuario, (extract(year from visited_at), extract(month from visited_at))   AS mes
     FROM pageview 
     where visited_at > current_date - 31
     ) as usuarios
group by 1
order by 1

and got

 -> Sort (cost=164735.62..165310.93 rows=230125 width=64) (actual time=9532.343..9602.743 rows=90871 loops=1)
Sort Key: usuarios.mes
Sort Method: external merge Disk: 5872kB
-> Subquery Scan on usuarios (cost=122598.79..126929.62 rows=230125 width=64) (actual time=7251.344..9178.901 rows=90871 loops=1)
-> Unique (cost=122598.79..124628.37 rows=230125 width=48) (actual time=7251.343..9157.837 rows=90871 loops=1)
-> Sort (cost=122598.79..123275.32 rows=270610 width=48) (actual time=7251.341..8932.541 rows=294915 loops=1)
Sort Key: (COALESCE(pageview.usuario, (pageview.ip)::text)), (ROW(date_part('year'::text, pageview.visited_at), date_part('month'::text, pageview.visited_at)))
Sort Method: external merge Disk: 18864kB
-> Bitmap Heap Scan on pageview (cost=5073.60..81528.85 rows=270610 width=48) (actual time=111.950..1877.603 rows=294915 loops=1)
Recheck Cond: (visited_at > (('now'::cstring)::date - 31))
Rows Removed by Index Recheck: 338268
-> Bitmap Index Scan on visited_at_index (cost=0.00..5005.94 rows=270610 width=0) (actual time=109.874..109.874 rows=294915 loops=1)
Index Cond: (visited_at > (('now'::cstring)::date - 31))
Total runtime: 9687.460 ms

I got a small improvement by casting the datetime to a date (visited_at::date):

explain analyze select mes,count(usuario) as usuarios
from (
   SELECT distinct coalesce(usuario, ip) as usuario, (extract(year from visited_at::date), extract(month from visited_at::date))   AS mes
     FROM pageview 
     where visited_at::date > current_date - 31
     ) as usuarios
group by 1
order by 1

and got

GroupAggregate (cost=201976.97..204126.56 rows=200 width=64) (actual time=9040.196..9102.098 rows=2 loops=1)
-> Sort (cost=201976.97..202692.83 rows=286345 width=64) (actual time=9035.624..9058.457 rows=88356 loops=1)
Sort Key: usuarios.mes
Sort Method: external merge Disk: 5704kB
-> Subquery Scan on usuarios (cost=149102.66..154491.53 rows=286345 width=64) (actual time=7511.231..8840.270 rows=88356 loops=1)
-> Unique (cost=149102.66..151628.08 rows=286345 width=48) (actual time=7511.229..8823.647 rows=88356 loops=1)
-> Sort (cost=149102.66..149944.47 rows=336722 width=48) (actual time=7511.227..8666.667 rows=287614 loops=1)
Sort Key: (COALESCE(pageview.usuario, (pageview.ip)::text)), (ROW(date_part('year'::text, ((pageview.visited_at)::date)::timestamp without time zone), date_part('month'::text, ((pageview.visited_at)::date)::timestamp without time zone)))
Sort Method: external merge Disk: 18408kB
-> Seq Scan on pageview (cost=0.00..97469.57 rows=336722 width=48) (actual time=0.018..1946.139 rows=287614 loops=1)
Filter: ((visited_at)::date > (('now'::cstring)::date - 31))
Rows Removed by Filter: 722937
Total runtime: 9108.644 ms

These are the tweaks that worked for me:

1) the B-tree index (the main gain)
2) casting to date (a small difference)

10 s is still too long a response time for the user.

So my solution was to create a table month_users and populate it once:

insert into month_users select mes,count(usuario) as usuarios
from (
   SELECT distinct coalesce(usuario, ip) as usuario, (extract(year from visited_at), extract(month from visited_at))   AS mes
     FROM pageview 
     ) as usuarios
group by 1
order by 1

and use

select * from month_users

results:

Seq Scan on usuarios_mes (cost=0.00..21.30 rows=1130 width=42) (actual time=0.302..0.304 rows=8 loops=1)
Total runtime: 0.336 ms

An acceptable result now!

A final solution still has to consider how to update the results table regularly.
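
If you are on Postgres 9.3 or later, one way to handle that refresh (a sketch, not something I actually ran) is to make month_users a materialized view and refresh it periodically, e.g. from cron:

create materialized view month_users as
select mes, count(usuario) as usuarios
from (
   select distinct coalesce(usuario, ip) as usuario, (extract(year from visited_at), extract(month from visited_at)) as mes
     from pageview
     ) as usuarios
group by 1
order by 1;

-- re-run whenever the precomputed results need to be updated
refresh materialized view month_users;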

1

Assuming that the index is being used but performance is still poor, the only remedy I can think of is to cluster the table by that index: http://www.postgresql.org/docs/9.1/static/sql-cluster.html

This will move rows with similar updated_at values so that they are physically co-located, improving the performance of queries that access the table via that index, particularly for large range scans.
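
For example (the table and index names here are illustrative, not taken from the question):

-- one-off physical reordering of the table rows in index order
CLUSTER my_table USING my_table_updated_at_idx;

-- CLUSTER remembers the index, so a later re-cluster only needs
CLUSTER my_table;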

Pay attention to the warnings in the documentation though, and note that as rows are updated the clustering is not preserved.

Also:

When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it. This prevents any other database operations (both reads and writes) from operating on the table until the CLUSTER is finished.

Based on these restrictions it may not be a viable solution for your case, but might be useful to others.

David Aldridge