4

I have a large table that I run queries like select date_att > date '2001-01-01' on regularly. I'm trying to increase the speed of these queries by clustering the table on date_att, but when I run those queries through explain analyze it still chooses to sequentially scan the table, even on queries as simple as SELECT date_att from table where date_att > date '2001-01-01'. Why is this the case? I understand that since the query returns a large portion of the table, the optimizer will ignore the index, but since the table is clustered by that attribute, shouldn't it be able to really quickly binary search through the table to the point where date > '2001-01-01' and return all results after that? This query still takes as much time as without the clustering.

abu
  • 948
  • 1
  • 7
  • 14
  • 1
    Can you show your table and index definition, I am confused by the fact you say you have a clustered index on the date column, then later you refer to *"sequentially scan the table"*, a table scan **is** a clustered index scan. – GarethD Dec 05 '13 at 21:12
  • 1
    Thanks for the quick response- The table definition is: `CREATE TABLE test(dummy_primary_key integer, date_att date, primary key(dummy_primary_key));` Then I create an index on date_att and cluster on it with the following commands: `CREATE INDEX dateindex on test(date_att); CLUSTER test using dateindex;` Then, running `SELECT date_att FROM test WHERE date_att > date '2001-01-01'` still takes around 10 seconds. Test holds around a gig of data and the query returns about half of that. From my understanding of clustered indexes, this should be very quick but its still quite slow. – abu Dec 05 '13 at 21:14
  • Hmmm, afraid this is a mystery to me. I don't know enough about Postgresql to know how the analyser works, what is interesting is that with the table definition you have posted the index is used, as soon as other columns are added to the table the index is no longer used, even though the other columns are not referenced in the query. http://sqlfiddle.com/#!15/94380/1 – GarethD Dec 05 '13 at 21:22
  • Strange... I just tried the same on psql command line and it seqscans no matter what the definition is... Thanks anyways! – abu Dec 05 '13 at 21:32
  • 2
    Please post the the execution plan from `explain analyze` (or upload it to http://explain.depesz.com). Do **not** post this as a comment, edit your question. You might also want to read this: https://wiki.postgresql.org/wiki/Slow_Query_Questions –  Dec 05 '13 at 22:10
  • `WHERE date_att > date '2001-01-01'` It really depends on the distibution of your dates. If the statistics indicate that > xxx percent of your rows will be selected it won't help much to use an index (except for final sorting, maybe) What is the distribution of your dates? do you have valid statistics? What are the settings for your tuning-constants? – wildplasser Dec 05 '13 at 23:54
  • It returns over the percentage that makes sense for a normal index, but if I cluster the table on the index, doesn't that mean the physical ordering of the data follows the index? so if I want all dates over a certain amount, I just have to access the disk using a binary search to that date, then get all data from that row to the end of the table? Am I misunderstanding this? – abu Dec 06 '13 at 08:15
  • If you always use the same few dates, maybe you should look into partial indexes from Postgre http://www.postgresql.org/docs/9.3/static/indexes-partial.html – tsadiq Dec 06 '13 at 09:56

1 Answers1

11

It seems like you are confusing two concepts:

PostgreSQL clustering of a table

Clustering a table according to an index in PostgreSQL aligns the order of table rows (stored in a heap table) to the order in the index at the time of clustering. From the docs:

Clustering is a one-time operation: when the table is subsequently updated, the changes are not clustered. http://www.postgresql.org/docs/9.3/static/sql-cluster.html

Clustering potentially (often) improves query speed for range queries because the selected rows are stored nearby in the heap table by coincidence. There is nothing that guarantees this order! Consequently the optimizer cannot assume that it is true.

E.g. if you insert a new row that fulfills your where clause it might be inserted at any place in the table — e.g. where rows for 1990 are stored. Hence, this assumtion doesn't hold true:

but since the table is clustered by that attribute, shouldn't it be able to really quickly binary > search through the table to the point where date > '2001-01-01' and return all results after that?

This brings us to the other concept you mentioned:

Clustered Indexes

This is something completely different, not supported by PostgreSQL at all but by many other databases (SQL Server, MySQL with InnoDB and also Oracle where it is called 'Index Organized Table').

In that case, the table data itself is stored in an index structure — there is no separate heap structure! As it is an index, the order is also maintained for each insert/update/delete. Hence your assumption would hold true and indeed I'd expect the above mentioned database to behave as you would expect it (given the date column is the clustering key!).

Hope that clarifies it.

Markus Winand
  • 8,371
  • 1
  • 35
  • 44
  • @josh-hull I've added your suggested edits manually as is was already rejected by the crowd. Stupid System. See also here: http://meta.stackexchange.com/questions/185472/can-i-accept-an-already-rejected-edit-to-my-own-post – Markus Winand Dec 06 '13 at 14:41
  • So what are the use-cases for clustering a table in PostgreSQL? I am reading on this topic since last 12 hours and I am utterly confused now. – Kamil Latosinski Mar 29 '17 at 20:31
  • @KamilLatosinski I've never used PostgreSQL clustering for anything. It's an old feature and I doubt it works well with more modern features (like Index Only Scan). I guess it's use is limited to a very few corner cases nowadays. – Markus Winand Mar 30 '17 at 09:18