
I'm having trouble with the speed of a SELECT query on a Postgres database.

I have a table with two integer columns as its key: (int1, int2). This table has around 70 million rows.

I need to run two kinds of simple SELECT queries in this environment:

SELECT * FROM table WHERE int1=X;
SELECT * FROM table WHERE int2=X;

These two SELECTs return around 10,000 rows each out of the 70 million. To make this as fast as possible, I thought of using two HASH indexes, one for each column. Unfortunately the results are not that good:

                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on lec_sim  (cost=232.21..25054.38 rows=6565 width=36) (actual time=14.759..23339.545 rows=7871 loops=1)
   Recheck Cond: (lec2_id = 11782)
   ->  Bitmap Index Scan on lec_sim_lec2_hash_ind  (cost=0.00..230.56 rows=6565 width=0) (actual time=13.495..13.495 rows=7871 loops=1)
         Index Cond: (lec2_id = 11782)
 Total runtime: 23342.534 ms
(5 rows)

This is an EXPLAIN ANALYZE example of one of these queries. It is taking around 23 seconds; my expectation was to get this information in less than a second.
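For reference, a rough sketch of the schema and indexes being described (the lec_sim table, the lec2_id column and the lec_sim_lec2_hash_ind index come from the plan above; the other names are illustrative):

CREATE TABLE lec_sim (
    lec1_id integer NOT NULL,   -- "int1" (illustrative name)
    lec2_id integer NOT NULL,   -- "int2", as seen in the plan above
    float1  real,               -- the two float payload columns (illustrative names)
    float2  real,
    PRIMARY KEY (lec1_id, lec2_id)
);

-- One hash index per key column, as described above.
CREATE INDEX lec_sim_lec1_hash_ind ON lec_sim USING hash (lec1_id);
CREATE INDEX lec_sim_lec2_hash_ind ON lec_sim USING hash (lec2_id);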

These are some parameters of the Postgres config:

work_mem = 128MB
shared_buffers = 2GB
maintenance_work_mem = 512MB
fsync = off
synchronous_commit = off
effective_cache_size = 4GB

Any help, comment or thought would be really appreciated.

Thank you in advance.

Craig Ringer
alexdemartos
  • As part of your total time, how much of that is sending the data back to you? Are you running the query on the same machine as the database, or are you going over the wire? – Kuberchaun Nov 05 '12 at 15:13
  • @JustBob: the time reported in the explain output is the time for preparing the query *on the server* (without client-side roundtrips). – a_horse_with_no_name Nov 05 '12 at 15:14
  • Hash indexes are not very efficient in PostgreSQL. Did you try regular B-Tree indexes? Do you have one index for each column, or a combined index on both? For which of the two statements is the posted execution plan? – a_horse_with_no_name Nov 05 '12 at 15:16
  • JustBob, a_horse_with_no_name answered you better than I would. I guess receiving the data is pretty much instant for me, since there are only 2 floats as data. – alexdemartos Nov 05 '12 at 15:20
  • a_horse_with_no_name, at first they were B-Tree indexes, and as they were running slow I switched to hash indexes. There are actually 3 indexes: the main one (int1, int2) as a B-Tree, then one hash index for each column: hash (int1), hash (int2). Thanks. – alexdemartos Nov 05 '12 at 15:21
  • The index lookup here was very fast -- all the time was spent retrieving the actual rows. 23 seconds / 7871 rows = 2.9 milliseconds per row, which is reasonable for retrieving data that's scattered across the disk subsystem. Seeks are slow; you can a) fit your dataset in RAM, b) buy SSDs, or c) organize your data ahead of time to minimize seeks. – willglynn Nov 05 '12 at 15:58
  • @willglynn: Thank you very much, I didn't realize the time was being spent seeking the data rather than looking up the index. So, forgetting about the index-related wonderings, I'd like to see how I could organize this data to fetch it faster from the HD. Maybe I could order the rows by (int1) in one table, keep a copy of the table ordered by (int2), and run the SELECT query against whichever table matches the key I am looking for. Any better ideas here? Would it work faster? Thank you so much. – alexdemartos Nov 05 '12 at 16:03
  • Two tables is an option, particularly if you `CLUSTER` both of them. However, PostgreSQL 9.2 adds a feature called [index only scans](http://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.2#Index-only_scans) that is particularly helpful here -- make a `btree` index over all the columns of interest (which PostgreSQL automatically keeps in order), and the query can (probably) be answered using only the index with no extra seeks. – willglynn Nov 05 '12 at 16:10

4 Answers


Extracting my comments into an answer: the index lookup here was very fast -- all the time was spent retrieving the actual rows. 23 seconds / 7871 rows = 2.9 milliseconds per row, which is reasonable for retrieving data that's scattered across the disk subsystem. Seeks are slow; you can a) fit your dataset in RAM, b) buy SSDs, or c) organize your data ahead of time to minimize seeks.

PostgreSQL 9.2 has a feature called index-only scans that allows it to (usually) answer queries without accessing the table. You can combine this with the btree index property of automatically maintaining order to make this query fast. You mention int1, int2, and two floats:

CREATE INDEX sometable_int1_floats_key ON sometable (int1, float1, float2);
CREATE INDEX sometable_int2_floats_key ON sometable (int2, float1, float2);

SELECT float1,float2 FROM sometable WHERE int1=<value>; -- uses int1 index
SELECT float1,float2 FROM sometable WHERE int2=<value>; -- uses int2 index

Note also that this doesn't magically erase the disk seeks, it just moves them from query time to insert time. It also costs you storage space, since you're duplicating the data. Still, this is probably the trade-off you want.
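If you want to confirm that the planner is actually answering the query from the index alone, an EXPLAIN along these lines should show an "Index Only Scan" node (the value is illustrative). The table needs to be vacuumed first so the visibility map is populated; otherwise PostgreSQL still has to visit the heap for visibility checks:

VACUUM ANALYZE sometable;   -- populates the visibility map, needed for index-only scans

EXPLAIN (ANALYZE, BUFFERS)
SELECT float1, float2 FROM sometable WHERE int1 = 11782;
-- Expect something like:
--   Index Only Scan using sometable_int1_floats_key on sometable ...
--     Heap Fetches: 0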

willglynn
  • Thank you very much @willglynn. I think that was what I was looking for. I will give it a chance tomorrow and I will post how this performs. – alexdemartos Nov 06 '12 at 00:41
  • I am completely blown away by how amazing this is and how dramatically it impacted performance. Creating a simple index dropped database queries from ~8s to ~20ms on a production DB containing around 20 million rows. – James Taylor Sep 26 '17 at 22:20

Thank you willglynn. As you noticed, the problem was seeking through the HD rather than looking up the indexes. You proposed several solutions, such as loading the dataset into RAM or buying an SSD. But setting those two aside, since they involve managing things outside the database itself, you proposed two other ideas:

  1. Reorganize the data to reduce disk seeks.
  2. Use the PostgreSQL 9.2 feature "index-only scans".

Since I am on a PostgreSQL 9.1 server, I decided to go with option 1.

I made a copy of the table, so now I have the same table with the same data twice. I created an index on each one, the first indexed by (int1) and the second by (int2). Then I clustered both of them (CLUSTER table USING ind_intX) on their respective indexes.
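In SQL, the steps look roughly like this (only the copy keyed by int2 is shown; the table and index names are taken from the plan below, and lec2_id from the earlier plan):

-- Make a physical copy of the table for queries on the second key column.
CREATE TABLE lec_sim_lec2id AS SELECT * FROM lec_sim;

-- Index the copy on that column...
CREATE INDEX lec_sim_lec2id_ind ON lec_sim_lec2id (lec2_id);

-- ...and rewrite it in index order so matching rows sit next to each other on disk.
CLUSTER lec_sim_lec2id USING lec_sim_lec2id_ind;
ANALYZE lec_sim_lec2id;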

Here is an EXPLAIN ANALYZE of the same query, run on one of these clustered tables:

                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using lec_sim_lec2id_ind on lec_sim_lec2id  (cost=0.00..21626.82 rows=6604 width=36) (actual time=0.051..1.500 rows=8119 loops=1)
   Index Cond: (lec2_id = 12300)
 Total runtime: 1.822 ms
(3 rows)

Now the seeking is really fast: I went down from 23 seconds to ~2 milliseconds, which is an impressive improvement. I think this problem is solved for me, and I hope this might also be useful for others experiencing the same problem.

Thank you so much willglynn.

alexdemartos
  • If you have a static data set, you're done. If not, you'll want to maintain the sorted tables using triggers (so that there's one source of truth), and you'll need to periodically re-`CLUSTER` to maintain on-disk order as your data changes. – willglynn Nov 06 '12 at 15:30
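A minimal sketch of the trigger approach willglynn describes, assuming the copy is named lec_sim_lec2id and the key columns are lec1_id and lec2_id (only one copy shown; the periodic CLUSTER/ANALYZE still has to be scheduled separately):

CREATE OR REPLACE FUNCTION sync_lec_sim_lec2id() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO lec_sim_lec2id SELECT NEW.*;
    ELSIF TG_OP = 'UPDATE' THEN
        DELETE FROM lec_sim_lec2id
         WHERE lec1_id = OLD.lec1_id AND lec2_id = OLD.lec2_id;
        INSERT INTO lec_sim_lec2id SELECT NEW.*;
    ELSIF TG_OP = 'DELETE' THEN
        DELETE FROM lec_sim_lec2id
         WHERE lec1_id = OLD.lec1_id AND lec2_id = OLD.lec2_id;
    END IF;
    RETURN NULL;  -- AFTER trigger: the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER lec_sim_lec2id_sync
AFTER INSERT OR UPDATE OR DELETE ON lec_sim
FOR EACH ROW EXECUTE PROCEDURE sync_lec_sim_lec2id();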

I had a case of super slow queries where simple one-to-many joins (in PG v9.1) were performed between a 33-million-row table and a 2.4-billion-row child table. I performed a CLUSTER on the foreign key index of the child table, but found that this didn't solve my problem with query timeouts, even for the simplest of queries. Running ANALYZE did not solve the issue either.

What made a huge difference was performing a manual VACUUM on both the parent table and the child table. Even while the parent table's VACUUM was still running, I went from 10-minute timeouts to results coming back in one second.

What I am taking away from this is that regular VACUUM operations are still critical, even on v9.1. The reason I did this was that I noticed autovacuum hadn't run on either of the tables for at least two weeks, and lots of upserts and inserts had occurred since then. It may be that I need to tune the autovacuum triggering settings to take care of this issue going forward, but what I can say is that a 640GB table with a couple of billion rows does perform well if everything is cleaned up. I haven't yet had to partition the table to get good performance.
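To see whether autovacuum is keeping up, pg_stat_user_tables shows the last (auto)vacuum/analyze times and the dead-tuple counts; the table names below are placeholders:

SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze,
       n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname IN ('parent_table', 'child_table');   -- placeholder names

-- Manual cleanup, as described above:
VACUUM ANALYZE parent_table;
VACUUM ANALYZE child_table;

-- Optionally make autovacuum fire earlier on the huge child table:
ALTER TABLE child_table SET (autovacuum_vacuum_scale_factor = 0.01);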

Robert Casey
  • You should investigate *why* autovacuum wasn't running. You probably have some sessions stuck in "idle in transaction" mode. Make sure you terminate transactions properly in your code. You should probably also make your autovacuum settings more aggressive. – Feb 21 '14 at 17:05

For a very simple and effective one-liner: if you have fast solid-state storage on your Postgres machine, try setting:

random_page_cost=1.0

in your postgresql.conf.

The default is random_page_cost = 4.0, which is optimized for storage with high seek times like old spinning disks. Lowering it changes the cost calculation for random reads, so the planner leans more on index scans and relies less on keeping everything in memory (which could ultimately be going to swap anyway).
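You can also test the effect for a single session before touching postgresql.conf (the query is just an example):

SET random_page_cost = 1.0;                                    -- this session only
EXPLAIN ANALYZE SELECT * FROM sometable WHERE int1 = 11782;    -- compare plan and timing
RESET random_page_cost;                                        -- back to the configured value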

This setting alone improved my filtering query from 8 seconds down to 2 seconds on a long table with a couple million records.

The other major improvement came from creating indexes over all of the boolean columns on my table. This reduced the 2-second query to about 1 second. Check @willglynn's answer for that.
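For the boolean-column indexes, here is a sketch with hypothetical table and column names; a partial index can be even smaller if the filter condition is always the same:

-- Multicolumn index covering the boolean filter columns (hypothetical names).
CREATE INDEX orders_flags_idx ON orders (is_active, is_paid, is_shipped);

-- If queries always filter on the same condition, a partial index is smaller:
CREATE INDEX orders_active_created_idx ON orders (created_at) WHERE is_active;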

Hope this helps!

Nick Woodhams