
I'm using an Azure PostgreSQL DB to store around 150 million records. Specs of the DB are: 160GB storage and around 4GB RAM. The data set is around 80GB when loaded into the DB. It is not going to grow, but will stay around 80-85 GB.

Here's the table definition:

CREATE TABLE properties(
    PropertyId bigint,
    Address text,
    Latitude double precision,
    Longitude double precision,
    Rooms int,
    BathRooms int
)

and a few more columns

Most of the queries are based on the following 4 fields:

Address (text) 
Longitude (double)
Latitude (double)
PropertyId (big int)

I've created indexes on all of these fields: a GIN trigram index on Address, and B-tree indexes on the others.

CREATE INDEX address_idx ON properties USING GIN (Address gin_trgm_ops);
CREATE INDEX propertyid_idx ON properties(PropertyId);
CREATE INDEX latitude_idx ON properties(Latitude);
CREATE INDEX longitude_idx ON properties(Longitude);

But the problem is that the queries are still slow. For example,

select * from properties
where Latitude between x and y
and Longitude between p and q
and address like '%address%';

takes minutes...

I analysed the queries using EXPLAIN ANALYZE, which shows that the queries do indeed use the indexes.
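For reference, the plan below was captured with EXPLAIN (ANALYZE, BUFFERS), plugging in the actual values from one slow run (the I/O timing lines appear because track_io_timing is enabled):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM properties
WHERE Latitude BETWEEN 38 AND 39
  AND Longitude BETWEEN -90.5 AND -90
  AND Address LIKE '%3365%';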

"Bitmap Heap Scan on properties  (cost=34256.04..34901.54 rows=10 width=561) (actual time=24664.562..32007.752 rows=35 loops=1)"
"  Recheck Cond: ((Address ~~ '%3365%'::text) AND (Longitude >= '-90.5'::double precision) AND (Longitude <= '-90'::double precision))"
"  Rows Removed by Index Recheck: 1123"
"  Filter: ((propertylatitude >= '38'::double precision) AND (propertylatitude <= '39'::double precision))"
"  Rows Removed by Filter: 64"
"  Heap Blocks: exact=1213"
"  Buffers: shared hit=181 read=6478"
"  I/O Timings: read=31160.388"
"  ->  BitmapAnd  (cost=34256.04..34256.04 rows=161 width=0) (actual time=24660.058..24660.059 rows=0 loops=1)"
"        Buffers: shared hit=169 read=5277"
"        I/O Timings: read=23836.224"
"        ->  Bitmap Index Scan on address_idx  (cost=0.00..135.75 rows=12233 width=0) (actual time=6892.077..6892.077 rows=12973 loops=1)"
"              Index Cond: (Address ~~ '%3365%'::text)"
"              Buffers: shared hit=168 read=321"
"              I/O Timings: read=6815.544"
"        ->  Bitmap Index Scan on longitude_idx  (cost=0.00..34120.04 rows=1627147 width=0) (actual time=17763.265..17763.265 rows=1812752 loops=1)"
"              Index Cond: ((Longitude >= '-90.5'::double precision) AND (Longitude <= '-90'::double precision))"
"              Buffers: shared hit=1 read=4956"
"              I/O Timings: read=17020.681"
"Planning Time: 0.267 ms"
"Execution Time: 32008.085 ms"

So my questions are,

  1. Are there any ways to improve the performance (SQL-wise), e.g. different indexing algorithms or strategies?
  2. Is there a rule of thumb for estimating the memory and storage requirements given the data size? What would be the minimum adequate hardware to get results in under 2 seconds?
KTB

2 Answers


The query uses the indexes and the estimates are fine.

Your time is almost entirely spent doing I/O, and you have an I/O speed of about 2MB per second, which is terrible. Perhaps your I/O subsystem is overloaded; you may want to check that.

With storage that slow, your only chance is to cache the whole table and its indexes in RAM, or to get faster storage.
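To get an idea of how much RAM that would take, you can check the on-disk size of the table and its indexes, for example:

-- size of the table itself, of its indexes, and of both together
SELECT pg_size_pretty(pg_table_size('properties'))          AS table_size,
       pg_size_pretty(pg_indexes_size('properties'))        AS index_size,
       pg_size_pretty(pg_total_relation_size('properties')) AS total_size;

That total is roughly what would have to fit in shared_buffers plus the operating system cache for the query to run without hitting disk.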

Laurenz Albe
  • So if I understood correctly, having better RAM will reduce the number of I/Os? Does faster storage mean something like an SSD? Is there a formula to calculate the required RAM for a dataset of this size? – KTB Sep 07 '21 at 15:02
  • Having *more* RAM would be the key - enough to cache your whole database. But even with lame spinning disks the performance would be much better. What kind of hardware is that? Perhaps your I/O system is totally overloaded. – Laurenz Albe Sep 07 '21 at 15:08
  • I'm using the Azure PostgreSQL DB - single server. Maybe I should increase the RAM to 20GB? – KTB Sep 07 '21 at 15:54
  • More RAM would not harm, but see that you get better I/O performance. – Laurenz Albe Sep 07 '21 at 16:09

The scan on longitude_idx is astonishingly slow. The I/O done for a simple range scan on a B-tree index is usually mostly sequential, unless the index has seen very heavy turnover over time and become badly fragmented. But maybe whatever storage class this database is on has no read-ahead optimization, so sequential reads get no advantage over random reads.

You could try to REINDEX the table to see if that makes a difference. It might make sense to instead cluster the table on either lat or long, which should automatically reindex it, but you might not have enough scratch space to do this.
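Roughly like this, using the index name from the query plan (note that CLUSTER rewrites the whole table, takes an exclusive lock while it runs, and needs free space about the size of the table):

-- rebuild the index that is being scanned (or REINDEX TABLE properties to rebuild them all)
REINDEX INDEX longitude_idx;   -- on PostgreSQL 12+ you can add CONCURRENTLY to avoid blocking writes

-- or physically reorder the table by longitude, which rebuilds all of its indexes as well
CLUSTER properties USING longitude_idx;
ANALYZE properties;            -- refresh planner statistics after the rewrite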

jjanes
  • Seems the problem is with the hardware. 4GB RAM is not enough at all to handle this dataset. Maybe I should try with 20GB, for example. – KTB Sep 07 '21 at 15:57
  • It isn't an either/or thing. 4GB RAM might be fine if your IO wasn't so remarkably bad. Having your data reorganized as I described might also improve things a lot (but probably not 1000-fold) with your current RAM and IO. – jjanes Sep 07 '21 at 16:26
  • Maybe I got your point wrong. Are you suggesting I check whether re-indexing would reduce the I/O? I found this article that discusses improving I/O: https://severalnines.com/blog/tuning-io-operations-postgresql – KTB Sep 07 '21 at 17:04
  • On a freshly created btree index, the physical order of leaf pages is the same as the logical order, so reading 4956 of them in logical order should result in fast sequential reads rather than slow individual random reads. – jjanes Sep 08 '21 at 14:32