I'm using Azure Database for PostgreSQL to store around 150 million records. The server specs are 160 GB of storage and around 4 GB of RAM. The dataset is around 80 GB once loaded into the DB. It is not going to grow; it will stay at around 80-85 GB.
Here's the table definition:
CREATE TABLE properties(
    PropertyId bigint,
    Address text,
    Latitude double precision,
    Longitude double precision,
    Rooms int,
    BathRooms int
)
plus a few more columns.
Most of the queries are based on the following 4 fields:
Address (text)
Longitude (double)
Latitude (double)
PropertyId (big int)
I've created an index on each of these columns: GIN (trigram) for Address, B-tree for the others.
CREATE INDEX address_idx ON properties USING GIN (Address gin_trgm_ops);
CREATE INDEX propertyid_idx ON properties(PropertyId);
CREATE INDEX latitude_idx ON properties(Latitude);
CREATE INDEX longitude_idx ON properties(Longitude);
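(For completeness: gin_trgm_ops comes from the pg_trgm extension, which is enabled on the database; the GIN index above wouldn't build without it.)

CREATE EXTENSION IF NOT EXISTS pg_trgm;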
The problem is that the queries are still slow. For example:
select * from properties
where Latitude between x and y
and Longitude between p and q
and address like '%address%';
takes minutes...
I analysed the query using EXPLAIN ANALYZE, and the plan shows the indexes are indeed being used.
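The statement was roughly the following (the Buffers and I/O Timings lines in the output mean BUFFERS was on and track_io_timing is enabled):

EXPLAIN (ANALYZE, BUFFERS)
select * from properties
where Latitude between 38 and 39
and Longitude between -90.5 and -90
and Address like '%3365%';

This is the output: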
"Bitmap Heap Scan on properties (cost=34256.04..34901.54 rows=10 width=561) (actual time=24664.562..32007.752 rows=35 loops=1)"
" Recheck Cond: ((Address ~~ '%3365%'::text) AND (Longitude >= '-90.5'::double precision) AND (Longitude <= '-90'::double precision))"
" Rows Removed by Index Recheck: 1123"
" Filter: ((propertylatitude >= '38'::double precision) AND (propertylatitude <= '39'::double precision))"
" Rows Removed by Filter: 64"
" Heap Blocks: exact=1213"
" Buffers: shared hit=181 read=6478"
" I/O Timings: read=31160.388"
" -> BitmapAnd (cost=34256.04..34256.04 rows=161 width=0) (actual time=24660.058..24660.059 rows=0 loops=1)"
" Buffers: shared hit=169 read=5277"
" I/O Timings: read=23836.224"
" -> Bitmap Index Scan on address_idx (cost=0.00..135.75 rows=12233 width=0) (actual time=6892.077..6892.077 rows=12973 loops=1)"
" Index Cond: (Address ~~ '%3365%'::text)"
" Buffers: shared hit=168 read=321"
" I/O Timings: read=6815.544"
" -> Bitmap Index Scan on longitude_idx (cost=0.00..34120.04 rows=1627147 width=0) (actual time=17763.265..17763.265 rows=1812752 loops=1)"
" Index Cond: ((Longitude >= '-90.5'::double precision) AND (Longitude <= '-90'::double precision))"
" Buffers: shared hit=1 read=4956"
" I/O Timings: read=17020.681"
"Planning Time: 0.267 ms"
"Execution Time: 32008.085 ms"
So my questions are:
- Are there any ways to improve the performance SQL-wise, e.g. different indexing algorithms or strategies? For example, would a combined spatial index along the lines of the sketch below help?
- Is there a rule of thumb for estimating memory and storage requirements given the data size? What would be the minimum adequate hardware to get results in under 2 seconds?
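One idea I've been considering (just a sketch, not something I've tested; the index name is a placeholder) is to replace the two separate B-tree indexes on Latitude and Longitude with a single GiST index on a point expression, and rewrite the bounding-box part of the query against it:

-- single spatial index on both coordinates (built-in point/box GiST support, no extension needed)
CREATE INDEX location_gist_idx ON properties USING gist (point(Longitude, Latitude));

-- bounding box rewritten so it can use the GiST index:
-- box(point(x1, y1), point(x2, y2)) with x = longitude, y = latitude
select * from properties
where point(Longitude, Latitude) <@ box(point(-90.5, 38), point(-90, 39))
and Address like '%3365%';

Would something like that (or going all the way to PostGIS) be the right direction here?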