We are trying to query millions of rows from a single 14-column table in under 2 seconds if possible. The table currently holds more than 10 million records and grows continuously, as live data is written into it every second, every day.
The queries we write against this table typically look like:
SELECT "Scale",
COUNT("Weight") AS "Total Punnets",
MAX("Time") - MIN("Time") AS "Delta Time", database
FROM global_packing_data
WHERE "Date" >= '2022-05-22'AND "Date" <= '2022-08-22'
AND global_packing_data.database IN ('line_1_computer','line_2_computer')
GROUP BY "Scale", database
The execution plan for this specific query is as follows:
Finalize GroupAggregate  (cost=1499111.79..1499128.59 rows=120 width=44) (actual time=44754.797..44769.894 rows=30 loops=1)
  Group Key: "Scale", database
  Buffers: shared hit=694 read=915799
  ->  Gather Merge  (cost=1499111.79..1499125.59 rows=120 width=44) (actual time=44754.779..44769.854 rows=60 loops=1)
        Workers Planned: 1
        Workers Launched: 1
        Buffers: shared hit=694 read=915799
        ->  Sort  (cost=1498111.78..1498112.08 rows=120 width=44) (actual time=44722.956..44722.967 rows=30 loops=2)
              Sort Key: "Scale", database
              Sort Method: quicksort  Memory: 27kB
              Buffers: shared hit=694 read=915799
              Worker 0:  Sort Method: quicksort  Memory: 27kB
              ->  Partial HashAggregate  (cost=1498106.44..1498107.64 rows=120 width=44) (actual time=44722.875..44722.888 rows=30 loops=2)
                    Group Key: "Scale", database
                    Batches: 1  Memory Usage: 40kB
                    Buffers: shared hit=679 read=915799
                    Worker 0:  Batches: 1  Memory Usage: 40kB
                    ->  Parallel Seq Scan on global_packing_data  (cost=0.00..1379067.81 rows=9523090 width=32) (actual time=374.969..42710.007 rows=8140736 loops=2)
                          Filter: (("Date" >= '2022-05-22'::date) AND ("Date" <= '2022-08-22'::date) AND (database = ANY ('{line_1_computer,line_2_computer}'::text[])))
                          Rows Removed by Filter: 14327912
                          Buffers: shared hit=679 read=915799
Planning:
  Buffers: shared hit=7 read=4
Planning Time: 1.484 ms
JIT:
  Functions: 21
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 3.144 ms, Inlining 117.817 ms, Optimization 386.307 ms, Emission 249.100 ms, Total 756.369 ms
Execution Time: 44771.410 ms
Attempts to improve performance:
- Created a materialized view based on the table. This worked amazingly at first, but the performance is no longer stellar. We are not sure whether that is due to the view having grown since it was created, or to something else.
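In essence, the view precomputes the aggregation from the query above. A simplified sketch of the idea (the view name packing_summary and the exact grouping columns here are illustrative, not our production definition):

CREATE MATERIALIZED VIEW packing_summary AS
SELECT "Date", database, "Scale",
       COUNT("Weight") AS "Total Punnets",
       MAX("Time") - MIN("Time") AS "Delta Time"
FROM global_packing_data
GROUP BY "Date", database, "Scale";

-- Refreshed periodically so it picks up newly written rows;
-- the refresh itself re-executes the full query over the table.
REFRESH MATERIALIZED VIEW packing_summary;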
- Created a multi-column index based on the two columns that will always appear in the WHERE clause of our queries:
CREATE INDEX global_packing_data_index
ON global_packing_data("Date", database);
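Judging by the plan above, the planner is not using this index for the query and scans the whole table instead. As a generic sanity check (this queries PostgreSQL's standard statistics view, nothing specific to our schema), how often an index has actually been scanned can be read from:

SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'global_packing_data';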
- Made changes to the PostgreSQL server config file:
max_connections = 100
shared_buffers = 2GB
wal_buffers = 16MB
work_mem = 20971kB
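To confirm the new values are actually in effect after a restart, they can be read back in a session (standard PostgreSQL catalog, no assumptions beyond the parameter names above):

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('max_connections', 'shared_buffers', 'wal_buffers', 'work_mem');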
The PostgreSQL server is hosted on an Amazon EC2 instance. The virtual machine specs are: Ubuntu, 2 vCPUs, 8 GB RAM.
Any ideas on how we can improve performance further so that we can query all this data in under 2 seconds?