my first question on SO, so apologies in advance!
I have Postgresql (12.2) running on Windows Server 2016 and am experiencing poor performance with a simple self-join on 2 numeric columns. The table itself (form_content) is made up of 27 columns and approx 23,200,000 rows. The table stores responses by users using a form. Each form will generate multiple rows, and there is a hierarchy that is important e.g. row #5 may be a "parent" (relating to a section on the form) for row #s 6-10 (which may relate to responses to specific questions). I am limited somewhat in how the data is stored within this table. Each row has an "event_id" and a "parent_event_id". In the example above the "event_id" for row#5 will be the "parent_event_id" for row #s 6-10.
Therefore in order to extract the responses I am running the following query (for the purposes of this Q I am only including some of the columns within the select statement:
select
fc.event_id,
fc1.result_val
from
form_content fc
join
form_content fc1
on
fc.event_id = fc1.parent_event_id
where
fc.performed_dt_tm >= '2020-06-01'::timestamp
The above query takes between 2-3 minutes to run and there are approx 300,000 rows with performed_dt_tm >= 2020-06-01. I have indexes in place for both event_id and parent_event_id columns (and indexes for: (event_id,parent_event_id) and (parent_event_id,event_id)). I have run 'EXPLAIN ANALYZE' for the above query and it appears as if the table indexes are not being used:
QUERY PLAN |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
Gather (cost=1485180.36..2252551.11 rows=933343 width=16) (actual time=222023.060..224259.345 rows=415750 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Parallel Hash Join (cost=1484180.36..2158216.81 rows=388893 width=16) (actual time=221939.001..223549.551 rows=138583 loops=3) |
Hash Cond: (fc1.parent_event_id = fc.event_id) |
-> Parallel Index Only Scan using idx_pf_rfg_content_parenteventid on form_content fc1 (cost=0.56..559818.60 rows=9697340 width=8) (actual time=19.702..34247.142 rows=7761647 loops=3)|
Heap Fetches: 99129 |
-> Parallel Hash (cost=1482201.74..1482201.74 rows=120564 width=8) (actual time=174513.622..174513.624 rows=98016 loops=3) |
Buckets: 131072 Batches: 4 Memory Usage: 4544kB |
-> Parallel Seq Scan on form_content fc (cost=0.00..1482201.74 rows=120564 width=8) (actual time=8.798..174288.343 rows=98016 loops=3) |
Filter: (performed_dt_tm >= '2020-06-01 00:00:00'::timestamp without time zone) |
Rows Removed by Filter: 7663631 |
Planning Time: 27.449 ms |
Execution Time: 224663.439 ms
I am relatively new to Postgresql and when running similar queries on MSSQL performance was much better.
Many thanks in advance!