2

my first question on SO, so apologies in advance!

I have Postgresql (12.2) running on Windows Server 2016 and am experiencing poor performance with a simple self-join on 2 numeric columns. The table itself (form_content) is made up of 27 columns and approx 23,200,000 rows. The table stores responses by users using a form. Each form will generate multiple rows, and there is a hierarchy that is important e.g. row #5 may be a "parent" (relating to a section on the form) for row #s 6-10 (which may relate to responses to specific questions). I am limited somewhat in how the data is stored within this table. Each row has an "event_id" and a "parent_event_id". In the example above the "event_id" for row#5 will be the "parent_event_id" for row #s 6-10.

Therefore in order to extract the responses I am running the following query (for the purposes of this Q I am only including some of the columns within the select statement:

select
   fc.event_id,
   fc1.result_val
from
   form_content fc
join
   form_content fc1
on
   fc.event_id = fc1.parent_event_id
where
   fc.performed_dt_tm >= '2020-06-01'::timestamp

The above query takes between 2-3 minutes to run and there are approx 300,000 rows with performed_dt_tm >= 2020-06-01. I have indexes in place for both event_id and parent_event_id columns (and indexes for: (event_id,parent_event_id) and (parent_event_id,event_id)). I have run 'EXPLAIN ANALYZE' for the above query and it appears as if the table indexes are not being used:

QUERY PLAN                                                                                                                                                                                       |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
Gather  (cost=1485180.36..2252551.11 rows=933343 width=16) (actual time=222023.060..224259.345 rows=415750 loops=1)                                                                              |
  Workers Planned: 2                                                                                                                                                                             |
  Workers Launched: 2                                                                                                                                                                            |
  ->  Parallel Hash Join  (cost=1484180.36..2158216.81 rows=388893 width=16) (actual time=221939.001..223549.551 rows=138583 loops=3)                                                            |
        Hash Cond: (fc1.parent_event_id = fc.event_id)                                                                                                                                           |
        ->  Parallel Index Only Scan using idx_pf_rfg_content_parenteventid on form_content fc1  (cost=0.56..559818.60 rows=9697340 width=8) (actual time=19.702..34247.142 rows=7761647 loops=3)|
              Heap Fetches: 99129                                                                                                                                                                |
        ->  Parallel Hash  (cost=1482201.74..1482201.74 rows=120564 width=8) (actual time=174513.622..174513.624 rows=98016 loops=3)                                                             |
              Buckets: 131072  Batches: 4  Memory Usage: 4544kB                                                                                                                                  |
              ->  Parallel Seq Scan on form_content fc  (cost=0.00..1482201.74 rows=120564 width=8) (actual time=8.798..174288.343 rows=98016 loops=3)                                           |
                    Filter: (performed_dt_tm >= '2020-06-01 00:00:00'::timestamp without time zone)                                                                                              |
                    Rows Removed by Filter: 7663631                                                                                                                                              |
Planning Time: 27.449 ms                                                                                                                                                                         |
Execution Time: 224663.439 ms 

I am relatively new to Postgresql and when running similar queries on MSSQL performance was much better.

Many thanks in advance!

amantadine
  • 21
  • 2
  • Do you have an index on `(event_id, parent_event_id)`? –  Jun 10 '20 at 13:33
  • Yes I do (and no change in the explain analyze output or query performance). Thanks for the response - I'll edit the initial question to reflect! – amantadine Jun 10 '20 at 13:35
  • What do you do with the 400K rows result set ? Do you do the same thing as in SQL Server ? – pifor Jun 10 '20 at 13:51
  • Do you mean the 300K rows that have dates >= June 1st? If so I included the date as in practice I would be running this query with a date restriction. I mentioned the number of rows that are dated on or after June 1st to get an idea of the size of the self-join. – amantadine Jun 10 '20 at 13:55
  • 1
    Please in code questions give a [mre]--including cut & paste & runnable minimal code & minimal representative data given as code. For SQL that includes DBMS & DDL, including constraints, indexes & tabular initialization. For SQL performance that includes EXPLAIN results & statistics. Please research & summarize. For SQL that includes basics of optimization/performance--immediately leading to indexes, plans, statistics & SARGability. [Tips for asking a good SQL question](https://meta.stackoverflow.com/a/271056/3404097) Ask re optimization after you have learned & applied those basics. [ask] – philipxy Jun 10 '20 at 13:56
  • For this specific query, you could create a partial index such as `create index on form_content(dt_tm) where dt_tm >= '2020-06-01:00:00:00'`. But it cannot be a general solution if this date parameter can change. – pifor Jun 10 '20 at 14:07

1 Answers1

0

Hitting an index a few hundred thousand times is not free. The hash joins are probably actually faster than that.

Your slowest step here is getting rows based on the timestamp. Perhaps an index on (peformed_dt_tm) would help. Better yet, to get an index-only scan, try one on (performed_dt_tm, event_id).

Also, you should VACUUM ANALYZE the table. It is not obvious that it is poorly vacuumed (Heap Fetches: 99129 out of 7761647 rows is not obviously bad, but it could be better) but it is nice to know for sure it is well-vacuumed when doing analytical work, as it removes one more unknown variable.

Knowing it was faster in MSSQL doesn't help us, unless you can show us the plan that was used there.

jjanes
  • 37,812
  • 5
  • 27
  • 34