
I've been tinkering and reading for a while, but I can't find any optimization that works here. I've indexed the relevant ids used in the joins, run a manual vacuum, and also tried clustering on an index so that the query planner wouldn't decide it's more efficient to scan the whole table because of some scattered rows (though I don't know much about query planning).
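The maintenance commands were along these lines (which table I clustered, and on which index, is shown here only as an illustration, using the names that appear in the plans below):

VACUUM abc.log;
CLUSTER abc.log USING idx_abc_log_encounter_id;  -- index name taken from the plans below; the actual target may differ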

I am trying to get join results for a single id (for debugging purposes). I found that queries for some single ids take ~2 minutes, while most (99%?) return in under 1 second. Here are two EXPLAIN ANALYZE outputs (I changed some names with sed for confidentiality):

main=> explain analyze SELECT e.customer_id, l.*
            FROM abc.encounter e 
            JOIN abc.log l
            ON e.encounter_id = l.encounter_id
            AND e.customer_id = '1655563';
                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=2751.69..2566740.95 rows=13262 width=75) (actual time=122038.725..226694.004 rows=249 loops=1)
   Hash Cond: (l.encounter_id = e.encounter_id)
   ->  Seq Scan on log l  (cost=0.00..2190730.92 rows=99500192 width=66) (actual time=0.005..120825.675 rows=99500192 loops=1)
   ->  Hash  (cost=2742.81..2742.81 rows=710 width=18) (actual time=0.309..0.309 rows=89 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  Bitmap Heap Scan on encounter e  (cost=17.93..2742.81 rows=710 width=18) (actual time=0.037..0.197 rows=89 loops=1)
               Recheck Cond: (customer_id = '1655563'::text)
               Heap Blocks: exact=46
               ->  Bitmap Index Scan on idx_abc_encounter_customer_id  (cost=0.00..17.76 rows=710 width=0) (actual time=0.025..0.025 rows=89 loops=1)
                     Index Cond: (customer_id = '1655563'::text)
 Planning time: 0.358 ms
 Execution time: 226694.311 ms
(12 rows)

main=> explain analyze SELECT e.customer_id, l.*
            FROM abc.encounter e 
            JOIN abc.log l
            ON e.encounter_id = l.encounter_id
            AND e.customer_id = '121652491';
                                                                      QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=36.67..53168.06 rows=168 width=75) (actual time=0.090..0.422 rows=11 loops=1)
   ->  Index Scan using idx_abc_encounter_customer_id on encounter e  (cost=0.43..40.53 rows=9 width=18) (actual time=0.017..0.047 rows=17 loops=1)
         Index Cond: (customer_id = '121652491'::text)
   ->  Bitmap Heap Scan on log l  (cost=36.24..5888.00 rows=1506 width=66) (actual time=0.016..0.017 rows=1 loops=17)
         Recheck Cond: (encounter_id = e.encounter_id)
         Heap Blocks: exact=6
         ->  Bitmap Index Scan on idx_abc_log_encounter_id  (cost=0.00..35.86 rows=1506 width=0) (actual time=0.013..0.013 rows=1 loops=17)
               Index Cond: (encounter_id = e.encounter_id)
 Planning time: 0.361 ms
 Execution time: 0.478 ms
(10 rows)

I'll also add that, for a long-running query, even one that returns only ~250 rows after 2 minutes, adding "LIMIT 100" makes the query return instantly. I investigated whether the speed was related to the amount of data returned by the query, and I don't see any obvious trend. I can't help feeling that Postgres is quite mistaken (100x?) about which of its methods will be faster. What are my options here?
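For reference, the instant-returning variant is just the same slow query with the limit appended:

SELECT e.customer_id, l.*
  FROM abc.encounter e
  JOIN abc.log l
    ON e.encounter_id = l.encounter_id
   AND e.customer_id = '1655563'
 LIMIT 100;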

dcc310
  • `SELECT e.customer_id, l.* FROM abc.encounter e JOIN abc.log l ON e.encounter_id = l.encounter_id WHERE e.customer_id = '1655563'` could you try this query and see if it is faster? – Haleemur Ali Jan 17 '17 at 20:45
  • @HaleemurAli, that form takes just as long. I've also tried a "WHERE id IN (subselect)" style and plain old IN clauses as well. – dcc310 Jan 17 '17 at 20:54
  • did you get similar performance with `WHERE l.customer_id = '1655563'`? – Haleemur Ali Jan 17 '17 at 20:56
  • @HaleemurAli, customer_id is not a column of this "log" table. For our purposes, that table only has encounter_id. – dcc310 Jan 17 '17 at 21:09
  • Sorry, my typo: I meant to write `l.encounter_id = '1655563'` in my comment above. That should give the same result. – Haleemur Ali Jan 17 '17 at 21:18

1 Answer


PostgreSQL's row count estimate for encounter is off by a factor of almost 10 (710 estimated rows vs. 89 actual), and that inflated estimate makes the nested loop with repeated index scans look more expensive than it really is. My first attempt would be to improve that estimate.

For that you can change the statistics target for the column:

ALTER TABLE abc.encounter ALTER customer_id SET STATISTICS 1000;

A subsequent ANALYZE will then collect better statistics for that column. If 1000 is not enough, try 10000. With a better row count estimate you have a better chance of getting the best plan.
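For example, after raising the target, a plain ANALYZE of the one table from the question is enough (autovacuum would eventually do this as well):

ANALYZE abc.encounter;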

If the cost of the repeated index scans for the nested loop join is still overestimated in comparison with a sequential scan, you can lower the parameter random_page_cost from its default value 4 to something closer to seq_page_cost (default 1). That will bias PostgreSQL in favor of the nested loop join.
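One way to experiment with this, if you want to test before changing anything permanently, is to lower the setting only for the current session and re-run the slow query; the value 1.1 below is just an illustration, and on Amazon RDS (mentioned in the comments) a permanent change goes through the DB parameter group rather than ALTER SYSTEM:

SET random_page_cost = 1.1;  -- session-only; 1.1 is an illustrative value, not a recommendation

EXPLAIN ANALYZE
SELECT e.customer_id, l.*
  FROM abc.encounter e
  JOIN abc.log l ON e.encounter_id = l.encounter_id
 WHERE e.customer_id = '1655563';

-- To persist it on a self-managed PostgreSQL (on RDS use the parameter group instead):
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();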

Laurenz Albe
  • This solved my issue, thanks! Do you know why Postgres had such bad estimates? Is this just something that happens now and then, or does one have to make configuration mistakes to get it confused? If it helps, this is an Amazon RDS instance, and I haven't toyed with any parameters. – dcc310 Jan 18 '17 at 19:29
  • The default value of 100 for `default_statistics_target` tries to strike a middle ground: high enough to get decent statistics, low enough to not cause immoderate workload during statistics collection. Usually the default value works fine, but sometimes it has to be tuned. – Laurenz Albe Jan 19 '17 at 07:44