
I have this very long query, which I'll summarize here and paste in full at the bottom:

select * from
a
left join b t1 on a.x = t1.x
left join b t2 on a.y = t2.x
left join b t3 on a.z = t3.x
left join c on a.1 = c.1 and a.2 = c.2 and a.3 = c.3 -- call this predicate 1
where c.z is null

Tables a and c have an unclustered primary key on (1, 2, 3); a.x, a.y, and a.z can be null. You'll see in the plans linked below that a is 40k rows, c is 500k rows, and b is 7k rows. This query takes 10 minutes; it's faster to just do it in Excel by hand. My row count estimates are all wrong even after I ran VACUUM FULL ANALYZE, and the plan has nested loops where it shouldn't.

Here is the plan in its entirety: https://explain.depesz.com/s/w2uN

When I remove predicate 1, all the nested loops are gone, though the row estimates are still wrong. It also takes less than a second to run even right after I restart Postgres, so there is no caching effect:

https://explain.depesz.com/s/O7R

Any ideas on how to force the hash join? Alternatively, I could build 3 indexes on a 40k-row table that I truncate and reload all the time, since it's really a staging table that gets refreshed every week. That seems like overkill, but it probably doesn't hurt anything. Also, there's very little written on how to get the planner's row counts right other than VACUUM ANALYZE. Any ideas on this?
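For reference, here's what I'm planning to try next. enable_nestloop and per-column statistics targets are standard PostgreSQL knobs, though I don't know yet whether they'll help here:

-- Experiment 1: turn off nested loops for one transaction only, to see
-- whether the planner switches to a hash join (SET LOCAL reverts at COMMIT):
BEGIN;
SET LOCAL enable_nestloop = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;  -- the full query from the bottom of this question
COMMIT;

-- Experiment 2: raise the statistics target on the predicate-1 join columns
-- and re-analyze, in the hope of improving the row estimates:
ALTER TABLE h_ar_open ALTER COLUMN documentno SET STATISTICS 1000;
ALTER TABLE h_ar_hist ALTER COLUMN documentno SET STATISTICS 1000;
ANALYZE h_ar_open;
ANALYZE h_ar_hist;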

Finally, here's the code in its entirety

SELECT
    ar.cocd,
    ar.customer,
    ar.sales_doc,
    ar.documentno,
    ar.headertext,
    ar.clrng_doc,
    ar.typ,
    ar.net_due_dt,
    ar.amt,
    ar.lcurr,
    ar.amount_in_dc,
    ar.curr,
    ar.text,
    ar.doc_date,
    ar.clearing,
    ar.po_number,
    ar.payt,
    ar.st,
    ar.arrear,
    ar.gl,
    ar.user_name,
    ar.tcod,
    ar.itm,
    ar.inv_ref,
    ar.amount_in_loccurr2,
    ar.pmnt_date,
    ar.pk,
    ar.pstng_date,
    ar.account,
    ar.accty,
    ar.aging_bucket,
    ar.billdoc,
    ar.ftyp,
    ar.general_ledger_amt,
    ar.offstacct,
    ar.pmtmthsu,
    ar.purchdoc,
    ar.rcd,
    ar.transtype,
    ar.ym,
    COALESCE(ar_f2.branch, ar_f2.subbranch, ar_f2.account) AS forecast_company,
    ar_f.customer_name AS paying_company,
    ar_f3.customer_name AS shipping_company
   FROM h_ar_open ar
     LEFT JOIN h_ar_forecast ar_f ON ar.customer = ar_f.customer
     LEFT JOIN h_ar_forecast ar_f2 ON ar.soldto::double precision = ar_f2.customer
     LEFT JOIN h_ar_forecast ar_f3 ON ar.shipto::double precision = ar_f3.customer
     LEFT JOIN h_ar_hist hist ON ar.cocd = hist.cocd AND ar.itm = hist.itm AND ar.documentno = hist.documentno
  WHERE hist.documentno IS NULL;
  • Did I do something so repugnant that it would merit a downvote? I just can't imagine why one query runs in seconds and the other in potentially hours. – Henrietta Martingale Nov 28 '17 at 21:18
  • I added an index on b.x (or h_ar_forecast.customer if you read the long code) and it's still nested loops, but it's 3 seconds now. Are indexes really that responsive? https://explain.depesz.com/s/y9VY – Henrietta Martingale Nov 28 '17 at 21:57
  • `ON ar.soldto::double precision = ar_f2.customer` why the cast? In a join condition! – wildplasser Nov 29 '17 at 13:03
  • I know, and I don't know what to do about it. I really just want to make h_ar_forecast.customer text so there's no casting, but then I'd have to drop and recreate 10 different views that depend on this table, all in the right order. Perhaps this is a question in its own right, but is there an easier way than copying all the definitions out of pgAdmin and moving them around? – Henrietta Martingale Nov 29 '17 at 23:52
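Regarding that last comment: one way to avoid hand-copying the view definitions from pgAdmin is to script them out of the system catalog first. A rough sketch, where the LIKE pattern stands in for the actual view names:

-- Generate CREATE OR REPLACE VIEW statements for the dependent views from
-- pg_views before changing the column type (the name pattern is hypothetical):
SELECT 'CREATE OR REPLACE VIEW ' || schemaname || '.' || viewname
       || ' AS ' || definition AS ddl
FROM pg_views
WHERE viewname LIKE 'h_ar%';

The views would still have to be dropped in dependency order before the ALTER TABLE ... ALTER COLUMN ... TYPE change, but the recreate script then comes straight from the catalog rather than from pgAdmin.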

1 Answer


The index from your comment helps a lot, because it makes the nested loop join so much faster. 46417 sequential scans over 7000 rows suck.
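For reference, that index would be something like this (the index name is my own):

-- Index on the join column of the 7,000-row table (h_ar_forecast, "b" above):
CREATE INDEX h_ar_forecast_customer_idx ON h_ar_forecast (customer);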

Your problem is the misestimate.

Maybe you can force the join order with a trick like:

SELECT ...
FROM (SELECT ...
      FROM a
         LEFT JOIN b t1 ...
         LEFT JOIN b t2 ...
         LEFT JOIN b t3 ...
      OFFSET 0) x
   LEFT JOIN c ...;

If c is joined last, the misestimate does no harm. The OFFSET 0 acts as an optimization fence that keeps PostgreSQL from flattening the subquery, so the three joins to b are planned on their own and c is joined on top.
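Applied to the actual tables, the rewrite might look like this (a sketch that uses ar.* instead of repeating the full column list):

SELECT x.*
FROM (SELECT ar.*,
             COALESCE(ar_f2.branch, ar_f2.subbranch, ar_f2.account) AS forecast_company,
             ar_f.customer_name AS paying_company,
             ar_f3.customer_name AS shipping_company
      FROM h_ar_open ar
         LEFT JOIN h_ar_forecast ar_f ON ar.customer = ar_f.customer
         LEFT JOIN h_ar_forecast ar_f2 ON ar.soldto::double precision = ar_f2.customer
         LEFT JOIN h_ar_forecast ar_f3 ON ar.shipto::double precision = ar_f3.customer
      OFFSET 0) x
   LEFT JOIN h_ar_hist hist
     ON x.cocd = hist.cocd AND x.itm = hist.itm AND x.documentno = hist.documentno
WHERE hist.documentno IS NULL;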

Laurenz Albe