8

Generally speaking, is there a performance difference between using a JOIN to select rows versus an EXISTS where clause? Searching various Q&A web sites suggests that a join is more efficient, but I recall learning a long time ago that EXISTS was better in Teradata.

I do see other SO answers, like this and this, but my question is specific to Teradata.

For example, consider these two queries, which return identical results:

select   svc.ltv_scr, count(*) as freq
from     MY_BASE_TABLE svc
join     MY_TARGET_TABLE x
on       x.srv_accs_id=svc.srv_accs_id
group by 1
order by 1

-and-

select   svc.ltv_scr, count(*) as freq
from     MY_BASE_TABLE svc
where exists(
    select 1
    from   MY_TARGET_TABLE x
    where  x.srv_accs_id=svc.srv_accs_id)
group by 1
order by 1

The primary index (unique) on both tables is 'srv_accs_id'. MY_BASE_TABLE is rather large (200 million rows) and MY_TARGET_TABLE relatively small (200,000 rows).

There is one significant difference in the EXPLAIN plans: The first says the two tables are joined "by way of a RowHash match scan" and the second says "by way of an all-rows scan". Both say it is "an all-AMPs JOIN step" and the total estimated time is identical (0.32 seconds).

Both queries perform the same (I'm using Teradata 13.10).

A similar experiment to find non-matches comparing a LEFT OUTER JOIN with a corresponding IS NULL where clause to a NOT EXISTS sub-query does show a performance difference:

select   svc.ltv_scr, count(*) as freq
from     MY_BASE_TABLE svc
left outer join MY_TARGET_TABLE x
on       x.srv_accs_id=svc.srv_accs_id
where    x.srv_accs_id is null
group by 1
order by 1

-and-

select   svc.ltv_scr, count(*) as freq
from     MY_BASE_TABLE svc
where not exists(
    select 1
    from   MY_TARGET_TABLE x
    where  x.srv_accs_id=svc.srv_accs_id)
group by 1
order by 1 

The second query plan is faster (2.21 versus 2.14 seconds as described by EXPLAIN).

My example may be too trivial to see a difference; I'm just looking for coding guidance.

Community
  • 1
  • 1
BellevueBob
  • 9,498
  • 5
  • 29
  • 56
  • Do realise that the first set of queries (EXISTS) fetches .1% of the rows, the second set (NOT EXISTS) will fetch 99.9% of the rows. In terms of the number of fetched pages, the first query can use an index to minimise the number of page fetches, in the second one effectively **all** pages will be needed. That's why the planner chooses to scan all the pages in the second case. The difference between *left join ... NULL* and *NOT EXISTS* does not look too significant to me. (order? cache priming?) Disclaimer: I don't know teradata. – wildplasser Dec 03 '12 at 23:22
  • Yes, I realize that; I was just tried to make up an example to illustrate my question. Also, I just sat through a Teradata presentation on `EXPLAIN` and have learned that those are relative "cost" estimates and have nothing to do with "time". – BellevueBob Dec 04 '12 at 19:16

1 Answers1

7

NOT EXISTS is more efficient than using a LEFT OUTER JOIN to exclude records that are missing from the participating table using an IS NULL condition because the optimizer will elect to use an EXCLUSION MERGE JOIN with the NOT EXISTS predicate.

While your second test did not yield impressive results for the data sets you were using the performance increase from NOT EXISTS over a LEFT JOIN is very noticeable as your data volumes increase. Keep in mind that the tables will need to be hash distributed by the columns that participate in the NOT EXISTS join just like they would in the LEFT JOIN. Therefore, data skew can impact the performance of the EXCLUSION MERGE JOIN.

EDIT:

Typically, I would defer to EXISTS as a replacement for IN instead of using it for re-writing a join solution. This is especially true when the column(s) participating in the logical comparison can be NULL. That's not to say you couldn't use EXISTS in place of an INNER JOIN. Instead of an EXCLUSION JOIN you will end up with an INCLUSION JOIN. The INNER JOIN is in essence an inclusion join to begin with. I'm sure there are some nuances that I am overlooking but you can find those in the manuals if you wish to take the time to read them.

Rob Paller
  • 7,736
  • 29
  • 26
  • Great answer with respect to the second part of my question on `LEFT OUTER JOIN`. Any comment on the first part (`JOIN` versus `EXISTS`)? – BellevueBob Dec 04 '12 at 19:13
  • 1
    Personally , I think this is a lousy anwer. Any sane query analyser would detect the anti-join. (if there would be a difference in timing, the plan generator would be very wrong). btw: in the old time, "not exists" would always be faster, just because it came first, and the other variants (though equivalent) would perform worse. The "JOIN...NULL" form is relatively new. The IN variant has always been a loser, because duplicates have to be avoided by the subquery (apart from the NULL problem) – wildplasser Dec 04 '12 at 23:44
  • 1
    Thanks for the clarification to my response @wildplasser. (Hence the upvote on your comment.) I failed to mention the impact that duplicates could introduce when using the `IN` clause and the fact that `NOT EXISTS` predates the use of `JOIN ... NULL` in the SQL lexicon. – Rob Paller Dec 05 '12 at 15:05
  • 1
    There is an additional "human" advantage to the EXISTS syntax: it does not *pollute* the outer query with its local rangetable (but, beeing a correlated subquery, it does have to refer to the outer query, of course) This makes `SELECT * FROM a WHERE EXISTS (SELECT * FROM B where b.xx= a.yy)` possible; The `select *` would otherwise contain the entire rangetable, including duplicated names for joined columns. – wildplasser Dec 05 '12 at 15:14