1

I have a bit of a T-SQL conundrum that appears to be working but I was wondering if someone could attempt to give me a breakdown as to what is happening here.Consider the following script:

SELECT *
FROM TableA a
WHERE a.CustomerID NOT IN (SELECT b.CustomerID FROM TableB b WHERE a.CustomerID = b.CustomerID AND a.WorkOrder = b.WorkOrder)
AND a.[Date] > DATEADD(DD,-3,GETDATE())

I'm rather stumped as to how the compiler is not imploding on this script. How can it select where NOT IN on a subquery referencing the outer query? Get contents of TableA where CustomerID NOT IN CustomerID's from TableB etc... But when it finds a matching CustomerID in the subquery, the NOT IN kicks in and prevents the record from showing in the outer query select. I'm guessing this is where the compiler stops. But then because that specific CustomerID is not selected, it cannot join in the inner query, thus inner query does not select that CustomerID, then allowing the outer query to select that record? Yes? No? Falling down the rabbit hole? Is there a better way to write this?

Would appreciate if someone could elaborate on what happening here, or reference something that could explain. I could't really find anyone explaining this process, maybe not using the right search terms.

Thank you!

Xenophage
  • 85
  • 9

1 Answers1

3

It is called a "Correlated subquery" and "the subquery may be evaluated once for each row processed by the outer query".

So here, for each row of TableA, the subquery seeks the matching data from TableB and determines if the NOT IN condition is met. Then on to the next row in TableA to repeat that cycle until all relevant rows of TableA have been evaluated.

An alternative approach could be a "left excluding join" when you join the 2 tables but then ignore rows where a join exists.

SELECT
      *
FROM TableA a
LEFT JOIN TableB b ON a.CustomerID = b.CustomerID
                  AND a.WorkOrder = b.WorkOrder
WHERE b.CustomerID IS NULL
AND a.[Date] > DATEADD(DD, -3, GETDATE())
;

or another "semi-join" alternative by using NOT EXISTS:

SELECT
      *
FROM TableA a
WHERE NOT EXISTS (
      SELECT NULL
      FROM TableB b
      WHERE a.CustomerID = b.CustomerID
      AND a.WorkOrder = b.WorkOrder
      )
AND a.[Date] > DATEADD(DD, -3, GETDATE())
;

Note that the subquery used for |NOT| EXISTS doesn't have to return any values through the select clause. Some prefer to use "select 1" or "select *" when using EXISTS but in truth it really doesn't matter which is used. Using "select NULL" is my preference.

You can learn more about these alternatives by inspecting execution plans, see http://sqlfiddle.com/#!6/04064/2 for example.

Original query:enter image description here "Left excluding join" alternative:enter image description here "Not Exists" alternative:enter image description here

Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51
  • Thank you this helps clarify, I didn't know there was a name for it! Can you explain a little more on the WHERE NOT EXISTS (SELECT NULL? If the sub query returns a NULL, will that be considered "EXISTING" then? I like this approach better as my TableA select absolutely cannot return any dups due to join issues, TableA and TableB actually have around 300 columns, and is client data which is frequently un-trustworthy when directly joined together. – Xenophage Aug 11 '17 at 00:30
  • 1
    I did add a note regarding "select NULL" already. Not much else I can say, the point is it is treated as *"almost a join"* (a "left anti semi join" in fact) so the select clause of the subquery isn't really considered, just the where clause is sufficient to determine if a match exists or not. & Also: Note that the "left excluding join" won't multiply the rows of the result because only the UNMATCHED rows are returned. – Paul Maxwell Aug 11 '17 at 00:34