0

I've seen a great many different forms of left anti-semi join. Allow me to list and name every one that comes to mind. The following queries are intended to return every ROSTER_ID that's not used by any employee and who is the owner of that roster.

--1) NOT EXISTS, with a particular column selected in the subquery
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT EMP_ID FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--2) NOT EXISTS, with a particular column selected in the subquery and TOP (1) used
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT TOP (1) EMP_ID FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--3) NOT EXISTS, with all data selected in the subquery
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT * FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--4) NOT EXISTS, with all columns selected in the subquery and TOP (1) used
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT TOP (1) * FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--5) NOT EXISTS, but just use SELECT 1
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT 1 FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--6) NOT IN
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE EMP_ID NOT IN (SELECT EMP_ID FROM EMPLOYEES)
--7) LEFT JOIN
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
LEFT OUTER JOIN EMPLOYEES
ON EMPLOYEES.EMP_ID = ROSTERS.EMP_ID
WHERE EMPLOYEES.EMP_ID IS NULL

My question is this: Is there any objective reason - e.g. performance, backwards compatibility, portability, NULL-handling, ease of testing, extensibility, etc - to prefer any particular way of doing a left anti-semi join? I'm also interested to hear subjective reasons, e.g. style concerns or clarity, but only including them would be a non-answer.

My own research only points to the below, but it's all weak and probably subjective:

  • Microsoft's U-SQL documentation hints that they prefer to use the NOT IN version (#6 of mine) in T-SQL.
  • NULL handling with IN is always cause for concern, which is a mild reason to prefer EXISTS over it.
  • If you're seriously concerned about backwards compatibility, then I think my LEFT JOIN syntax didn't work in the 1980's.
  • Some people like to use EXCEPT, but I don't think that it generalises to cases where a column appears in only the outermost SELECT.
J. Mini
  • 1,868
  • 1
  • 9
  • 38
  • 2
    Well other than `NOT IN` just shouldn't be used, see [Should I use NOT IN, OUTER APPLY, LEFT OUTER JOIN, EXCEPT, or NOT EXISTS?](https://sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join). Anyway I don't think this is really answerable... "list the reasons" isn't solving a technical problem. In the end, you need to compare against _your_ requirements and how you judge "best" - if that's performance, there's no silver bullet, you simply need to test. If you "like" one syntax better, again, that's subjective and not a technical problem. – Aaron Bertrand May 25 '22 at 21:30
  • @AaronBertrand "*No, except for `NOT IN` sucking*" is an answer. If there's no clear reason to prefer one of the many other choices, then all an answer needs to do is say that. As for your link, that seems to argue against the `LEFT JOIN` variant for index-related reasons. I also must admit that I never even thought of using `OUTER APPLY`. – J. Mini May 25 '22 at 22:04
  • That might be an answer _for you_, but that doesn't make it a good fit _for this site_. – Aaron Bertrand May 25 '22 at 22:31
  • 1
    Easy, version 3. NOT IN is not recommended. Only version of NOT EXISTS recommended is "SELECT *" just as standard documentation denotes. SQL Server is smart enough to optimize the other NOT EXISTS versions similarly, but not all RDBM'S are. So that leaves ```JOIN``` vs ```NOT EXISTS (SELECT *...)``` In general, only use what you need. JOIN has functionality that you don't need. Personally, I have found ```NOT EXISTS``` generally performs more consistently. Anecdotal I know but it fits logically. Less opportunities for the optimizer to get confused by using the command with less functionality – Stephan May 26 '22 at 05:00
  • @AaronBertrand I truly don't see the issue. "Except for this case, which sucks [citation here], there is no reason to prefer any of these because they're usually the same [citation]" seems an entirely suitable answer for the site. Regardless, I'll stop arguing and see what the close votes conclude. – J. Mini May 26 '22 at 07:49

3 Answers3

2

No, except for NOT IN sucking

Aaron's article already is full of juicy information, and you have already spotted the danger of using NOT IN in combination with NULLS.

The only thing I can add that I didn't notice discussed is how, when using (not) exists, the things before FROM like select column or select top 1 * are essentially junk. I even posted a recommendation to sql server to have an alternative syntax of (anti) semi join table2 on <join condition>.

George Menoutis
  • 6,894
  • 3
  • 19
  • 43
0

EXISTS, NOT EXISTS return booolean TRUE/FALSE depending.The Select columns mention in EXISTS subquery do not matter,it only check whether it exists or not depending upon where cluase

For example,

SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT 1/0 FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)

In above notice 1/0 wont throw error ,this mean what ?

So your 1),2),3),4),5) are all equal in performance and result.

LEFT JOIN :It is use when you require column from LEFT JOIN table in resultset.

KumarHarsh
  • 5,046
  • 1
  • 18
  • 22
  • Are you saying that #7 is not equivalent to the rest? – J. Mini May 31 '22 at 17:56
  • Yes.Exists clause is like inner join in some sense,it will return result like INNER join except that it wont return result from EXISTS subquery.IN and NOT IN already explain above. – KumarHarsh May 31 '22 at 18:29
  • 1
    @J. Mini Left join with null can be used as a semijoin only as long as you are certain that each row of the left table joins with at most one row of the right one. If not (eg if one roster has multiple employees), while it will indeed remove the rows you don't need, it will bring back the others multiple times. – George Menoutis Jun 01 '22 at 04:56
  • Please don't encourage that "cute" `1/0` thing. There's just no reason for it, and [here's one case where it can break](https://www.brentozar.com/archive/2020/06/when-a-columnstore-index-makes-your-query-fail/). – Aaron Bertrand Jun 01 '22 at 13:11
  • @AaronBertrand,Thanks.I wanted to prove one point in my example.Also in my example I am using inside EXISTS statement. – KumarHarsh Jun 01 '22 at 17:57
  • My point remains: it's a terrible way to prove that point. – Aaron Bertrand Jun 01 '22 at 18:47
  • @GeorgeMenoutis But that's not true for *anti* semi joins, right? I can't imagine my example #7 ever returning the same row multiple times. – J. Mini Jun 03 '22 at 16:59
  • @J. Mini true enough – George Menoutis Jun 03 '22 at 18:56
0

I don't think that any of the answers have covered everything, so here's my attempt to put all of the answers and comments together:

  • Examples 1 through 5, i.e. the EXISTS examples, are all the same. This is not easy to prove (you might find KumarHarsh's example convincing) and relies on you trusting the optimiser.
  • Don't use example #6, i.e. IN. You just can't trust it with NULLs.
  • Example #7, i.e. LEFT JOIN, can duplicate results if you use it for a normal (i.e. not anti) semijoin. There is also suspicion around how the optimiser will treat it (see the link below). Subjectively, it takes what appears to be more typing and isn't all that good at communicating your intent. Overall, it's best avoided.

In conclusion, use EXISTS. Examples #2 and #4, which use TOP, pointlessly do so, so you might as well avoid them. As for if you should use examples #1, #3, or #5, that's just a matter of taste. I think that #5 most clearly communicates intent, but that's completely a matter of opinion.

This very helpful article also suggests OUTER APPLY, but I don't think that anyone takes that seriously.

J. Mini
  • 1,868
  • 1
  • 9
  • 38