
Compare these two queries: is it faster to put the filter on the join criteria or in the WHERE clause? I have always felt that it is faster on the join criteria because it reduces the result set at the earliest possible moment, but I don't know for sure.

I'm going to build some tests to see, but I also wanted to get opinions on which is clearer to read as well.

Query 1

SELECT      *
FROM        TableA a
INNER JOIN  TableXRef x
        ON  a.ID = x.TableAID
INNER JOIN  TableB b
        ON  x.TableBID = b.ID
WHERE       a.ID = 1            /* <-- Filter here? */

Query 2

SELECT      *
FROM        TableA a
INNER JOIN  TableXRef x
        ON  a.ID = x.TableAID
        AND a.ID = 1            /* <-- Or filter here? */
INNER JOIN  TableB b
        ON  x.TableBID = b.ID

EDIT

I ran some tests, and the results show that it is very close, but the WHERE clause is actually slightly faster! =)

I absolutely agree that it makes more sense to apply the filter in the WHERE clause; I was just curious about the performance implications.

ELAPSED TIME WHERE CRITERIA: 143016 ms
ELAPSED TIME JOIN CRITERIA: 143256 ms

TEST

SET NOCOUNT ON;

DECLARE @num    INT,
        @iter   INT

SELECT  @num    = 1000, -- Number of records in TableA and TableB, the cross table is populated with a CROSS JOIN from A to B
        @iter   = 1000  -- Number of select iterations to perform

DECLARE @a TABLE (
        id INT
)

DECLARE @b TABLE (
        id INT
)

DECLARE @x TABLE (
        aid INT,
        bid INT
)

DECLARE @num_curr INT
SELECT  @num_curr = 1
        
WHILE (@num_curr <= @num)
BEGIN
    INSERT @a (id) SELECT @num_curr
    INSERT @b (id) SELECT @num_curr
    
    SELECT @num_curr = @num_curr + 1
END

INSERT      @x (aid, bid)
SELECT      a.id,
            b.id
FROM        @a a
CROSS JOIN  @b b

/*
    TEST
*/
DECLARE @begin_where    DATETIME,
        @end_where      DATETIME,
        @count_where    INT,
        @begin_join     DATETIME,
        @end_join       DATETIME,
        @count_join     INT,
        @curr           INT,
        @aid            INT

DECLARE @temp TABLE (
        curr    INT,
        aid     INT,
        bid     INT
)

DELETE FROM @temp

SELECT  @curr   = 0,
        @aid    = 50

SELECT  @begin_where = CURRENT_TIMESTAMP
WHILE (@curr < @iter)
BEGIN
    INSERT      @temp (curr, aid, bid)
    SELECT      @curr,
                aid,
                bid
    FROM        @a a
    INNER JOIN  @x x
            ON  a.id = x.aid
    INNER JOIN  @b b
            ON  x.bid = b.id
    WHERE       a.id = @aid
        
    SELECT @curr = @curr + 1
END
SELECT  @end_where = CURRENT_TIMESTAMP

SELECT  @count_where = COUNT(1) FROM @temp
DELETE FROM @temp

SELECT  @curr = 0
SELECT  @begin_join = CURRENT_TIMESTAMP
WHILE (@curr < @iter)
BEGIN
    INSERT      @temp (curr, aid, bid)
    SELECT      @curr,
                aid,
                bid
    FROM        @a a
    INNER JOIN  @x x
            ON  a.id = x.aid
            AND a.id = @aid
    INNER JOIN  @b b
            ON  x.bid = b.id
    
    SELECT @curr = @curr + 1
END
SELECT  @end_join = CURRENT_TIMESTAMP

SELECT  @count_join = COUNT(1) FROM @temp
DELETE FROM @temp

SELECT  @count_where AS count_where,
        @count_join AS count_join,
        DATEDIFF(millisecond, @begin_where, @end_where) AS elapsed_where,
        DATEDIFF(millisecond, @begin_join, @end_join) AS elapsed_join
Jon Erickson
    Depending on data, WHERE vs JOIN criteria can return different resultsets. – OMG Ponies Mar 24 '10 at 17:35
    @OMG Ponies very true, but a lot of times it doesn't as well. – Jon Erickson Mar 24 '10 at 19:10
    I would not call a difference below 5% a difference – they are the same. If you want significance for a 2% difference, better run the tests 1000 times to make sure it is not just random. – TomTom Nov 22 '11 at 10:58
  • The benefit is in filtering the data before joining, so if the filter were on x.ID you would be more likely to see an improvement than with a.ID – MikeT Mar 15 '13 at 09:10
    possible duplicate of [SQL Filter criteria in join criteria or where clause which is more efficient](http://stackoverflow.com/questions/1401889/sql-filter-criteria-in-join-criteria-or-where-clause-which-is-more-efficient) – bluish Sep 29 '14 at 07:27
  • @OMGPonies Putting a condition in WHERE vs ON, assuming names are in scope, can only make a result set difference when there are outer joins. – philipxy Aug 31 '17 at 03:23
  • I think there are some complex or fringe cases where it does make a difference. I've had situations where the where clause query runs for minutes, but the join criteria query is instant. It's relatively rare in my experience - and I've never been able to tell what was going on. – John Mar 24 '21 at 17:56

10 Answers


Performance-wise, they are the same (and produce the same plans).

Logically, you should choose the form that still makes sense if you replace the INNER JOIN with a LEFT JOIN.

In your case, that looks like this:

SELECT  *
FROM    TableA a
LEFT JOIN
        TableXRef x
ON      x.TableAID = a.ID
        AND a.ID = 1
LEFT JOIN
        TableB b
ON      x.TableBID = b.ID

or this:

SELECT  *
FROM    TableA a
LEFT JOIN
        TableXRef x
ON      x.TableAID = a.ID
LEFT JOIN
        TableB b
ON      b.id = x.TableBID
WHERE   a.id = 1

The former query will not return any actual matches for a.id other than 1 (the other rows come back padded with NULLs), so the latter syntax (with WHERE) is logically more consistent.
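The divergence between the two LEFT JOIN forms is easy to check with a small sketch (SQLite via Python here for portability; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE TableA (ID INTEGER);
    CREATE TABLE TableXRef (TableAID INTEGER, TableBID INTEGER);
    INSERT INTO TableA VALUES (1), (2);
    INSERT INTO TableXRef VALUES (1, 10), (2, 20);
""")

# Filter in the ON clause: every TableA row survives; rows where
# a.ID <> 1 are padded with NULLs instead of being eliminated.
on_rows = cur.execute("""
    SELECT a.ID, x.TableBID
    FROM TableA a
    LEFT JOIN TableXRef x ON x.TableAID = a.ID AND a.ID = 1
""").fetchall()

# Filter in the WHERE clause: rows where a.ID <> 1 are eliminated.
where_rows = cur.execute("""
    SELECT a.ID, x.TableBID
    FROM TableA a
    LEFT JOIN TableXRef x ON x.TableAID = a.ID
    WHERE a.ID = 1
""").fetchall()

print(sorted(on_rows))     # two rows, one of them NULL-padded
print(sorted(where_rows))  # only the single matching row
```

Once the join is outer, the two placements are no longer interchangeable, which is exactly why the WHERE form better expresses "I only want a.ID = 1".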

Quassnoi
  • When I draw the sets I understood why the second case is more consistent. In the former query, the constraint `a.id = 1` applies only for the intersection, not the left part excluding the intersection. – FtheBuilder Jul 22 '16 at 01:33
    In the first example there may be rows where `a.id != 1`, the other will only have rows where `a.id = 1`. – FtheBuilder Jul 22 '16 at 01:37
    Your language is unclear. "Logically, you should make the operation that still has sense if ..." and "logically more consistent" don't make sense. Can you please rephrase? – philipxy Aug 31 '17 at 03:17

For inner joins it doesn't matter where you put your criteria. The SQL compiler will transform both into an execution plan in which the filtering occurs below the join (i.e., as if the filter expression appeared in the join condition).

Outer joins are a different matter, since the place of the filter changes the semantics of the query.
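That equivalence for inner joins can be sanity-checked with a quick sketch (SQLite via Python; made-up data), confirming both placements return identical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE x (aid INTEGER, bid INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO x VALUES (1, 10), (2, 20), (3, 30);
""")

# Same inner join, with the filter expressed in the WHERE clause...
in_where = cur.execute(
    "SELECT a.id, x.bid FROM a INNER JOIN x ON a.id = x.aid WHERE a.id = 1"
).fetchall()

# ...versus folded into the join condition.
in_on = cur.execute(
    "SELECT a.id, x.bid FROM a INNER JOIN x ON a.id = x.aid AND a.id = 1"
).fetchall()

# For inner joins the two placements are interchangeable.
assert in_where == in_on == [(1, 10)]
```

Note this only demonstrates that the result sets match; whether the execution plans match is engine-specific, as the answer explains.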

Remus Rusanu
    So in inner joins, does it first calculate the filter and then join the output of the filter with the other table, or does it first join the two tables and then apply the filter? – Ashwin Jun 07 '12 at 06:51
  • @Remus Rusanu - could you please elaborate on how the semantics gets changed in case of Outer-join ? I get different results based on the position of the filter, but unable to understand why – Ananth Oct 18 '17 at 19:53
    @Ananth with an outer join you get NULLs for all columns of the joined table where the JOIN condition does not match. Filters will not satisfy the NULL and eliminate the rows, turning the OUTER join in effect into an INNER join. – Remus Rusanu Oct 19 '17 at 07:21
    @Ananth I achieved my required optimizations based on your comment. My change was from WHERE x.TableAID = a.ID or x.TableAID is null to ON x.TableAID = a.ID. Changing the location of the filter on an OUTER join let the compiler know to Filter then Join rather than Join then Filter. It was also able to use the index on that column because it didn't have to match Null. Query response changed from 61 seconds to 2 seconds. – Ben Gripka Sep 03 '20 at 15:38

As far as the two methods go.

  • JOIN/ON is for joining tables
  • WHERE is for filtering results

Whilst you can use them differently, it always seems like a smell to me.

Deal with performance when it is a problem. Then you can look into such "optimisations".

Robin Day

With any query optimizer worth a cent... they are identical.

TomTom
  • I'm quite sure that, with any real workload, they are not identical. If you have almost no data, then the question is worthless. – eKek0 Mar 24 '10 at 17:44
    Check it out under a real workload. Basically, if they generate the same execution plan, they are identical in performance. At least for normal/simple cases (i.e. not one joining 14 tables) I am quite sure they are identical ;) – TomTom Mar 24 '10 at 18:45

In PostgreSQL they are the same. We know this because if you run EXPLAIN ANALYZE on each of the queries, the plan comes out the same. Take this example:

# explain analyze select e.* from event e join result r on e.id = r.event_id and r.team_2_score=24;

                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=27.09..38.22 rows=7 width=899) (actual time=0.045..0.047 rows=1 loops=1)
   Hash Cond: (e.id = r.event_id)
   ->  Seq Scan on event e  (cost=0.00..10.80 rows=80 width=899) (actual time=0.009..0.010 rows=2 loops=1)
   ->  Hash  (cost=27.00..27.00 rows=7 width=8) (actual time=0.017..0.017 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on result r  (cost=0.00..27.00 rows=7 width=8) (actual time=0.006..0.008 rows=1 loops=1)
               Filter: (team_2_score = 24)
               Rows Removed by Filter: 1
 Planning time: 0.182 ms
 Execution time: 0.101 ms
(10 rows)

# explain analyze select e.* from event e join result r on e.id = r.event_id where r.team_2_score=24;
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=27.09..38.22 rows=7 width=899) (actual time=0.027..0.029 rows=1 loops=1)
   Hash Cond: (e.id = r.event_id)
   ->  Seq Scan on event e  (cost=0.00..10.80 rows=80 width=899) (actual time=0.010..0.011 rows=2 loops=1)
   ->  Hash  (cost=27.00..27.00 rows=7 width=8) (actual time=0.010..0.010 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on result r  (cost=0.00..27.00 rows=7 width=8) (actual time=0.006..0.007 rows=1 loops=1)
               Filter: (team_2_score = 24)
               Rows Removed by Filter: 1
 Planning time: 0.140 ms
 Execution time: 0.058 ms
(10 rows)

They both have the same minimum and maximum cost, as well as the same query plan. Also, notice that even in the top query the team_2_score condition gets applied as a 'Filter'.

Peter Graham

Rule #0: Run some benchmarks and see! The only way to really tell which will be faster is to try it. These types of benchmarks are very easy to perform using the SQL profiler.

Also, examine the execution plan for the query written with a JOIN and with a WHERE clause to see what differences stand out.

Finally, as others have said, these two should be treated identically by any decent optimizer, including the one built into SQL Server.

3Dave

I would guess the first, because it makes a more specific filter over the data. But you should look at the execution plan, as with any optimization, because it can differ greatly depending on the size of the data, server hardware, etc.

eKek0

Is it faster? Try it and see.

Which is easier to read? The first, to me, looks more "correct", as the moved condition has nothing really to do with the join.

David M

It is really unlikely that the placement of this filter will be the deciding factor for performance. I am not intimately familiar with execution planning in T-SQL, but it is likely that both forms will be optimized automatically into similar plans.

Joe Mastey

This answer is meant for those landing here looking to optimise queries with joins on large databases.

As most answers suggest, filtering in the join criteria does not make a significant difference in performance. In some cases, however, the joins can be avoided altogether. ORMs (Object-Relational Mappers) like the one in the Django framework generate queries with table joins when querying fields from related tables. If we are already paginating the results, avoiding these joins altogether can make a significant difference in performance.

Consider a Message table which has foreign keys thread (to the Thread table), sender (to the User table) and topic (to the Topic table):

msgs = Message.objects.filter(content__icontains=keyword).values('id','msg_type','thread',
'thread__title','thread__created','created','sender__name','sender__email',
'attachment','permalink','is_marked','topic__title','topic__admin__name',
'topic__created','topic__description').order_by('-is_marked','-id')

The above queryset creates a query that has joins on the models (tables) Topic, Thread and User. All we need are the messages containing the keyword, along with a few fields from the related tables: name from the User table, title from the Thread table, and so on.

A search along with all these joins will be way too slow.

Instead, splitting the query like the following will be much faster:

# fetching only fields from Message table
msgs = Message.objects.filter(content__icontains=keyword).values('id','msg_type','thread','created',
'sender','attachment','permalink','is_marked','topic').order_by('-is_marked','-id')

# paginating the results
paginator = Paginator(msgs,100)
page = request.GET.get('page')
try:
    items = paginator.page(page)
except PageNotAnInteger:
    items = paginator.page(1)
except EmptyPage:
    items = paginator.page(paginator.num_pages)

# getting primary keys of records to fetch from Thread, User, Topic tables
thread_ids,sender_ids,topic_ids = set(),set(),set()
items.object_list = list(items.object_list)
for msg in items.object_list:
    thread_ids.add(msg['thread'])
    sender_ids.add(msg['sender'])
    topic_ids.add(msg['topic'])

# Fetching only the required records from Thread, User, Topic tables
thread_details = {i['id']:i for i in Thread.objects.filter(id__in=thread_ids).values('id','title','created') }
sender_details = {i['id']:i for i in User.objects.filter(id__in=sender_ids).values('id','name','email') }
topic_details = {i['id']:i for i in Topic.objects.filter(id__in=topic_ids).values('id','title','created') }

# saving the fetched data to the results
for msg in items.object_list:
    t = thread_details.get(msg['thread'],{})
    msg['thread_title'] = t.get('title')
    msg['thread_created'] = t.get('created')

    s = sender_details.get(msg['sender'],{})
    msg['sender_name'] = s.get('name','')
    msg['sender_email'] = s.get('email','')

    t = topic_details.get(msg['topic'],{})
    msg['topic_title'] = t.get('title')
    msg['topic_created'] = t.get('created')

print(items.object_list)

Here, the joins are avoided at the database level and done in Python code instead. This works as long as you are not sorting or filtering on any field in the related models.

Note: this is not useful unless your table has millions of records.
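Stripped of Django, the application-side join above boils down to a pair of primary-key lookup dictionaries; a minimal sketch with all names and rows made up for illustration:

```python
# Hypothetical in-memory rows standing in for one page of query results.
messages = [
    {"id": 1, "thread": 7, "sender": 3},
    {"id": 2, "thread": 7, "sender": 4},
]
threads = [{"id": 7, "title": "Greetings"}]
users = [{"id": 3, "name": "Ada"}, {"id": 4, "name": "Grace"}]

# Build id -> row lookup tables once, then decorate each message
# in Python instead of joining at the database level.
thread_by_id = {t["id"]: t for t in threads}
user_by_id = {u["id"]: u for u in users}

for msg in messages:
    msg["thread_title"] = thread_by_id.get(msg["thread"], {}).get("title")
    msg["sender_name"] = user_by_id.get(msg["sender"], {}).get("name")

print(messages)
```

The win comes from only fetching related rows for the current page (a handful of primary keys) rather than letting the database join every candidate row before pagination.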

Mohammed Shareef C
  • This answers a completely different question, specific to the way an application is designed. If you have lookup data then yes, it's cheaper to cache it instead of retrieving it from the database every time. – Panagiotis Kanavos Aug 28 '23 at 10:59
  • @PanagiotisKanavos I just gave an example from Django. The example is specific but the problem is not specific to Django. When we have to search in a table with a large number of entries, but also want to fetch fields from related tables, avoiding joins can be effective. Giving an example in Python or Django doesn't make it specific to Python or Django – Mohammed Shareef C Aug 28 '23 at 11:22
  • This is not a direct answer to the question. I landed on this page because of a scenario like the one I described. I was trying to optimise the query, and I ended up avoiding the joins entirely. I think not all the visits to this page (question) are out of curiosity about the performance comparison of the `where` clause and filter criteria on `join`. Many are trying to optimize their query – Mohammed Shareef C Aug 28 '23 at 11:32