8

I use Microsoft SQL Server 2008 (SP1, x64). I have two queries that do the same, or so I think, but they are have completely different query plans and performance.

Query 1:

SELECT c_pk
FROM table_c
WHERE c_b_id IN (SELECT b_id FROM table_b WHERE b_z = 1)
  OR  c_a_id IN (SELECT a_id FROM table_a WHERE a_z = 1)

Query 2:

SELECT c_pk
FROM table_c
LEFT JOIN (SELECT b_id FROM table_b WHERE b_z = 1) AS b ON c_b_id = b_id
LEFT JOIN (SELECT a_id FROM table_a WHERE a_z = 1) AS a ON c_a_id = a_id
WHERE b_id IS NOT NULL
  OR  a_id IS NOT NULL

Query 1 is fast as I would expect, whereas query 2 is very slow. The query plans look quite different.

I would like query 2 to be as fast as query 1. I have software that uses query 2, and I cannot change that into query 1. I can change the database.

Some questions:

  • why are the query plans different?
  • can I "teach" SQL Server somehow that query 2 is equal to query 1?

All tables have (clustered) primary keys and proper indexes on all columns:

CREATE TABLE table_a (
  a_pk   int NOT NULL PRIMARY KEY,
  a_id   int NOT NULL UNIQUE,
  a_z    int
)
GO
CREATE INDEX IX_table_a_z ON table_a (a_z)
GO

CREATE TABLE table_b (
  b_pk   int NOT NULL PRIMARY KEY,
  b_id   int NOT NULL UNIQUE,
  b_z    int
)
GO
CREATE INDEX IX_table_b_z ON table_b (b_z)
GO

CREATE TABLE table_c (
  c_pk   int NOT NULL PRIMARY KEY,
  c_a_id int,
  c_b_id int
)
GO
CREATE INDEX IX_table_c_a_id ON table_c (c_a_id)
GO
CREATE INDEX IX_table_c_b_id ON table_c (c_b_id)
GO

The tables are not modified after filling initially. I'm the only one querying them. They contains millions of records (table_a: 5M, table_b: 4M, table_c: 12M), but using only 1% gives similar results.

Edit: I tried adding FOREIGN KEYs for c_a_id and c_b_id, but that only made query 1 slower...

I hope someone can have a look at the query plans and explain the difference.

Michel de Ruiter
  • 7,131
  • 5
  • 49
  • 74
  • What is the motivation for this? `IN/EXISTS` is generally more efficient than `OUTER JOIN ... NULL` in SQL Server and the first query seems clearer to me so why not just use the first one? – Martin Smith Mar 13 '12 at 11:52
  • 2
    @Martin "I have software that uses query 2, and I cannot change that" – Michel de Ruiter Mar 13 '12 at 11:55
  • In general the queries are not the same as the Join can bring in duplicate rows whereas the semi join doesn't. Though haven't checked if you have any constraints that prevent this yet. – Martin Smith Mar 13 '12 at 11:59
  • @Martin `a_id` and `b_id` are unique so the joins will not duplicate rows. – Michel de Ruiter Mar 13 '12 at 12:03
  • I'm 99% convinced that they do appear to have the same semantics in that case then. But that doesn't mean the QO has the necessary transformation rules to convert the one to the other. Quite often the way the query is written can affect the plan. Have you tried using a plan quide on the query (with the `USE PLAN` hint) to try and get the second one to use the plan from the first? – Martin Smith Mar 13 '12 at 12:14
  • @Martin I have no experience using plan guides. Looks like those will not help me anyway as I have many more columns similar to `a_z` and `b_z` (say `a_y` and `b_y`) with the exact same issue. – Michel de Ruiter Mar 13 '12 at 12:20
  • I just tried a plan guide this end and it refused to run it so SQL Server doesn't seem able to generate that plan for it. – Martin Smith Mar 13 '12 at 12:31

3 Answers3

1

Join are slower, let me say by design. First query uses a sub-query (cacheable) to filter records so it'll produce less data (and less accesses to each table).

Did you read these:

What I mean is that with IN the DB can do better optimizations like removing duplicates, stop at first match and similar (and these are from school memories so I'm sure it'll do much better). So I guess the question isn't why QP is different but how smart how deep optimizations can go.

Adriano Repetti
  • 65,416
  • 20
  • 137
  • 208
0

You are comparing non equivalent queries also you are using left join in quite unusual way. Generally if yours intention was to select all entries in table_c which has linked records either in table_a or table_b you should use exists statement:

SELECT c_pk 
FROM table_c 
WHERE  Exists( 
 SELECT 1
 FROM table_b 
 WHERE b_z = 1 and c_b_id = b_id 
) OR  Exists( 
 SELECT 1 
 FROM table_a 
 WHERE a_z = 1 and c_a_id = a_id
) 
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Vitaliy Kalinin
  • 1,791
  • 12
  • 20
  • 1
    If you post code, XML or data samples, **PLEASE** highlight those lines in the text editor and click on the "code samples" button ( `{ }` ) on the editor toolbar to nicely format and syntax highlight it! – marc_s Mar 13 '12 at 12:00
0

Since you can't change the query, at least you can improve the query's environment.

  1. Highlight your query, right-click on it in SSMS and select "Analyze Query in Database Engine Tuning Advisor."
  2. Run the analysis to find out if you need any additional indexes or statistics built.
  3. Heed SQL Server's advice.
JeffO
  • 7,957
  • 3
  • 44
  • 53
  • I don't see any "Tuning Advisor" in my SSMS. The estimated execution plan did not show any missing indexes. All columns are indexed already, what do you think there is to add? – Michel de Ruiter Mar 13 '12 at 13:31
  • 1
    @MicheldeRuiter - Doubt there's anything you can add. You'll need to rewrite the query or live with the performance. SQL Server doesn't seem able to convert the `OR` to a `UNION` in this case so it is processing all rows in `table_c` outer joined onto the other two tables then doing the filter on the end. – Martin Smith Mar 13 '12 at 13:52
  • 1
    You probably have the free version where this is not available or you didn't install it. – JeffO Mar 15 '12 at 14:30