
We ran into a strange situation on SQL Server 2008 (SP1) - 10.0.2531.0 (X64) - Windows Server 2008 SP2 (X64).

Here is one heavy query:

select t1.id, t2.id 
from t1, t2
where 
     t1.id = t2.ext_id
     and isnull(t1.vchCol1, 'Null') = isnull(t2.vchCol1, 'Null')
     and isnull(t1.vchCol2, 'Null') = isnull(t2.vchCol2, 'Null')
     .... and about 10 more comparisons with ISNULL

UPD: All columns in the comparison (except the IDs) are varchar(~30...200).
T1 has ~130 million rows, T2 has ~300k rows.

On a rather big dev server this query runs for ~5 hours - slow, but what can we do?

While investigating possible optimisations, we found that changing ISNULL to COALESCE in the query above gives a 2x performance gain - the query now runs in ~2 hours.

UPD: When we remove all the ISNULL checks and use just t1.vchCol1 = t2.vchCol1, the query finishes in 40 minutes.

Question: is this known behavior, and should we avoid using ISNULL everywhere?

zmische
  • What datatype is vchCol1 etc? – gbn May 27 '11 at 13:05
  • I would think `(t1.vchCol1 = t2.vchCol1)` would be better for performance than running a function (which likes to force a table scan), but it requires `ANSI_NULLS` to be set to `OFF`. – cHao May 27 '11 at 13:11
  • 2
    Have you looked at the query execution plan? – David May 27 '11 at 13:12
  • We can't use ANSI_NULLS=OFF. The query plan was simple: two table scans -> compute scalar (for each table) -> hash join -> insert. – zmische May 27 '11 at 13:20
  • 3
    I'm sure there is nothing you can do about it, but this is just one more reason NULLs are evil! – KM. May 27 '11 at 13:21
  • We'll try to eliminate `Nulls` via sql logic (populate columns with GAP values instead of Nulls) and use "IS NULL OR" construction advised below. – zmische May 27 '11 at 13:26
  • 1
    @zmische - Did you get to the bottom of the reason for the discrepancy between the two? – Martin Smith May 27 '11 at 15:26

4 Answers

10

I wonder if you'd see an improvement by splitting the cases out explicitly:

...
AND ((t1.vchCol1 = t2.vchCol1) OR (t1.vchCol1 IS NULL AND t2.vchCol1 IS NULL))
AND ((t1.vchCol2 = t2.vchCol2) OR (t1.vchCol2 IS NULL AND t2.vchCol2 IS NULL))
...
Joe Stefanelli
  • Agreed. This pattern [is understood by SQL Server](http://stackoverflow.com/questions/4998151/how-do-i-pass-a-dbnull-value-to-a-parameterized-select-statement/4998247#4998247). When looking at this for the linked answer (slightly different as comparing with a parameter rather than another column though) I found that it just appears as a straightforward equality in the plan. – Martin Smith May 27 '11 at 13:17
  • So you think this is normal behavior - that `COALESCE` is 2x faster than `ISNULL`, and all we can do is split the cases so that SQL Server clearly understands how to optimize? So it should not be considered a performance bug in SQL Server? – zmische May 27 '11 at 13:36
  • @zmische - I would have thought both `coalesce` and `isnull` expressions are unsargable and suboptimal. Not sure why you would see a difference in performance between the two. How do the plans differ? Does one of them have more accurate row estimates than the other? Is the order of the tables into the hash join the same or reversed? What does `SET STATISTICS IO ON` and `SET STATISTICS TIME ON` say about logical reads and CPU time? Maybe the memory grant isn't sufficient in the `ISNULL` version for the hash join because of different estimates. – Martin Smith May 27 '11 at 13:38
  • @Martin - thx for questions. I don't have exact statistics - another team using that server now. But I'll try to address that for the next runs. – zmische May 27 '11 at 13:44
  • @zmische - I'm thinking that insufficient memory grant and hash spills is quite a likely candidate. You should be able to see this from the estimated plans without needing to run the whole query just by looking at the estimated amount of rows and row size for both. – Martin Smith May 27 '11 at 13:46
3

Most of the articles you'll find on this subject seem to contradict this: ISNULL is (marginally) faster than COALESCE.

Differences between ISNULL and COALESCE

COALESCE basically translates to a CASE expression, whereas ISNULL is a built-in function implemented in the database engine.
...
This will make a performance difference and queries with COALESCE often fare worse here.

ISNULL vs. COALESCE

I ran these tests several times on a few different servers, and ISNULL appears to pretty consistently out-perform COALESCE by an average of 10 or 12 percent. But that's the difference between 6 seconds and 5.3 seconds (the approximate average runtimes per test on my servers), over the course of a million executions. Hardly worth the functionality and standards compliance sacrifice, at least in the scenarios I use these functions for.

COALESCE vs ISNULL vs IS NULL OR

The best performer is the IS NULL OR case, while the difference between all three of them is minor.
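To illustrate the first quoted point, here is a sketch of the CASE expansion (column names borrowed from the question; this is the standard-mandated shorthand, not necessarily SQL Server's exact internal rewrite):

```sql
-- Per the SQL standard, COALESCE(a, b) is shorthand for a CASE expression,
-- so the COALESCE version of the predicate effectively becomes:
CASE WHEN t1.vchCol1 IS NOT NULL THEN t1.vchCol1 ELSE 'Null' END
    = CASE WHEN t2.vchCol1 IS NOT NULL THEN t2.vchCol1 ELSE 'Null' END

-- ISNULL, by contrast, is a built-in function. Note it also casts the
-- replacement value to the data type of the first argument, which can
-- silently truncate the replacement if that column's type is too short.
```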

Andriy M
Lieven Keersmaekers
  • But in our case, after switching to COALESCE, we got a boost from 5h to 2h. What's the reason? To my mind ISNULL should be faster, which is why I'm so confused. – zmische May 27 '11 at 13:09
  • 2
    Just to add, In the case `ISNULL(subquery,X)` or `COALESCE(subquery,X)` the second one evaluates the subquery twice. – Martin Smith May 27 '11 at 13:14
2

You may want to consider adding a computed column to each table that holds a checksum value. Then, create an index on the ID column and the checksum value, and finally use the checksum value in the join. Something like this:

Alter Table T1 Add CheckSumId As CHECKSUM(vchCol1, vchCol2, vchCol3)
Alter Table T2 Add CheckSumId As CHECKSUM(vchCol1, vchCol2, vchCol3)

Create NonClustered index idx_T1_Checksum On T1(id, CheckSumId)
Create NonClustered index idx_T2_Checksum On T2(ext_id, CheckSumId)

Then your query would become...

select t1.id, t2.id 
from t1 Inner Join t2
       On t1.id = t2.ext_id
       And T1.CheckSumId = T2.CheckSumId
where  isnull(t1.vchCol1, 'Null') = isnull(t2.vchCol1, 'Null')
     and isnull(t1.vchCol2, 'Null') = isnull(t2.vchCol2, 'Null')

This will, of course, use extra index space, but it's just two integers per row, which should be very efficient. There will also be a performance penalty on each insert, update and delete because another index needs to be maintained. However, I suspect this will have a large positive impact on performance.

George Mastros
1

I realize this is a year later, but...

For this kind of column-by-column comparison, you might consider using EXCEPT. EXCEPT treats NULL as just another value, instead of "it could be anything!", as I like to call it.

"When you compare rows for determining distinct values, two NULL values are considered equal." --from http://msdn.microsoft.com/en-us/library/ms188055.aspx