SQL Server LEFT JOIN fails to match rows without JOIN hint

Question

I have what appears to be a corrupt index?

Here is what is happening. I have two table-functions which the first is a set of cases and the second is a set of aware dates. These two sets have a 1 (case) to 0 or 1 (aware date) relationship. Normally I query them like;

SELECT c.CaseID, a.AwareDate  
FROM Cases(@date) AS c  
LEFT JOIN AwareDates(@date) AS a ON c.CaseID = a.CaseID;

The trouble is that not all of the rows from AwareDates which match seem to be JOIN'd. If I add a join hint, they then do. say;

SELECT c.CaseID, a.AwareDate  
FROM Cases(@date) AS c  
LEFT MERGE JOIN AwareDates(@date) AS a ON c.CaseID = a.CaseID;

What I notice from the query plan is that adding the join hint adds a sort of the AwareDate data before the join which is not there otherwise. Also, the query planner flips the join to a RIGHT OUTER JOIN when there is no hint, and of course keeps the LEFT JOIN where the hint is present.

I've done the following with no errors detected;

DBCC UPDATEUSAGE (0) WITH INFO_MESSAGES, COUNT_ROWS;  
EXECUTE sp_updatestats 'resample';  
DBCC CHECKDB (0) WITH ALL_ERRORMSGS, EXTENDED_LOGICAL_CHECKS;

I'm stumped... any ideas?

Here are the UDF definitions

ALTER FUNCTION dbo.Cases( @day date ) RETURNS TABLE
WITH SCHEMABINDING
AS RETURN (
SELECT 
    CaseID -- other 42 columns ommitted
FROM (
    SELECT
        ROW_NUMBER() OVER (PARTITION BY CaseID ORDER BY UpdateDate DESC, UpdateNumber DESC) AS RecordAge,
        CaseID, 
        Action
    FROM
        dbo.CaseAudit
    WHERE
        convert(date,UpdateDate) <= @day
    ) AS History
WHERE
    RecordAge = 1                    -- only the most current record version
    AND isnull(Action,'') != N'DEL'  -- only include cases that have not been deleted
)

ALTER FUNCTION dbo.AwareDates( @day date ) RETURNS TABLE
WITH SCHEMABINDING
AS RETURN (
WITH 
    History AS (
        SELECT row_number() OVER (PARTITION BY CaseID, ContactID ORDER BY UpdateDate DESC, UpdateNumber DESC) AS RecordAge,
            CaseID, InfoReceived, ReceiveDate, ResetClock, Action
        FROM dbo.ContactLogAudit WITH (NOLOCK)
        WHERE convert(date,UpdateDate) <= @day
        ),
    Notes AS (
        SELECT 
            CaseID,
            convert(date,ReceiveDate,112) AS ReceiveDate,
            ResetClock
        FROM History 
        WHERE RecordAge = 1                -- only the most current record version
        AND isnull(Action,'') != N'DEL'    -- only include notes that have not been deleted
        AND InfoReceived = N'Y'            -- only include notes that have Info Rec'd checked 
        AND len(ReceiveDate) = 8 AND isnumeric(ReceiveDate) = 1 AND isdate(ReceiveDate) = 1 -- only include those with a valid aware date
        ),
    Initials AS (
        SELECT CaseID, min(ReceiveDate) AS ReceiveDate
        FROM Notes 
        GROUP BY CaseID
        ),
    Resets AS (
        SELECT CaseID, max(ReceiveDate) AS ReceiveDate
        FROM Notes 
        WHERE ResetClock = N'Y'
        GROUP BY CaseID
        )
SELECT 
    i.CaseID                              AS CaseID,
    i.ReceiveDate                         AS InitialAwareDate, -- the oldest valid aware date value (must have AE Info Reveived checked and a received date)
    coalesce(r.ReceiveDate,i.ReceiveDate) AS AwareDate  -- either the newest valid aware date value with the Reset Clock checked, otherwise the initial aware date value
FROM Initials AS i
LEFT JOIN Resets AS r 
    ON i.CaseID = r.CaseID
);

I have further found that if I drop the "WITH (NOLOCK)" table hint, I get correct results. Also if add a join hint to the AwareDates UTF or even add a COLLATE Latin1_General_BIN on the LEFT JOIN relation between Initials and Resets.

Query plan row counts -- without join hint (broken)

Cases { Actual: 25,891, Estimate: 19,071.9 }
AwareDates { Actual: 24,693, Estimated: 1,463.09 }
- Initials { Actual: 24,693, Estimated: 1,463.09 }
- Rests { Actual: 985, Estimated: 33.2671 }
AwareDates matches 8,108 of the Cases rows in the join'd result-set

Query plan row counts -- with join hint (working)

Cases { Actual: 25,891, Estimate: 19,071.9 }
AwareDates { Actual: 24,673, Estimated: 1,837.67 }
- Initials { Actual: 24,673, Estimated: 1,837.67 }
- Rests { Actual: 982, Estimated: 42.6238 }
AwareDates matches 24,673 of the Cases rows in the join'd result-set

I have further whittled down the scope of the issue. I can;

SELECT * FROM AwareDate(@date);

and

SELECT * FROM AwareDate(@date) ORDER BY CaseID;

With different row counts.

Can you share the definition of your `UDF`s? The biggest red flag to me is that you are querying function output, not tables. — JNK, Mar 20 '12 at 19:40
+1 for a genuine mystery. I can say I've never tried joining two functions before. Since you're asking about a bad index, I assume the functions are getting data from tables. Can you query the tables directly instead of going through the functions? Mayhap in a view? — Nick Vaccaro, Mar 20 '12 at 19:48
I've tried a few other ways of quering for the data, and if I place the table-function output into a temp table and then join the two temp tables with no hint, it works as expected. Also, if I (SELECT TOP 999999 * FROM AwareDate(@date) ORDER BY CaseID) AS a it works without the hint. — Dennis Allen, Mar 20 '12 at 19:51
The table function for cases is looking at audit history and producing a As-Of at the given @date. The AwareDate is looking at a notes table and both performing the As-Of process as well as finding the "first" and "last" notes tagged for follow-up or as new information. It seemed like clutter. Also, I have been using this technique for 3+ years so I agree that the trouble is that the query planner is trying to reach into the AwareDate UTF and incorrectly filtering its output. — Dennis Allen, Mar 20 '12 at 19:58
What happens when you run DBCC CHECKTABLE table_name index_id ? — paparazzo, Mar 20 '12 at 21:29
Can you supply the definition of the functions? Also would be interesting to see the execution plan that doesn't work. — Martin Smith, Mar 20 '12 at 22:39
@Norla - It's possibly only a genuine mystery because the OP hasn't provided sufficient information yet (not even told us if the TVFs are inline or multi statement). I can think of some circumstances in which this could happen (e.g. if function is non determinstic) — Martin Smith, Mar 21 '12 at 09:29
@Blam - DBCC CHECKTABLE does not find any issues. I've even fully dropped and recreated the indexs being used. — Dennis Allen, Mar 21 '12 at 13:46
@MartinSmith - I agree that not having those definitions is a problem and if I get permission to post them I will. They are both in-line. — Dennis Allen, Mar 21 '12 at 13:52
@MartinSmith - I am not sure how to post the execution plan. Something that is interesting is that the plan without the hint swaps the LEFT JOIN's to RIGHT JOIN's and of course keeps and enforces the LEFT JOIN if you add a query hint. Seems a red harring though since adding COLLATE or removing the NOLOCK also fix the issue but the query plans still use the RIGHT JOIN. — Dennis Allen, Mar 21 '12 at 15:24
@DennisAllen - You can paste the XML up to a site like pastebin or gist. — Martin Smith, Mar 21 '12 at 15:28
@DennisAllen - But from the definitions I see you will get a non deterministic result if any ties for `CaseAudit(CaseID, UpdateDate, UpdateNumber)` and similar issue for the other function. This means that 2 separate invocations of the function won't necessarily return the same results and it can depend on the access methods used. — Martin Smith, Mar 21 '12 at 15:34
RE: Edit it would be better if we could see actual plan XML really. — Martin Smith, Mar 21 '12 at 18:11
Since this is also potentially a bug can we get specific version of SQL (select @@version)? See - http://social.msdn.microsoft.com/forums/en-US/sqldatabaseengine/thread/7164c0ac-b657-4786-9736-05c1581e6698/ — EBarr, Mar 23 '12 at 01:19

score 2 · Accepted Answer · edited Mar 29 '12 at 15:10

You don't specify the specific version of SQL (@@version), but this seems suspiciously like a bug that was fixed in Cumulative Update 6 for SQL 2008 R2 (apparently it also applies to SQL 2008).

KB 2433265
FIX: You may receive an incorrect result when you run a query that uses the ROW_NUMBER function together with a left outer join in SQL Server 2008

The example in the article specifies DISTINCT. The article, however, is worded ambiguously -- it's not clear whether you NEED a distinct or if DISTINCT is one of the triggers.

Your example doesn't have a distinct like the article, but it appears modified for the sake of asking the question(i.e. 42 columns missing). Is there a distinct? Also in the AwareDates udf by the time i get down to the Initials CTE you do a GROUP BY which could have the same effect as a DISTINCT.

UPDATE

@Dennis from your comment I still can't tell if you're using SQL 20080 or 2008 R2.

If you're running 2008, the KB article says "The fix for this issue was first released in Cumulative Update 11 for SQL Server 2008 Service Pack 1." So, post SP1.

On the other hand, if you're using SQL 2008 R2, you are correct that this was fixed in CU 6, which was part of SP1. But this bug appears to have resurfaced. Look at Cumulative update package 4 for SQL Server 2008 R2 Service Pack 1 -- released post SP1.

970198      FIX: You receive an incorrect result when you run a 
            query that uses the row_number function in SQL Server 2008 
            or in SQL Server 2008 R2

In the associated KB article MS dropped the reference to distinct:

Consider the following scenario. You run a query against a table that has a 
clustered index in Microsoft SQL Server 2008 or in Microsoft SQL Server 2008
R2. In the query, you use the row_number function. In this scenario, you 
receive an incorrect result when a parallel execution plan is used for the 
query. If you run the query many times, you may receive different results.

This seems to confirm my earlier reading of KB 2433265 -- the phrasing suggests distinct is just one of many conditions that can cause the behavior. It seems that a parallel execution plan is the culprit this time around.

This issue was first fixed in cumulative update 6 (pre-sp1), and I have SP1 installed, so I would think the fix would be included. Also, there is no distinct in the removed columns from cases. — Dennis Allen, Mar 27 '12 at 17:30
@DennisAllen just to be completely clear you need either `10.00.2804` or later for 2008 or `10.50.1765.0` for 2008 R2 — Conrad Frix, Mar 27 '12 at 18:44
I am thinking that [http://support.microsoft.com/kb/2546901] is more likely my issue. I am testing that now. — Dennis Allen, Mar 27 '12 at 21:53
To me those seem like 2 sides of the same coin -- Ranking functions & parallel execution plans. — EBarr, Mar 27 '12 at 22:05
I am marking this the answer because the fix I got from MS support was to turn on trace flag 4134 and you can see in the notes for this update, that you need to turn on that same trace flag to activate the fix. So indeed these two seem closely related. — Dennis Allen, Mar 28 '12 at 15:19
Further update from MS said that the changes to SORT operations which this flag turns on was incorporated in SQL 2012 as default behavior and that to date no one has reported a problem with turning on the new sort logic in SQL 2008 or 2008 R2. — Dennis Allen, Mar 29 '12 at 16:10

SQL Server LEFT JOIN fails to match rows without JOIN hint

1 Answers1