Where condition dramatically reduces performance unexpectedly

Question

I am attempting to optimize a very large query that has a particularly bad join condition. I have reduced the larger query down to two tables to illustrate the negative behavior I am seeing.

CREATE TABLE #test1 (Id1 INT, Id2 NVARCHAR(256))
CREATE CLUSTERED INDEX PK_test1 ON #test1 (Id1, Id2)
-- Sample (1, "a|b|c|d")

CREATE TABLE #test2 (Id1 INT, Id2 NVARCHAR(256))
CREATE CLUSTERED INDEX PK_test2 ON #test2 (Id1, Id2)
-- Sample (1, "a|b")

The join I am attempting to optimize is a postfix match (query hints added to emulate the join order as these are a part of a much larger query).

DECLARE @id INT = 1

SELECT * 
FROM  #test1 t1
INNER [HASH] JOIN #test2 t2 ON t1.Id1 = t2.Id1
                            AND t1.Id2 LIKE t2.Id2 + '%'
-- This condition is the one that is causing issues
-- WHERE t1.Id1 = @id 
OPTION (FORCE ORDER)

Without the where clause, the join is able to use the HASH join hint:

However with the where clause the optimizer is unable to build that query instead decides to do this:

Substantially reducing the performance of the query.

As the where condition is reducing the data sets from both tables, I would have expected it to be strictly better, however it is completely changing the seek. It also prevents the join from being a hash or merge join, and it is not clear at all why that would be the case as it should just be reducing the search space for both tables.

The main question here is how can that where condition be having such a massive negative effect to the overall query plan when it shouldn't be changing the index access patterns

Have you looked at the 'estimated execution plan' (instead of the actual) to see what SQLServer predicts the number of rows/executions in the nested loop? compare that to actual, if they are vastly different, consider adding statistics on id1. Your Id1=1 the estimation might be just one row, where nested loop would perform better than building hash-tables. but the actual plan is showing that you have 1000 items matching id1=1; SqlServer didn't know about this distribution. — tinazmu, May 04 '23 at 00:42
Although #temp tables are temporary SQL Server sometimes caches metadata about them - including statistics used to inform query planner. Have you tried to `UPDATE STATISTICS ... WITH FULLSCAN` on your #temp tables after recreating and repopulating them? — AlwaysLearning, May 04 '23 at 01:34
https://learn.microsoft.com/en-us/archive/blogs/craigfr/implied-predicates-and-query-hints — Martin Smith, May 04 '23 at 07:02
The hash join is pointless anyway, not sure it should have been hinted in the first place — Charlieface, May 04 '23 at 08:50

Martin Smith · Accepted Answer · 2023-05-04T07:21:43.403

The reason why you see the error

Msg 8622, Level 16, State 1, Line 12 Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

When adding the

WHERE t1.Id1 = @id

Is because

FROM  #test1 t1
INNER JOIN #test2 t2 ON t1.Id1 = t2.Id1
    AND t1.Id2 LIKE t2.Id2 + '%'
WHERE t1.Id1 = @id

is the same as

FROM  (#test1 t1 WHERE t1.Id1 = @id) t1
CROSS JOIN (#test2 t2 WHERE t2.Id1= @id) t2 
WHERE t1.Id2 LIKE t2.Id2 + '%'

So you lose the equi join that is necessary for a hash join as discussed in Implied Predicates and Query Hints (Craig Freedman).

On the face of it the hash join doesn't really give you anything as the join condition is on Id1 and all of the rows on the build side have the same value for this and fall in the same hash bucket so it has to replay all of the rows in the build side for all of the rows in the probe side anyway.

You could try using the NO_PERFORMANCE_SPOOL hint to get rid of the spool in the plan

Where condition dramatically reduces performance unexpectedly

1 Answers1