4

I've been looking at trying to optimise (or at least change) some EF code in C# to use stored procedures, and found what seems to be an anomaly (or something new to me) when finding rows matching a constant list. The typical manually-generated short query would be something like...

SELECT Something FROM Table WHERE ID IN (one, two, others);

We had an EF query that we were replacing with a stored procedure call, so I looked at the output, saw that it was complex and thought my simpler query (similar to the above) would be better. It wasn't. Here is a quick demo that reproduces this.

Can anyone explain why the execution plans for the final version - with the

...WHERE EXISTS(... (SELECT 1 AS X) AS Alias UNION ALL...) AS Alias...)

construct is better - seemingly because it omits the costly SORT operation, even though the plan includes TWO index scans rather than the one of the simpler query.

Here's a self-contained example script (I hope)...

USE SandBox;  -- a dummy database, not a live one!
-- create our dummy table, dropping first if it exists
IF EXISTS (SELECT NULL FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'Test')
    DROP TABLE Test;
CREATE TABLE Test (Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY, FormId INT NOT NULL, DateRead DATE NULL);
-- populate with some data
INSERT INTO Test VALUES (1, NULL), (1, GETDATE()), (1, NULL), (4, NULL), (5, NULL), (6, GETDATE());

-- Simple query that I might typically use
-- how many un-read entries are there for a set of 'forms' of interest, 1, 5 and 6
-- (we're happy to omit forms with none)
SELECT  T.FormId, COUNT(*) AS TheCount
  FROM  Test AS T
 WHERE  T.FormId IN (1, 5, 6)
   AND  T.DateRead IS NULL
 GROUP BY T.FormId;

-- This is the first step towards the EF-generated code
-- using an EXISTS gives basically the same plan but with constants
SELECT  T.FormId, COUNT(*) AS TheCount
  FROM  Test T
 WHERE  EXISTS (    SELECT NULL
                      FROM (VALUES (1), (5), (6)
                            ) AS X(FormId) 
                     WHERE X.FormId = T.FormId
               )
   AND  T.DateRead IS NULL
 GROUP BY T.FormId;

-- A step closer, using UNION ALL instead of VALUES to generate the 'table'
-- still the same plan
SELECT  T.FormId, COUNT(*) AS TheCount
  FROM  Test T
 WHERE  EXISTS (    SELECT NULL
                      FROM (    SELECT 1 
                                UNION ALL
                                SELECT 5 
                                UNION ALL
                                SELECT 6 
                            ) AS X(FormId) 
                     WHERE X.FormId = T.FormId
               )
   AND  T.DateRead IS NULL
 GROUP BY T.FormId;

-- Now what the EF actually generated (cleaned up a bit)
-- Adding in the "FROM (SELECT 1 as X) AS alias" changes the execution plan considerably and apparently costs less to run
SELECT  T.FormId, COUNT(*) AS TheCount
  FROM  Test T
 WHERE  EXISTS (    SELECT NULL
                      FROM (    SELECT 1 FROM (SELECT 1 AS X) AS X1
                                UNION ALL
                                SELECT 5 FROM (SELECT 1 AS X) AS X2
                                UNION ALL
                                SELECT 6 FROM (SELECT 1 AS X) AS X3
                            ) AS X(FormId) 
                     WHERE X.FormId = T.FormId
               )
   AND  T.DateRead IS NULL
 GROUP BY T.FormId;

Can anyone help me to understand why and if there is a benefit in wider use for this kind of query format?

I looked around for anything special in (SELECT 1 AS X) stuff and though many show it as being common in EF output, I couldn't see anything about this particular apparent benefit.

Thanks in advance,

Keith

  • Great question! But I am voting to close to migrate to dba.stackexchange.com instead – SqlZim Nov 07 '17 at 22:12
  • 2
    I can't comment on _why_ it produces the particular execution plan it does, but what it's doing is a slightly inferior version of doing three separate select statements with a union all (and the particular structure of the EF output seems to force the execution plan to do this). e.g. `select formid, count(*) as thecount from test where formid = 1 and dateread is null union all select formid, count(*) from test where formid = 5...` - which might be the most efficient form of the given query. – ZLK Nov 07 '17 at 22:37
  • Interesting question. My observations: 1. the last two queries do not undergo parameterisation; 2. when you increase row count in Test table, plans change but still the last query produces one of the better plans (on SQL 2014 Dev). To easily change the number of rows add `GO 1000` below `INSERT INTO Test VALUES(...)` – Alex Nov 07 '17 at 23:44
  • Getting three full clustered index scans instead of a single one and a sort isn't really an improvement despite the fact that it may be given a lower estimated cost. – Martin Smith Nov 08 '17 at 07:08
  • Thanks folks - yes, I was wondering how it would scale - both for more 'values of interest' and more rows in the table. In the actual live case there are really only a dozen rows or so, so the whole question is probably moot, since the performance of the worst possible query would still be acceptable in the real world... but of course I'm looking to learn and maybe find more efficient ways of coding similar things in 'bigger' situations! Thanks for the comments, folks, I'll look some more – Keith Fearnley Nov 08 '17 at 10:17

1 Answers1

1

The predicates behind each of the Index scans in the last of those queries is a a range id >= 1 and id <= 6 and DateRead IS NULL

range between 1 and 6

I did add another "select 1" and it did create another Index Scan. It appears that each (select 1) is literally treated as a table in its own right despite the UNION ALLs to form it into into a single table.

Whereas in all prior queries the predicates for the index scan are a set of DateRead IS NULL followed by ORs

1 or 5 or 6

Thought I would add this into the mix:

declare @tmp table (formid int not null primary key)
insert into @tmp values (1),(5),(6);

SELECT  T.FormId, COUNT(*) AS TheCount
  FROM  Test T
 WHERE  EXISTS (    SELECT NULL
                      FROM @tmp X
                     WHERE X.FormId = T.FormId
               )
   AND  T.DateRead IS NULL
 GROUP BY T.FormId;

But that also included a sort.

dbfiddle.uk enables access to the full showplan xml (somewhat tediously) so if interested: a dbfiddle is here

Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51
  • Thanks, that's interesting. I was wondering, given the comparative costs the plan gives, whether those scans on the last query were artefacts and not actually executed that many times. I'm also not sure (not being up on details of the plans) where the sorting occurs in the last query, since it doesn't have the explicit sort... and if we can take advantage of that approach in other places to reduce sort effort? – Keith Fearnley Nov 08 '17 at 10:14
  • I was wondering the same thing too. Might look again at some point. Not sure about the sorting either. It may be worth asking this question at dba.stackexchage.com as already suggested. – Paul Maxwell Nov 08 '17 at 10:21
  • Thanks, didn't want to double-post - is there a way to move, or do I just close here and re-post there? – Keith Fearnley Nov 08 '17 at 12:58