Short Version
I know which one is better - but why?
--Clustered index scan of 4.5M rows / 2.1 GB
SELECT MIN(RowNumber) FROM Transactions WHERE TransactionDate >= '20191002 04:00:00.000' OPTION(RECOMPILE)
--Covering index seek; 3 logical reads
SELECT TOP(1) RowNumber FROM Transactions WHERE TransactionDate >= '20191002 04:00:00.000' ORDER BY TransactionDate OPTION(RECOMPILE)
Long Version
I have a table as:
Transactions
RowNumber TransactionDate
(int, clustered) (datetime)
----------------- -----------------------
4592515 2019-10-07 11:12:13.690
4592516 2019-10-07 11:12:13.690
4592517 2019-10-07 11:12:18.660
4592518 2019-10-07 11:12:22.960
4592519 2019-10-07 11:13:16.587
4592520 2019-10-07 11:13:22.310
4592521 2019-10-07 11:14:50.060
4592522 2019-10-07 11:15:15.073
4592523 2019-10-07 11:15:32.860
4592524 2019-10-07 11:16:12.360
I want to get the first RowNumber or on after a specific time, e.g.:
SELECT MIN(RowNumber) FROM Transactions WHERE TransactionDate >= '20191007 11:13:00' OPTION(RECOMPILE);
As expected, this performs a clustered index scan:
When your clustered index scan is 4.5M rows, and 2.1 GB, you want to avoid doing that every few seconds. In practice this query takes so long that it, for all intents and purposes, never returns.
Covering index on TransactionDate
I'm interested in TransactionDate and RowNumber, so i create an index on it
CREATE INDEX IX_Transactions_TransactionDate ON Transactions
(
TransactionDate
)
(since RowNumber is the unique cluster key, it will implicitly be part of the index).
Re-run a query
And now i'll run a logically identical query:
SELECT TOP(1) RowNumber FROM Transactions ORDER BY TransactionDate OPTION (RECOMPILE)
And this, as expected, seeks through the new covering index, returning instantly after 3 logical reads:
The the original equivalent query doesn't use the covering index:
SELECT MIN(RowNumber) FROM Transactions WHERE TransactionDate >= '20191002 04:00:00.000' OPTION(RECOMPILE)
It never actually returns, so i can only get the estimated execution plan:
Why does TOP(1)
behave so much different, and better than MIN
?
Other questions indicate that:
- there should be no difference: MAX vs Top 1 - which is better?
- or that
MIN
should be better: SQL performance MAX()
And people will try to suggst:
Well try both, and see which is better.
Yes, i've done that. Now i'm trying to figure out why SQL Server is not running the query the way it should.
It must be statistics
The only excuse the server has to run two logically equivalent queries so horribly differently, would be statistics.
Clustered index statistics
Name Updated Rows Rows Sampled Steps Density Average key
------------------------- ----------------- ------- ------------ ------ ------- -----------
IX_Transactions_RowNumber Oct 7 2019 2:32AM 4561899 4561899 5 1 4
All density Average Length Columns
------------- -------------- ---------
2.19207E-07 4 RowNumber
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
------------ ------------- ------------- -------------------- --------------
9171 0 1 0 1
1700836 1687750 1 1687750 1
2755345 1048575 1 1048575 1
4592121 1825569 1 1825569 1
4592122 0 1 0 1
Covering index statistics
Name Updated Rows Rows Sampled Steps Density Average key length
-------------------------------------- ----------------- ------- ------------ ----- --------- ------------------
IX_Transactions_TransactionDate Oct 7 2019 2:33AM 4561899 4561899 120 0.8462854 12
All density Average Length Columns
------------- -------------- --------------------------
2.590376E-07 8 TransactionDate
2.19207E-07 12 TransactionDate, RowNumber
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
----------------------- ------------- ------------- -------------------- --------------
...snip...
2019-09-22 03:06:09.883 32767 1 27923 1.173477
2019-10-02 19:10:18.007 32767 1 27714 1.182327
2019-10-07 02:30:21.680 14599 2 12430 1.174497
2019-10-07 02:31:56.807 0 1 0 1
So i have to ask:
- if you were the optimizer
- and you had these two logically equivalent queries
- and these statistics
- why would you choose to scan 4.5M rows of a table
- rather than seek to the exact answer?
tl;dr:
--Clustered index scan of 4.5M rows / 2.1 GB
SELECT MIN(RowNumber) FROM Transactions WHERE TransactionDate >= '20191002 04:00:00.000' OPTION(RECOMPILE)
--Covering index seek; 3 logical reads
SELECT TOP(1) RowNumber FROM Transactions WHERE TransactionDate >= '20191002 04:00:00.000' ORDER BY TransactionDate OPTION(RECOMPILE)
why?
Bonus Chatter
I'm trying to censor some table names. Governments get touchy about that sort of thing. I included OPTION(RECOMPILE)
in case anyone tried to hand-wave an answer about cached execution plans. Of course you don't have that in production. duh-doy
Sample scripts
These sample scripts include unrelated columns and indexes:
CREATE TABLE Transactions (
Column1 varchar(50) NOT NULL,
RowNumber int NOT NULL,
Column3 varchar(50) NULL,
Column4 varchar(50) NULL,
Column5 varchar(50) NULL,
Column6 varchar(50) NULL,
Column7 varchar(50) NULL,
Column8 varchar(50) NULL,
Column9 varchar(50) NULL,
Column10 varchar(50) NULL,
Column11 varchar(50) NULL,
Column12 varchar(50) NULL,
Column13 varchar(50) NULL,
Column14 varchar(50) NULL,
TransactionDate datetime NOT NULL,
Column16 varchar(50) NULL,
Column17 varchar(50) NULL,
Column18 varchar(50) NULL,
Column19 varchar(50) NULL,
Column20 varchar(50) NULL,
Column21 varchar(50) NULL,
Column22 varchar(50) NULL,
Column23 varchar(50) NULL,
Column24 varchar(50) NULL,
Column25 varchar(50) NULL,
Column26 varchar(50) NULL,
Column27 varchar(50) NULL
)
CREATE NONCLUSTERED INDEX [IX_Tranasctions_Index1] ON [dbo].[Transactions] (Column7 ASC) INCLUDE (Column12)
CREATE NONCLUSTERED INDEX [IX_Transactions_Index2] ON [dbo].[Transactions] (Column13 ASC)
CREATE NONCLUSTERED INDEX [IX_Transactions_Index3] ON [dbo].[Transactions] (Column5 ASC, TransactionDate ASC) INCLUDE (Column1, Column7, Column11, Column16, Column17, Column18) WHERE (Column5='1')
CREATE NONCLUSTERED INDEX [IX_Transactions_Index4] ON [dbo].[Transactions] (Column11 ASC) INCLUDE (Column7)
CREATE UNIQUE CLUSTERED INDEX [IX_Transactions_RowNumber] ON [dbo].[Transactions] ([RowNumber] ASC)
CREATE NONCLUSTERED INDEX [IX_Transactions_TransactionDate] ON [dbo].[Transactions] (TransactionDate)
CREATE NONCLUSTERED INDEX [IX_Transactions_Index7] ON [dbo].[Transactions] (Column9)
CREATE NONCLUSTERED INDEX [IX_Transactions_Index8] ON [dbo].[Transactions] (Column7, Column8) WHERE (Column7 IS NOT NULL)
CREATE NONCLUSTERED INDEX [IX_Transactions_Index9] ON [dbo].[Transactions] (Column13) INCLUDE (Column7)
ALTER TABLE [dbo].[Transactions] ADD CONSTRAINT [PK_Transactions] PRIMARY KEY NONCLUSTERED (Column1)
Indexes contain the cluster key
There seems to be some confusion from people who don't really understand how an index works.
This table is unique-clustered by RowNumber. That means that RowNumber uniquely identifies the row.
Lets create a hypothetical Customers table, clustered by CustomerID:
| CustomerID | FirstName | LastName |
|------------|-----------|-------------|
| 1 | Ian | Boyd |
| 2 | Tim | Biegeleisen |
| 3 | Gordon | Linoff |
When it comes time to create a non-clustered covering index, you specify the data you want to index. E.g. for a hypothetical covering index on firstname+lastname:
| FirstName | LastName |
|-----------|----------|
That means that literally the database will store:
| FirstName | LastName |
|------------|-------------|
| Gordon | Linoff |
| Ian | Boyd |
| Tim | Biegeleisen |
But that's not all it will store. It also has to store the Cluster key value.
Each entry in an index has to point back to the original data-row that the index entry points to.
So internally the index contains another column - the cluster value:
| FirstName | LastName || Cluster key |
|-----------|----------||-------------|
Which in our case in the CustomerID:
| | || Cluster key |
| FirstName | LastName || (i.e. CustomerID) |
|------------|-------------||-------------------|
| Gordon | Linoff || 3 |
| Ian | Boyd || 1 |
| Tim | Biegeleisen || 2 |
Here's the cool thing: if you had a query that made use of the index, that query can return the cluster key value without having to return to the original full table - because the CustomerID already exists in the index!
SELECT CustomerID FROM Customers WHERE FirstName = 'Ian'
The database can use your covering index to return CustomerID- even though you didn't specify CustomerID in your index. Pretty neat huh?
You can even test it for yourself.
And you can even see it in my original screenshots above (since SQL Server did it when asked). You can also verify it by looking at the showplan:
|--Top(TOP EXPRESSION:((1)))
|--Index Seek(OBJECT:(Transactions.IX_Transactions_TransactionDate]),
SEEK:(Transactions.[TransactionDate] >= '2019-10-02 04:00:00.000') ORDERED FORWARD)
You can also see it in the statistics i included above:
All density Average Length Columns
------------- -------------- --------------------------
2.590376E-07 8 TransactionDate
2.19207E-07 12 TransactionDate, RowNumber
You can also see it in the execution plan: return a column from a index when "the index doesn't even contain that column - how can you even call that a covering idnex!?".
Because it covers what we need:
Works fine on 4.2M row table
I went back to an old copy of the database (4.2M rows vs 4.5M today) and:
- it correctly uses the covering index
CREATE INDEX ... (TransactionDate)
- to return the
RowNumber
- in both cases
So it must be something related to the statistics, and the optimizer.
But in both cases the statistics are up-to-date with FULLSCAN
.
This is where i need someone more knowledgeable about the optimizer than I. But since this is just one unanswered query among millions, and people have already down-voted it out-of-hand, i'll never get an explanation:
- why for two queries that are logically equivalent
- and have everything it needs in an already-sorted index
- one chooses to use the index
- and the other does a full table scan
The dump of the statistics contains everything the optimizer knows about the data - the answer must be there.