1

This is a long, involved question about index optimization theory. This is not homework, though I was first exposed to this question in a sample exam for Microsoft's 70-432. The original question was about general query optimization, but then I found this peculiar behavior I could not explain.

First, the table:

CREATE TABLE Invoice_details (
Invoice_id int NOT NULL,
Customer_id int NOT NULL,
Invoice_date datetime DEFAULT GETDATE() NULL,
Amount_total int NULL,
Serial_num int IDENTITY (1,1) NOT NULL)

Now, a clustered index, and the two indexes for testing:

CREATE UNIQUE CLUSTERED INDEX [ix_serial] ON [dbo].[Invoice_details] ([Serial_num] ASC)
/* Below is the "original" index */
CREATE NONCLUSTERED INDEX [ix_invoice_customer] ON [dbo].[Invoice_details] 
    ([Invoice_id] ASC,[Customer_id] ASC)
/* Below is the "optimized" index (adds one included field) */
CREATE NONCLUSTERED INDEX [ix_invoice_customer_inc] ON [dbo].[Invoice_details] 
    ([Invoice_id] ASC,[Customer_id] ASC) INCLUDE ([Invoice_date])

I also added some random test data to the table - 100000 rows. Invoice_id, Customer_id, and Amount_total each received their own random values (range 1000-9999), and Invoice_date received GETDATE() plus a random number of seconds (range 1000-9999). I can provide the actual routine I used, but did not think the specifics would be relevant.

And finally, the query:

SELECT Invoice_id,Customer_id,Invoice_date FROM Invoice_details WHERE Customer_id=1234;

Obviously, the query's first step will be a nonclustered index scan. Regardless of which index is used, that first step will return the same number of index rows. With the "original" index, the next step will be a lookup via the clustered index to retrieve Invoice_date, followed by an internal JOIN between the two sets. With the "optimized" index, that field is included in the index leaf, so the planner goes straight to returning the results.

Which index results in faster execution, and why?

  • BTW: Are you going to enlighten us about the result you got that you found surprising? – Martin Smith Sep 06 '11 at 10:35
  • I was not expecting the included index to perform more poorly. In the original test question, my answer was to create a new nonclustered index on Customer_id. That was wrong...the test said the included field was more of an optimization. During my tests, the results showed that not only was my method faster, the included field actually decreased performance versus the clustered index scan it used by default. – Wanna-be-guru Sep 10 '11 at 09:44

2 Answers2

1

It depends ... on the tipping point.

Bogdan Sahlean
  • 19,233
  • 3
  • 42
  • 57
0

Assuming no issues such as fragmentation then it comes down to selectivity of the query.

The 2 indexes are very similar. Because the "optimized" one includes an additional column in the leaf pages then a full scan of that index may well mean more pages need to be read compared to the original one. However if more than a handful of rows are due to be returned I would expect the benefit of not needing the lookup to very quickly outweigh this minor disadvantage.

Martin Smith
  • 438,706
  • 87
  • 741
  • 845
  • @@sahlean - I just tried to upvote your answer because you beat me to including [the "tipping point"](http://www.sqlskills.com/BLOGS/KIMBERLY/category/The-Tipping-Point.aspx) link but you have deleted it! – Martin Smith Sep 06 '11 at 09:59
  • I thought the answer (my answer) is too simple. – Bogdan Sahlean Sep 06 '11 at 10:10
  • @@sahlean - No I think that link says it all really. I was going to link to it myself and was quite annoyed you had beaten me to it! – Martin Smith Sep 06 '11 at 10:13
  • Your answer (both of you) is dead on, as shown by further experimentation. The additional include field increases the size of the index, thus making the scan slower due to increased page reads. After a certain point (i.e., table size), the speed advantage goes to the clustered lookup. Points to sahlean for "first". :) Thanks much. – Wanna-be-guru Sep 10 '11 at 09:41