Explanation on index on a datetime field and included columns

Question

I have a sqlserver table with the usual

intID(primary key),field1,field2,manyotherfields..., datetime TimeOperation

99% of my different kind of queries start with a TimeOperation BETWEEN startTime AND endTime, and then select * (or count(*)) where fieldA=xxx, and join with other smaller tables. select * because more or less I need all the fields.

I obviusly created an index on TimeOperation ... but performance are not good enough, so I want to add some index key columns or index included columns, but I'm a little bit confused.

I get the difference between the two, but I don't get how much adding a column in each case impacts on speed and on size.

I guess that the biggest improvement would be to create an index including ALL the columns, is it right? (but I can't afford it in terms of space)

And if I often use field1=xxx for example, adding field1 to the index key columns (after TimeOperation) would give better performance right?

Also...just to be sure how an index with included columns works: if I select rows with TimeOperation in a certain range, sql seeks my TimeOperation index for the rows I'm interested in, and it is faster than scanning all the table because in the index the TimeOperation values are in ascending order, is it right? But then I need all the data now I need all the rest of the data fields of those rows...how does sql acts to retrieve the data? I guess it has a sort of bookmark to those rows in the index, right? But it has to hit the table multiple times then... so including all the columns in the index will save the time to hit the table, it it correct?

Thanks! Mattia

"`select *` because more or less I need all the fields." isn't a very convincing reason. Imagine if one of the columns is a `varchar(max)`, where each row stores a 1GB document. If you're not *using* that column, that's a lot of wasted resource to retrieve them. Even if it's not currently the case, what if such a column gets added 6 months down the line? All of a sudden, all of you "`select *` since I need most of the columns" queries become hideously slow. — Damien_The_Unbeliever, Sep 14 '12 at 14:46
SELECT * is lazy. Please read this post, including a workaround for not having to do all that time-consuming, knuckle-killing typing (just remove the columns you don't need): https://sqlblog.org/blogs/aaron_bertrand/archive/2009/10/10/bad-habits-to-kick-using-select-omitting-the-column-list.aspx — Aaron Bertrand, Sep 14 '12 at 14:47
sorry I expressed myself badly, I didn't mean that I use "select *", I extract only the column I need in the different query. What I meant is that in EVERY query I use the datetime range,and considering all the queries I request all the columns... I meant: for the performance solution I need to concentrate on the how to better sort the datetimes — Mattia Durli, Sep 14 '12 at 22:37

score 1 · Accepted Answer · edited May 23 '17 at 12:03

We will need more information on your table examples of your queries to address this fully, but:

DateTime columns should be highly selective by themselves, so an index with TimeOperation as the first column should address the bulk of queries against TimeOperation.
Do not add all columns blindly to an index, or even on included indexes - this will make the index page density worse and be counter productive (you would be duplicating your table in an index).
If all data in your database centres around TimeOperation, you might consider building your clustered index around it.
If you have queries just on field1 = x then you need a separate index just for field1 (assuming that it is suitably selective), i.e. no TimeOperation on the index if its not in the WHERE clause of your query.
Yes, you are right, when SQL locates a record in an index, it needs to do a key (or RID) lookup back into the cluster to retrieve the rest of the columns. If your non clustered index Includes the other columns in your select statement, the lookup can be avoided. But since you are using SELECT(*), covering indexes are unlikely to help .

Edit

Explanation - Selectivity and density are explained in detail here. e.g. iff your queries against TimeOperation return only a small number of rows (rule of thumb is < 5%, but this isn't always), will the index be used, i.e. your query is selective enough for SQL to choose the index on TimeOperation.

The basic starting point would be:

CREATE TABLE [MyTable]
(
  intID INT ID identity(1,1) NOT NULL,
  field1 NVARCHAR(20),
  -- .. More columns, which may be selected, but not filtered 
  TimeOperation DateTime,

  CONSTRAINT PK_MyTable PRIMARY KEY (IntId)
);

And the basic indexes will be

CREATE NONCLUSTERED INDEX IX_MyTable_1 ON [MyTable](TimeOperation);
CREATE NONCLUSTERED INDEX IX_MyTable_2 ON [MyTable](Field1);

Clustering Consideration / Option

If most of your records are inserted in 'serial' ascending TimeOperation order, i.e. intId and TimeOperation will both increase in tandem, then I would leave the clustering on intID (the default) (i.e. table DDL is PRIMARY KEY CLUSTERED (IntId), which is the default anyway).

However, if there is NO correlation between IntId and TimeOperation, and IF most of your queries are of the form SELECT * FROM [MyTable] WHERE TimeOperation between xx and yy then CREATE CLUSTERED INDEX CL_MyTable ON MyTable(TimeOperation) (and changing PK to PRIMARY KEY NONCLUSTERED (IntId)) should improve this query (Rationale: since contiguous times are kept together, fewer pages need to be read, and the bookmark lookup will be avoided). Even better, if values of TimeOperation are guaranteed to be unique, then CREATE UNIQUE CLUSTERED INDEX CL_MyTable ON MyTable(TimeOperation) will improve density as it will avoid the uniqueifier.

Note - for the rest of this answer, I'm assuming that your IntId and TimeOperations ARE strongly correlated and hence the clustering is by IntId.

Covering Indexes

As others have mentioned, your use of SELECT (*) is bad practice and inter alia means covering indexes won't be of any use (the exception being COUNT(*)). If your queries weren't SELECT(*), but instead e.g.

SELECT TimeOperation, field1
FROM 
WHERE TimeOperation BETWEEN x and y -- and returns < 5% data.

Then altering your index on TimeOperation to include field1

CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation) INCLUDE(Field1);

OR adding both to the index (with the most common filter first, or the most selective first if both filters are always present)

CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation, Field1);

Either will avoid the rid / key lookup. The second (,) option will address your query where BOTH TimeOperation and Field1 are filtered in a WHERE or HAVING clause.

Re : What's the difference between index on (TimeOperation, Field1) and separate indexes?

e.g.

CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation, Field1);

will not be useful for the query

SELECT ... FROM MyTable WHERE Field1 = 'xyz';

The index will only be useful for the queries which have TimeOperation

SELECT ... FROM MyTable WHERE TimeOperation between x and y;

OR

SELECT ... FROM MyTable WHERE TimeOperation between x and y AND Field1 = 'xyz';

Hope this helps?

Can you elaborate "DateTime columns should be highly selective by themselves", I don't understand exactly what you mean. And if I have a query with TimeOperation and field1 both in the where clause, what's the difference in having a TimeOperation index and a field1 index, and having a single TimeOperation,field1 index? Thanks! — Mattia Durli, Sep 14 '12 at 22:44
Yes,the extended answer clarified a lot of things. Unfortunately TimeOperation is not unique and not necessarily in ascending order, so I need to keep IntID as primary key. Thank you very much. — Mattia Durli, Sep 17 '12 at 11:46

score 0 · Answer 2 · answered Sep 14 '12 at 14:42

An index, at its most basic, creates a layer of the "hypertree" structure behind the scenes, which allows the SQL engine to more easily find rows with particular values for indexed columns. Each index creates a different way to "drill down" into the table's data using a binary search (logN performance). Each index you add makes selecting by that index faster, at the cost of slowing insertions/updates (the data must be put in and then indexes must be created).

An index, therefore, should normally be created for combinations of columns that are commonly used to filter records. I would indeed create an index on TimeOperation, and TimeOperation alone.

NEVER simply create an index including all columns of a table, especially a wide one such as this.

I already created the TimeOperation index alone, but a simple query like "select TOP 1 * from mytable where field1=5 order by TimeOperation DESC" to get the latest row where field1=5 takes up to 8 seconds (8 million of rows table) — Mattia Durli, Sep 14 '12 at 22:49

Explanation on index on a datetime field and included columns

2 Answers2