Is Id column required / recommended in fact table in the given scenario

Question

I am using a fact table with the following structure in SQL Server 2012:

CREATE TABLE [dbo].[factTable] (
    [Id]            BIGINT      IDENTITY (1, 1) NOT NULL,
    [Date]          DATE        NOT NULL,
    [MinuteNumber]  SMALLINT    NOT NULL,
    [CityId]        INT         NOT NULL, /* Foreign key to dimCity */
    [Value]         DECIMAL(12, 4)  NULL
)

I have a clustered index on the Date column with a fill factor of 100. The data inserted into this table is almost always in the ascending order of Date and MinuteNumber.

I want to know - if having the Id column is necessary in the given scenario? Does it have any performance implications? Or can I safely eliminate it.
I also want to know if having clustered index on Date column is sufficient (there will be many records with the same date, even same date and same minute-number) or is it better to have a clustered index combining multiple columns; and what are the performance and storage implications for either approach?

I am new to this and any help will be highly appreciated.

In that case you have created a non-unique clustered index if you date column has similar data. I would rather suggest having the clustered index on ID column and then probably you can create a unique non-clustered index on date+city-id — Rahul, May 02 '14 at 06:44
Often in datawarehouses it makes more sense for the fact table to link to a dimension table for dates and cities etc. This enables the DimDate table to hold values such as year, month, day of week, quarter etc and allow greater flexibility and ease of querying. The key value for the dimension table is often just the date expressed as an integer e.g todaywould be 20140502. There are lots of articles on this and some good books such as the microsoft datawarehouse toolkit by Kimball. — Steve Ford, May 02 '14 at 12:14

score 1 · Answer 1 · answered May 02 '14 at 06:48

1

A clustered index must be unique, so if you do decide to go with DATE, you'll need another column(s) which together would always be unique. A clustered index also controls the order of the data physically, so the key should be one that's in ever ascending order. Again, something that your DATE seems to have, which you got right.

However, it would be good to know how much data your table is going to have, and how many nonclustered indexes you plan on using? Since every nonclustered index leaf record includes a pointer to the clustered index, you don't generally speaking want your clustered index to be any larger than it has to be.

Basically the advantages of a simple autointeger number as the key column for a clustered index are that it's effective storage-wise, it always increases in order, and it has good synergy with other objects and use cases as well.

marc_s, a user here, posted a link to another site (link), I think you should definitely check it out.

But to summarize, a clear majority of the time the safe bet is to keep this simple and just put a clustered index on your basic int / bigint identity column, then use nonclustered indexes to optimize searches on specific columns in the table. This is more than good enough for most of the time. No need to complicate things and look for 5% improvement on queries already running more than fast enough. So, the question is, is there any reason for you to expect a standard solution would not work in your case? Like, a huge amount of data (talking bigint scale rows here, exceeding several billions for instance), other performance implications (complex conditional joins to other tables in the same db), or other things like that?

answered May 02 '14 at 06:48

Kahn

1,630
1
13
23

Thanks for your response. I already have ~43 million records and I expect to have double or more than that in the near future. I now understand that the unique clustered key is used by non-clustered index so it is a better idea to use index on Identity. Most of my queries will be primarily using Date field along with City frequently. So is your recommendation to add non-clustered index on Date and CityId? My inserts are not frequent so slow inserts are acceptable; however I am already getting several minutes to Merge data into this table from a transform table. – user3446562 May 02 '14 at 07:04
@user3446562: the clustered index columns are the most redundant thing in a SQL Server database - so that smaller they are, the better. Also a good clustered index must be unique - and a `DATE` alone usually doesn't provide that. I would almost always opt for an `ID INT IDENTITY` (or `BIGINT`) - it's just the safest, easiest, and most efficient in the vast majority of cases. – marc_s May 02 '14 at 07:21
3

-1 for the "clustered index must be unique" myth. It has to be narrow, ascending and stable, but SQL Server will add a uniquifier if needed. And CI can very efficiently support range queries. PK must be unique, but it doesn't have to be clustered. That being said, I'd go for identity column here anyway. – dean May 02 '14 at 07:27
I assume that by merge you mean the actual merge, meaning comparing records between source and target and updating / deleting / inserting where applicable? In this case it would be helpful to see the actual merge SQL. Now since we don't know your full scenario, I'm doing some guessing here, but I would assume that changing Id to clustered index and then creating a nonclustered index on Date and CityId (if indeed you use BOTH at the same time), and checking the filters in the merge query to optimize it, should be enough. – Kahn May 02 '14 at 07:27
As @dean said, if a clustered index can efficiently support the range queries then it sounds like Date field is the right choice for the clustered index. It is narrow, stable and ascending, and I need to perform range queries on it almost all the time. It is just not unique. Now I am confused as there are contradictory ideas. Please clarify. Kahn, you are correct about the Merge issue and I agree that it is slightly off topic for this thread. – user3446562 May 02 '14 at 08:02
@dean: so if the clustered index doesn't have to be unique - why does SQL Server add a *uniquifier* ? The clustered index **must be** unique - either you provide it as unique, or SQL Server will make it unique - whatever way - the clustered index (which identifies the location of each row) **must be unique**. Period. – marc_s May 02 '14 at 15:05
1

@marc_s OK, semantics :) Let me rephrase. The values in column(s) on which a CI is declared **don't have to be unique**, period. Adding a uniquifier is just an implementation detail, one need not be concerned with. What really matters here is that CI should not be chosen for uniqueness, but for some other qualities. – dean May 02 '14 at 15:30

score 0 · Accepted Answer · answered May 02 '14 at 13:41

In your case, I'd probably create a nonclustered primary key on the identity column, to allow for easier FK relationship management and for performance.

The clustered key would be on the date column, to allow for faster range queries. The date column also fulfills the three basic requirements for a clustered index: it's narrow (to make nonclustered indexes smaller), it's stable (because a change on a CI column means reshuffling the NC indexes as well, this is to be avoided) and it's increasing (to avoid bad page splits, the ones not at the end of the table).

WRT non-unique clustered index, SQL Server will add a uniquifier data to it if it's not unique.

Thanks Dean. Do I need the identity column at all in this fact table if I am not going to reference the data on this table from anywhere else (i.e. no foreign keys)? — user3446562, May 02 '14 at 13:55

Is Id column required / recommended in fact table in the given scenario

2 Answers2