Suppose we had to define optimal indexing for Stack Overflow questions. Rather than take the actual schema of the Posts table, let's include only the columns that are relevant here:
create table Posts (
    Id int not null identity,
    PostTypeId tinyint not null,
    LastActivityDate datetime not null default getdate(),
    Title nvarchar(500) null, -- answers don't have titles
    Body nvarchar(max) not null,
    ...
)
I've made Id an identity column even though Data Stack Exchange shows that none of the tables have a primary key constraint or identity columns on them. There are just many unique/non-unique clustered/non-clustered indices.
Usage scenarios
The main scenarios for posts are:
1. They're displayed chronologically, in descending order by their LastActivityDate column (or maybe LastEditDate, which I haven't included above as it's not so important)
2. They're displayed individually on the question details page
3. Answers are displayed on the question details page in vote order (by the ScoreCount column, which is not part of the code above)
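To make the scenarios concrete, here is a rough sketch of the queries they imply. It assumes that PostTypeId 1 means question and 2 means answer, that ParentId and ScoreCount are among the columns elided from the table definition, and that @QuestionId is just a placeholder value.

```sql
-- Assumptions: PostTypeId 1 = question, 2 = answer;
-- ParentId and ScoreCount exist among the elided columns.
declare @QuestionId int = 42; -- hypothetical question id

-- Scenario 1: questions with the most recent activity first
select top (50) Id, Title, LastActivityDate
from Posts
where PostTypeId = 1
order by LastActivityDate desc;

-- Scenario 2: a single question for the details page
select Id, Title, Body, LastActivityDate
from Posts
where Id = @QuestionId;

-- Scenario 3: answers to that question, highest score first
select Id, Body, ScoreCount
from Posts
where ParentId = @QuestionId
  and PostTypeId = 2
order by ScoreCount desc;
```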
Indexing optimization
Which indices would best serve the scenarios above, especially given that #1 is the most common scenario and therefore has to work really fast?
I'd say that one of the better possibilities would be to create these indices:
-- index 1
alter table Posts
    add primary key nonclustered (Id);
-- index 2
create clustered index IX_Posts_LastActivityDate
    on Posts(LastActivityDate desc);
-- index 3
create index IX_Posts_ParentId
    on Posts(ParentId, PostTypeId)
    include (ScoreCount);
This way we basically end up with three indices of which the second one is clustered.
So in order for #1 to work really fast, I've put the clustered index on the LastActivityDate column, because clustered indices are especially good at range comparisons. And since we'd be ordering questions chronologically from newest to oldest, I've set the ordering direction to descending. The post type is available there too, since the leaf level of a clustered index holds the entire row.
So what did we solve with this?
- scenario #1 is very efficiently served by index 2, as it's clustered and therefore fully covering; we can also easily and efficiently do result paging;
- scenario #2 is somewhat covered by unique index 1 (to get the question) and non-unique index 3 to get all related answers (scenario #3) ordered by ScoreCount; and if we decide to order answers chronologically, that's also covered by index 2;
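As an illustration of the paging mentioned for scenario #1, here are two possible sketches. Both assume PostTypeId 1 means question; the variable names and values are made up. The first uses OFFSET/FETCH (SQL Server 2012 or later), the second seeks on the clustered key with a range comparison and would skip rows that share exactly the same LastActivityDate as the last row seen.

```sql
-- Assumption: PostTypeId 1 = question; variables are placeholders.
declare @PageNumber int = 3,
        @PageSize   int = 50;

-- Variant A: offset-based paging; Id is only a tie-breaker for equal dates
select Id, Title, LastActivityDate
from Posts
where PostTypeId = 1
order by LastActivityDate desc, Id desc
offset (@PageNumber - 1) * @PageSize rows
fetch next @PageSize rows only;

-- Variant B: range comparison on the clustered key
-- @LastSeen = LastActivityDate of the last row on the previous page
declare @LastSeen datetime = '2012-01-01T00:00:00';

select top (@PageSize) Id, Title, LastActivityDate
from Posts
where PostTypeId = 1
  and LastActivityDate < @LastSeen
order by LastActivityDate desc;
```

Variant B avoids reading and discarding the earlier pages, which is where the clustered ordering should help most.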
Question 1
SQL Server implicitly adds the clustering key to every nonclustered index so that it can locate records in the row store.
- if the clustered index is unique, then its key is what gets added to nonclustered indices, and
- if the clustered index is non-unique, SQL Server supposedly generates its own UniqueId and uses that
Since I've also added a nonclustered primary key on the table (which must by design be unique), I would like to know whether SQL Server will still supply its own unique key on the non-unique clustered index, or whether it will use the nonclustered primary key to uniquely identify each record instead.
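This doesn't answer the question, but as a small sketch, the declared properties of the three indices can be checked in the catalog. It assumes the table lives in the dbo schema; the query only lists what was declared and says nothing about any hidden value SQL Server adds internally.

```sql
-- Assumption: the table is dbo.Posts.
select i.name,
       i.index_id,
       i.type_desc,       -- CLUSTERED or NONCLUSTERED
       i.is_unique,
       i.is_primary_key
from sys.indexes as i
where i.object_id = object_id(N'dbo.Posts')
order by i.index_id;
```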
Question 2
So if the primary key isn't used to locate records in the row store (the clustered index), does it even make sense to create a PK at all? Would it be better in this case to do this instead?
create unique index UX_Posts_Id
on Posts(Id);
-- include (Title, Body, ScoreCount);
It would be great to also include the commented-out columns, but that would make this index less efficient, as the wider rows would cache worse... The reason I'm asking whether this index would be better than a primary key constraint is that we can add non-key columns to this index, while we can't do the same with a PK constraint, which internally generates a unique index...
Question 3
I'm aware that LastActivityDate changes, which isn't desirable for a clustered index key. But we have to consider that this column is likely to change only for some time before it becomes more or less static, so it shouldn't cause too much index fragmentation: records will mostly be appended to the end whenever LastActivityDate changes. Fragmentation on some arbitrary page should never happen, because no new record would ever be inserted into an old(er) page, as LastActivityDate will only increase. Hence most modifications will happen on the last page.
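Rather than guessing, the fragmentation could be measured after a realistic workload. A minimal sketch, assuming the table is dbo.Posts and that the cheap LIMITED scan mode is good enough:

```sql
-- Assumption: the table is dbo.Posts in the current database.
declare @DbId  int = db_id(),
        @ObjId int = object_id(N'dbo.Posts');

select i.name,
       s.index_type_desc,
       s.avg_fragmentation_in_percent,
       s.page_count
from sys.dm_db_index_physical_stats(@DbId, @ObjId, null, null, 'LIMITED') as s
join sys.indexes as i
  on i.object_id = s.object_id
 and i.index_id  = s.index_id
order by s.avg_fragmentation_in_percent desc;
```

Running this before and after a batch of LastActivityDate updates would show whether the clustered index really stays as compact as assumed.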
So the question is whether these changes can be harmful, given that LastActivityDate isn't the best candidate for a clustering index key:
- it's not unique - although one could argue about this, especially if we changed datetime to datetime2, used the higher-precision function sysdatetime(), and declared the index as unique
- it's narrow - pretty much
- it's not static - but I've explained how it changes
- it's ever increasing
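The datetime2 idea from the first bullet could look roughly like the sketch below. The table, constraint, and index names are made up for illustration, and the elided columns are left out. Note that uniqueness is not guaranteed by sysdatetime() itself: if two rows ever received the same value, the second insert or update would fail with a duplicate key error.

```sql
-- Hypothetical variant of the table; names are illustrative only.
create table PostsUniqueDate (
    Id int not null identity,
    PostTypeId tinyint not null,
    LastActivityDate datetime2(7) not null
        constraint DF_PostsUniqueDate_LastActivityDate default sysdatetime(),
    Title nvarchar(500) null, -- answers don't have titles
    Body nvarchar(max) not null
);

-- Declaring the clustered index unique means no hidden value is needed,
-- at the cost of errors on duplicate timestamps.
create unique clustered index UX_PostsUniqueDate_LastActivityDate
    on PostsUniqueDate(LastActivityDate desc);
```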