4

I have just taken over a database which has around 2200 tables. Over 2000 of these have no clustered index (some have no indexes at all).

All of the tables have been configured to use a GUID as the uniqueidentifier.

Just looking at the query plans, I can see that there are many table scans occurring. Most searches use the uniqueidentifier to search on.

I am wondering if it is better to have a clustered index on the GUID than not to have a clustered index at all. I imagine that a clustered index on a 16-byte column will inevitably lead to fragmentation.

I could arguably cluster on other columns but the majority of searches tend to search by or join via the GUIDS.

Any advice would be very much welcomed. I've never seen so many GUID's!!

  • As always -- it depends. Is this database heavy on reading and light on writing, or the other way around? Are there many queries that only want a single row, or many that scan the whole table? One approach if there are many row lookups and many inserts is to add an identity, cluster on that and add a non-clustered index to the GUID column. That index will fragment, but it's no big deal if it only serves row lookups. – Jeroen Mostert Sep 22 '17 at 11:03
  • 1
    One other trick in the box, if the GUIDs are produced by `NEWID()` default constraints, is to patch them to `NEWSEQUENTIALID()`. That will at least cut down on fragmentation for new inserts. – Jeroen Mostert Sep 22 '17 at 11:05
  • Thanks Jeroen - my other dilemma is that I am merely the DBA in this & the power to add fields (in this case the identity) is with 3rd Party developers. Based on my recent conversations with them, I am pretty certain that they will be reluctant to add such functionality! They really could have used Identity on an INT, not sure why they have committed to so many GUIDS. – Stevie Gray Sep 22 '17 at 11:08
  • To be clear: you aren't required to convince the developers to start using integer IDs. Adding a clustered index on a sequential ID that is auto-generated by the database but otherwise completely unused, and a non-clustered index on the GUID, would already improve matters if there are many row lookups. Of course, if they demand that no alterations to the tables be done at all, your job is harder... but then they should take responsibility for their own mess. There's only so much you can do from the database layer itself. – Jeroen Mostert Sep 22 '17 at 11:09
  • Never a truer word!!! Thanks for your advice – Stevie Gray Sep 22 '17 at 11:16

3 Answers3

5

In generally, I would recommend having an identity column as the primary key and use that for clustering. This is also a better choice for joins.

Why? First, identity keys are generally shorter that unique ids. So, foreign key references and indexes are smaller.

More importantly, inserts would always go at the "end" of the table. When using GUIDs, inserts are often going to cause fragmentation. If you are inserting rows, I would say that a secondary index on the GUID might be better than a clustered index (the fragmentation is only in the index).

With 2000 tables, I doubt you will change the structure. You can ameliorate the fragmentation using newsequentialid().

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks Gordon, I will definitely research the possibility of newsequentialid(), looks like an improvement. I'm not sure I will be allowed to add an identity column and therefore have to make the most of what I have. I am allowed to add indexes, stats and tune queries....but I am not allowed to alter the structure of the tables, hence why I'm wondering if I should cluster on these tables? – Stevie Gray Sep 22 '17 at 11:13
  • Suppose you can't have an `identity` column as the PK. This leaves us with one of: cluster on GUID PK (=fragmentation), cluster on a non-PK `identity` column whose only purpose is clustering, or leave it a heap table. How do the last two compare? Is a table clustered on a column that is never actually used directly for lookups/joins better than a heap table? – Roman Starkov Feb 25 '20 at 09:18
  • @RomanStarkov . . . A question should be asked as a *question* not a comment. – Gordon Linoff Feb 25 '20 at 11:13
  • I considered that but I expect it to get closed as a duplicate of this one... even though this particular subtlety isn't covered yet. – Roman Starkov Feb 25 '20 at 12:38
5

GUID column with random values usually is not the best choice for a clustered index because it could be the root cause of an index fragmentation:

  1. Read ahead opportunity of the database won't be effective;
  2. The cost of insert operations will be too expensive, because in this case you'll got lots of page split overhead;

There are 3 ways how you can live with that:

  1. Schedule planning index reorganizing and rebuilding which will reduce index fragmentation and improve your statistics automatically;
  2. Use newsequantialid for generating values of this column;
  3. Generate GUID value sequantialy outside of the database (Guid.Comb Identifier is a great example of solving this issue in NHibernate).
Maxim Zhukov
  • 10,060
  • 5
  • 44
  • 88
1

This is a really a comment on your question to Gordon's good answer:

Firstly, don't forget to check the index DMVs to see which ones are being used (or not used) and have a look at the expensive query plans in the cache to focus on the tables and queries that will be causing most pain. I would expect that many of those 2200 tables are relatively small & the queries are able to look up pretty quickly even from the guid clustered index.

For those tables that aren't clustered, clustering on the guid would reduce fragmentation, since it forces all the data for the table to be colocated rather than allowing pages to be put in the next free extent & spreading tables all over the disk. This should make some of the I/O more efficient.

Check you have a low enough fillfactor so that your regular index rebuilds avoid page splitting in advance, although it will also be workload dependent (OLTP vs DW and read/write ratio of table)

If you have applications that are doing explicit column selects/inserts then you may be able to add an identity column without breaking anything. That allows you cluster around the identity & add an index to the guid. Whether this really helps depends on the relative (in)efficency of the new plans.

You could consider clustering around a non-guid field where queries will lookup against it fairly regularly (eg, a date range) and index the guid separately. You'd have to look at the queries & relative performance for that more closely.

simon coleman
  • 349
  • 2
  • 5
  • Thanks for this Simon, I think the correct indexing solution needs to be performed on a per table basis. From all of the answers and my other research, it appears that clustering on the GUID should really be avoided if possible. I will speak with the developers and see whether an identity column could potentially be added. However, if I were to design these databases myself, using an identity column in the first place would have eradicated this. – Stevie Gray Sep 22 '17 at 15:07