4

I have a table that has a primary key as a clustered GUID field; I'm generating GUIDs using NEWSEQUENTIALID() instead of NEWID. Unfortunately, because this table sees ~25k-100k inserts per day, within a few hours the (default: clustered) primary key index becomes 99% fragmented.

I originally used NEWID instead of generating sequential IDs, but even when I recreated the table and reinserted all the rows using NEWSEQUENTIALID (and specified that as the default value for the primary key column), I still see fragmentation in the order of 99% within a few hours. (The table currently has about 1.3 million records in it.

I had thought about replacing the GUID with an integer primary key, but I'm not sure if that will work; plus, since our team uses GUIDs for primary keys instead of integers going forward, I don't think I'll have enough buy-in to do this.

What are my options to keep this thing defragmented? I'm using SQL Server Express, so I don't have access to SQL Agent (and cannot, therefore, run a maintenance plan regularly to rebuild the index).

I also may very likely be splitting this database/table at some point in the future (because of the volume of data), so I will likely need GUIDs to merge the tables.

Also: I cannot use an indexed view, because I have an inner-select which would be difficult for me to unwind into a join.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
ashes999
  • 9,925
  • 16
  • 73
  • 124

4 Answers4

7

In my own personal experience, tossing out GUIDs as your clustering key can have major, positive effects on your system - especially on index fragmentation!

My new INT IDENTITY clustering indices have hardly any fragmentation - even after months of intense, daily production use. Definitely worth it!!

Using the Guid datatype as clustering key in SQL Server is a horribly bad choice - whichever way you look at it...

See some of Kimberly Tripp's (Queen of Indexing) blog post on the topic:

and anything else she's blogged on the topic of clustering keys....

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
  • Can you give me a rough order of magnitude on how many inserts you see daily (to quantify your comment about "intense" use)? – ashes999 Jan 17 '12 at 15:13
  • @ashes999: roughly 5-15K inserts and updates per day. And even with your 25k-100k inserts per day - a clustered index on `int identity` will be just nice and smooth - hardly any fragmentation (<= 3-5%; mostly from delete's, if you have those) – marc_s Jan 17 '12 at 16:54
  • 1
    you are the man. This, plus some indexed views, took my query time from 16 minutes to 30 seconds. Thanks very much :) – ashes999 Jan 17 '12 at 17:17
2

I fully realize that I'm necro'ing a thread that was started over 8 years ago (at the time of this writing) but there's some serious misconceptions about NEWID(), NEWSEQUENTIALID(), "Ever-increasing-INTs", and a thing that I simply refer to as "ExpAnsive Updates" (with an "A"), which are really ExpEnsive (with an "E").

Let's cover the latter first, which is probably the real problem the OP is having...

With only a tiny difference, which doesn't matter much when it comes to the unwanted creation of page splits and the resulting fragmentation, NEWSEQUENTIALID and "Ever-increasing INTs" both work the same way... by themselves, they only create "good" page splits (which are also "bad" but that's a subject for a different discussion). So, with reference to the originally posted question where the Op stated that switching from the completely random NEWID to the "ever-increasing" NEWSEQUENTIALID seemed to make no difference in the amount of fragmentation that was being created.

The reason for that isn't the fact that NEWSEQUENTIALID has a problem (it doesn't). The fragmentation problem is most likely that the new rows are being inserted (which will cause NO fragmentation with NEWSEQUENTIALID) and then those new rows suffer another process to update them. If the updates are "ExpAnsive" where some variable width column in a row becomes wider, then that will cause massive page splits. This will happen even if you build the index with a fairly low FILL FACTOR because INSERTS DON'T STOP INSERTING INTO PAGES JUST BECAUSE THEY REACH THE FILL FACTOR. Instead, a large number of inserts will insert into pages until they're nearly 100% full (depending on the row count per page which depends on the width of the rows being inserted) and then create a new page using a "Good" page split with virtually no fragmentation as surely as if you were using an ever-increasing integer.

So you insert all of these rows in contiguous pages and they're filled to as close to 100% as they can get. Everything is fine... no fragmentation. But then you do "post insert processing" that updates the rows you just inserted. If the size of the rows are increased due to "ExpAnsive" then KAAAA-BOOOOOM!!! All of those totally full pages all end up splitting.

One of the most common sources of such expansion is when people use "poor man's auditing" and they have a "Modified_BY" column that goes from NULL to some value. There are many ways to fix that particular problem but, again, well beyond the scope of this thread and post.

Shifting gears to random GUIDs generated by NEWID()... there are a whole lot of reasons to not use them but, totally contrary to what you've been made to believe, fragmentation actually isn't one of them. I've given several presentations in a very "Alice's Restaurant Fashion" (lots of graphics and notations on the graphics) that prove it. To make a more than 1 hour presentation suitable for this post, I'll tell you that it all boils down to several small but deadly mistakes that people keep making...

  1. They keep using REORGANIZE because it's supposedly a "Best Practice" is the main problem. They don't realize that REORGANIZE doesn't actually work on GUIDs they way they think it might. Instead of providing extra space on the pages, it actually removes the extra space and, THAT, my fellow index wranglers, actually PERPETUATES the fragmentation of GUIDs. YOU MUST NOT USE REORGANZE WHEN DOING INDEX MAINTENANCE ON RANDOM GUIDS! PERIOD!!! Not even if you're using the Express or Standard Editions. If you don't have the time, resources, or disk space to REBUILD them, it's actually better to not do any index maintenance on random GUIDs than it is to do it wrong by using REORGANZE. Wait until you can do a REBUILD.

  2. You MUST set a lower FILL FACTOR on random GUID keyed indexes. Leaving them at "0" is almost as bad as REORGANIZEing them. Depending, of course, on how wide the rows of the index are, how many are inserted per day, and how long you want to go with absolutely ZERO page splits (not even supposed "good" ones!!!) on the random GUID indexes, I tell people to set their FILL FACTOR to 71, 81, or 91. The reason why I make all of those end with a "1" is because of the final thing you need to fix for random GUIDs when "ExpAnsive" updates aren't present, which is item #3 below.

  3. You MUST check indexes that are based on random GUIDs every bloody night. The reason I chose to give them all FILL FACTORs ending in "1" is because THAT is what you're looking for as a % of logical fragmentation. Just as soon as they go over the 1% mark, you MUST REBUILD them because nearly every page in the entire index is at the point where it's going to split. (I call these "Low Threshold Rebuilds"). Now, don't get confused. If everything is setup correctly and there are no "ExpAnsive" updates, your GUID keyed Clustered Indexes can go weeks with NO PAGE SPLITs or the related fragmentation and your much more narrow non-clustered indexes can quite literally go for MONTHS with absolutely no fragmentation!

The other big mistake is, of course, "ExpAnsive" updates. Those will kill just about everything but, surprisingly, random GUIDs will actually weather such an onslaughyt a whole lot better than most anything else using the same steps as above.

What you really need to do is fix the "ExpAnsive" updates so that they are no longer "ExpAnsive". Like I said, that's a whole 'nuther subject that's way to long for this post.

Jeff Moden
  • 3,271
  • 2
  • 27
  • 23
  • Very interesting! Could you please clarify item 3? Let's say I have FILL FACTOR 81, what should I check "every bloody night"? avg_fragmentation_in_percent < 1%? Also, is index rebuild really needed in this case if I have SSD as a storage, or your investigation was for spinning disks only? – avitenberg Jul 23 '21 at 16:15
  • 1
    Item number 3 is pretty clear although I can see why people might not think so. It does say to check for fragmentation and just as soon as the fragmentation goes of 1%, to a rebuild. You MUST NOT USE REORGANIZE ON RANDOM GUIDs because it does nothing to clean up the area above the Fill Factor. The reason it has to be 1% and not higher is because Random GUIDs are almost perfectly distributed. If one page starts to fragment, almost all the pages will also start fragmenting. If you wait longer, you will suffer from MASSIVE page splits and fragmentation. – Jeff Moden Jul 23 '21 at 20:29
  • 1
    If you want a super detailed explanation and demonstration, see the upcoming EightKB.Online event coming up on 28 July 2021 where I give a 1 hour presentation on the subject. I destroy the myth of GUID Fragmentation, demo how to use GUIDs to actually PREVENT fragmentation, and totally lay waste to what the whole world is mistakenly using and a "Best Practice" for index maintenance. It turns out that it's actually a WORST PRACTICE and prove it in the demos in my presentation. If you don't see this until after the event, they'll have a recording. – Jeff Moden Jul 23 '21 at 20:33
1

This is expected behavior for Guid indexes with large numbers of inserts. Most of the time you choose guids as keys only because the records are generated by multiple sources and you need to make individual sources don't step on each other's toes. An example here would be offline mobile devices. A worker in the field needs to create a new record when not connected, and so the mobile device can safely create the record with a guid as the key. When back online later on, the device can sync with the database safely without worrying about any key collisions.

If you're generating the guids on a single server, you're normally better off using a simple identity column. If you really want the guids, you can still include them... you just might want to think twice about using them for your clustered index. The only reason you might want to cluster on guid is if later on you'll come back to the table and query for one record at a time based on it's guid. That seems unlikely with the insert rate you're seeing. But, if it's the case, you can help mitigate things by reducing the fill factor on the index. This will increase the amount of disk space used (and mean more disk seeks later on), but pages will fill up faster and you'll avoid some of the index re-shuffling.

Another option to look at here if an integer-based key is out of the question is a sequential guid. This still provides uniqueness, while also creating less fragmentation.

Joel Coehoorn
  • 399,467
  • 113
  • 570
  • 794
  • If I understand your answer correctly, it can be summarized as "make the clustered GUID index a non-clustered index." Is that about right? Everything else you mentioned is already acknowledged background. – ashes999 Jan 17 '12 at 15:12
-2

have a look at this simple query in the stackexchange data explorer. It looks like newsequentialid() increments the most significant portion of the guid, rather than the least. This would be a likely cause of the fragmentation you are seeing.

If you must use guids, perhaps look into generating them via code and sending them through in your insert statements, rather than relying on the database generating them. Use the "comb" technique, to use the current timestamp as part of the guid, incrementing in the least significant digits.


edit

.. or if you don't want to generate them in code, you could do something like this within the database

CAST(CAST(NEWSEQUENTIALID() AS BINARY(10)) + CAST(GETDATE() AS BINARY(6)) AS UNIQUEIDENTIFIER)

as your default value, as per this modification to the above query

Community
  • 1
  • 1
rejj
  • 1,216
  • 7
  • 13
  • 1
    I don't think your answer is correct. `newsequentialid` is definitely less fragmentationary than `newid`. Creating your own GUID is a great way to break the guarantee on non-uniqueness by introducing bugs. But no DV :) – ashes999 Jan 17 '12 at 15:14
  • I can certainly understand why people might think the most significant portion of NEWSEQUENTIALID GUIDs might be like that from that poor bit of code that was cited but the answer is that GUIDs are displayed a whole lot differently than how they're processed. GUIDs are "Byte Shuffled" and that affects how they're sorted. Please do a search for "GUID BYTE SHUFFLED" for more information and the significance of what the byte positions in a GUID actually are. – Jeff Moden Apr 22 '20 at 23:53
  • @ashes999 - Incorrect. NEWSEQUENTIALID has ALL the problems an IDENTITY column has including some severe fragmentation with post insert expansive updates, has NONE of the incredible ADVANTAGES of Random GUIDs (yep... I typed that correctly), and it also has a possible security fault in that it contains the MAC Address of the issuing computer. I have a demo that clearly demonstrates how to insert 5.8 MILLION rows into a Random GUID Clustered Index with < 1% fragmentation and no index maintenance over a 58 day period and a whole lot more. Just not enough room on a post here on SO. – Jeff Moden Jul 23 '21 at 20:38
  • @Reg - As for NEWSEQUENTIALID not being sequential, it IS. It just doesn't sort the way humans are used to seeing it. As stated, the MSB is the left most byte of the right most six bytes. It sorts funny to humans but sorts perfectly according to the UNIQUEIDENTIFIER datatype, which is byte shifting in the way it is handled behind the scenes. You can prove it yourself by inserting 100K rows on a new table that uses NEWSEQUENTIALID and then check the fragmentation... it will be virtually zero fragmentation. – Jeff Moden Jul 23 '21 at 20:43