5

The problem that I'm trying to solve is:

1 - In our DB we have all the tables (also tables with millions of records) with a PK id column declared as VARCHAR(36). There is also a clustered index on it This of course as I read online it's an awful thing for performance, also because the db has a lot of reads, inserts, updates and deletes.

2 - We use Hibernate for our java web application as an ORM to this db

After extensive reading online I started changing the data type of those columns to UNIQUEIDENTIFIER with the default option newsequentialid() as this option should mitigate the fragmentation issues of our indexes.

What I noted is that the fragmentation issue persisted, the tables became heavily fragmented short after rebuilds (we do a full index rebuild every night).

Then I saw that all of our Hibernate mappings for the id columns consisted of this:

<id name="id" column="id" type="string">
        <generator class="guid"/>
</id>

When an insert occurred in our system the log showed that the insert was done after calling select newid(), so since this returns a random guid the insert would be placed in a random point in the index, thus causing fragmentation (this totally defeat the column data type change I did also).

So after another online search I tried to implement a guid generator in Hibernate by myself, implementing the interface IdentifierGenerator and using a time based generator with JUG (http://wiki.fasterxml.com/JugHome).

The code that generates the (I thought sequential) id is this:

String uuid = null;   
EthernetAddress nic = EthernetAddress.fromInterface();    
TimeBasedGenerator uuidGenerator = Generators.timeBasedGenerator(nic);                
uuid = uuidGenerator.generate().toString();

And I changed accordingly the mapping to this:

<id name="id" column="id" type="string">
            <generator class="my_package.hibernate.CustomSequentialGuidGenerator">
            </generator>
</id>

Then I tried to generate some test uuids to test their sequentiality (sequential in the uniqueidentifier way, so binary), this is a short list (every element is generated before the successive) :

314a9a1b-6295-11e5-8d2c-2c27d7e1614f
3d867801-6295-11e5-ae09-2c27d7e1614f
4434ac7d-6295-11e5-9ed1-2c27d7e1614f
491462c4-6295-11e5-af81-2c27d7e1614f
5389ff4c-6295-11e5-84cf-2c27d7e1614f
57098959-6295-11e5-b203-2c27d7e1614f
5b62d144-6295-11e5-9883-2c27d7e1614f

This look to me as alphabetically sequential, but not binary sequential.

The above test was done executing seven times the test application, it wasn't a loop.

I tried to insert those values in a column declared as unique identifier and after issuing a select on this column this is the list sql server outputs:

5389FF4C-6295-11E5-84CF-2C27D7E1614F
314A9A1B-6295-11E5-8D2C-2C27D7E1614F
5B62D144-6295-11E5-9883-2C27D7E1614F
4434AC7D-6295-11E5-9ED1-2C27D7E1614F
3D867801-6295-11E5-AE09-2C27D7E1614F
491462C4-6295-11E5-AF81-2C27D7E1614F
57098959-6295-11E5-B203-2C27D7E1614F

So I really don't understand what I should do and if I can use JUG as a sequential guid generator to avoid my fragmentation issues.

This is another JUG test, I tried 3 runs each time generating 10 guids with a loop:

Run 1

54bd156e-62a2-11e5-a1a7-2c27d7e1614f
54c3cc2f-62a2-11e5-a1a7-2c27d7e1614f
54caf820-62a2-11e5-a1a7-2c27d7e1614f
54d1aee1-62a2-11e5-a1a7-2c27d7e1614f
54d901e2-62a2-11e5-a1a7-2c27d7e1614f
54df9193-62a2-11e5-a1a7-2c27d7e1614f
54e64854-62a2-11e5-a1a7-2c27d7e1614f
54ecff15-62a2-11e5-a1a7-2c27d7e1614f
54f3b5d6-62a2-11e5-a1a7-2c27d7e1614f
54fa4587-62a2-11e5-a1a7-2c27d7e1614f

Run 2

87c66bcc-62a2-11e5-8e7c-2c27d7e1614f
87ccd46d-62a2-11e5-8e7c-2c27d7e1614f
87d3641e-62a2-11e5-8e7c-2c27d7e1614f
87d97e9f-62a2-11e5-8e7c-2c27d7e1614f
87e05c70-62a2-11e5-8e7c-2c27d7e1614f
87e6ec21-62a2-11e5-8e7c-2c27d7e1614f
87ed7bd2-62a2-11e5-8e7c-2c27d7e1614f
87f40b83-62a2-11e5-8e7c-2c27d7e1614f
87fac244-62a2-11e5-8e7c-2c27d7e1614f
880103d5-62a2-11e5-8e7c-2c27d7e1614f

Run 3

a4b690db-62a2-11e5-b667-2c27d7e1614f
a4bcd26c-62a2-11e5-b667-2c27d7e1614f
a4c2eced-62a2-11e5-b667-2c27d7e1614f
a4c92e7e-62a2-11e5-b667-2c27d7e1614f
a4cf48ff-62a2-11e5-b667-2c27d7e1614f
a4d5d8b0-62a2-11e5-b667-2c27d7e1614f
a4dc6861-62a2-11e5-b667-2c27d7e1614f
a4e34632-62a2-11e5-b667-2c27d7e1614f
a4e9d5e3-62a2-11e5-b667-2c27d7e1614f
a4f101d4-62a2-11e5-b667-2c27d7e1614f

Run 4

c2b872b2-62a2-11e5-b855-2c27d7e1614f
c2c17363-62a2-11e5-b855-2c27d7e1614f
c2c82a24-62a2-11e5-b855-2c27d7e1614f
c2ce92c5-62a2-11e5-b855-2c27d7e1614f
c2d57096-62a2-11e5-b855-2c27d7e1614f
c2dc2757-62a2-11e5-b855-2c27d7e1614f
c2e32c38-62a2-11e5-b855-2c27d7e1614f
c2e9bbe9-62a2-11e5-b855-2c27d7e1614f
c2f099ba-62a2-11e5-b855-2c27d7e1614f
c2f7507b-62a2-11e5-b855-2c27d7e1614f

Run 5

f0263d1b-62a2-11e5-8529-2c27d7e1614f
f02d1aec-62a2-11e5-8529-2c27d7e1614f
f033d1ad-62a2-11e5-8529-2c27d7e1614f
f03a615e-62a2-11e5-8529-2c27d7e1614f
f041181f-62a2-11e5-8529-2c27d7e1614f
f047a7d0-62a2-11e5-8529-2c27d7e1614f
f04dc251-62a2-11e5-8529-2c27d7e1614f
f05403e2-62a2-11e5-8529-2c27d7e1614f
f05a6c83-62a2-11e5-8529-2c27d7e1614f
f0608704-62a2-11e5-8529-2c27d7e1614f

Run 6 (Started from 0 again)

00fd4ec3-62a3-11e5-8ab8-2c27d7e1614f
01042c94-62a3-11e5-8ab8-2c27d7e1614f
010b3175-62a3-11e5-8ab8-2c27d7e1614f
0111e836-62a3-11e5-8ab8-2c27d7e1614f
0118ed17-62a3-11e5-8ab8-2c27d7e1614f
011fcae8-62a3-11e5-8ab8-2c27d7e1614f
0126a8b9-62a3-11e5-8ab8-2c27d7e1614f
012d115a-62a3-11e5-8ab8-2c27d7e1614f
0133c81b-62a3-11e5-8ab8-2c27d7e1614f
013a30bc-62a3-11e5-8ab8-2c27d7e1614f

The single groups are alphabetically (but not binary) ordered and taken the different runs as a whole they aren't event alphabetically ordered (sigh).

What am I missing?

************************* EDIT - Description of my implementation ******************

After the various comments and answers I implemented the following strategy:

I generated my own sequential (based on current timestamp) guids and this is the generator class:

package it.hibernate;

import java.io.Serializable;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.commons.lang.RandomStringUtils;
import org.hibernate.HibernateException;
import org.hibernate.engine.SessionImplementor;
import org.hibernate.id.IdentifierGenerator;



public class CustomSequentialGuidGenerator implements IdentifierGenerator{


    @Override
    public Serializable generate(SessionImplementor session, Object object)
            throws HibernateException 
        {

        String uuid = null;
        try {

            Date data = new Date();

             SimpleDateFormat sdf = new SimpleDateFormat(); 
             String rand = RandomStringUtils.randomAlphanumeric(12);

             sdf.applyPattern("yyyy");
             String year = sdf.format(data);

             sdf.applyPattern("MM");
             String month = sdf.format(data);

             sdf.applyPattern("dd");
             String day = sdf.format(data);

             sdf.applyPattern("HH");
             String hour = sdf.format(data);

             sdf.applyPattern("mm");
             String mins = sdf.format(data);

             sdf.applyPattern("ss");
             String secs = sdf.format(data);

             sdf.applyPattern("SSS");
             String millis = sdf.format(data);

             //G carachter is used to insert the rows after
             uuid = "GG" + year + month + "-" + day + hour + "-" + mins + secs + "-" + "0" + millis + "-" + rand;


        } 
        catch (Exception exception) 
        {
            exception.printStackTrace();                
        }

        return uuid;
    }
}

You can note that all rows start with the string 'GG' because I had to make sure that all new rows would be inserted AFTER all the old rows generated via select newid(). After that there's the current timestamp and 12 random chars to avoid collision in case of a multiple row insert in the same millisecond.

After a test of 2000 insert the primary key index fragmentation dropped from 17,92% to 0,15%.

N.B. The data type I reintroduced is obviously a varchar(36) again and not a uniqueidentifier, so the rows are sorted alphabetically.

frankieta
  • 962
  • 1
  • 9
  • 31
  • 1
    It's a shame there doesn't seem to be a Hibernate generator that works like Identity, but for GUIDs. Then you could use a default constraint of NEWSEQUENTIALID on the column on the server side, presumably, and Hibernate would simply read back the value it had provided... (Incidentally, is there a particular reason (replication?) that you need to use GUIDs rather than a simple incrementing IDENTITY column?) Or could you simply consider adding an IDENTITY column and clustering by that, but indexing the GUID? – Matt Gibson Sep 24 '15 at 10:44
  • Thanks Matt! Could you please explain better this part "Then you could use a default constraint of NEWSEQUENTIALID on the column on the server side, presumably, and Hibernate would simply read back the value it had provided" ? I generated the costraint I already have on the column I changed to uniqueidentifier and this is what sql server tells me I have: ALTER TABLE [dbo].[e1_tur_servizi] ADD CONSTRAINT [DF__Tmp_e1_tur_s__id__759D1B5E] DEFAULT (newsequentialid()) FOR [id] GO So it looks that a newsequentialid() costraint is there already, or am I missing something (probable)? – frankieta Sep 24 '15 at 11:19
  • 1
    (Bear in mind I don't know that much about Hibernate, so you shouldn't take word as gospel!) You currently have what I'd say is a correct default constraint on the column; however, as it's a default it will only be used if your insert statement *doesn't include* the `id` column. As Hibernate is generating a GUID and providing it in the INSERT, that's the one you get and your default is ignored. For an IDENTITY column, I *think* Hibernate has a specific generator that will not include that column in an INSERT statement, but instead retrieve the value generated by the server. – Matt Gibson Sep 24 '15 at 11:28
  • Tnx Matt. Yes Hibernate has an option for the IDENTITY column but I guess it would work for int or bigint column I think and mine is a varchar(36). You're totally right about sql server ignoring my default value since it's provided already by hibernate. – frankieta Sep 24 '15 at 12:32

1 Answers1

6

The default option of newsequentialid() of course did not work, because hibernate does not use the default, it always sets a value issued by its generator.

By taking a quick look at the JUG library, it appears that it does not offer any means of generating GUIDs sequentially. I do not know why you thought that the generate() method of the generator obtained via Generators.timeBasedGenerator() would give you sequential GUIDs. A time based generator is simply a generator which takes the current time into account when generating GUIDs, but it is free to mangle the current time coordinate in any way it sees fit when embedding it into the GUID, so it does not guarantee that there will be anything sequential about the resulting GUIDs.

Generally, the terms "GUID" and "sequential" are incompatible with each other. You can either have keys that are GUIDs, or keys that are sequential, but under normal circumstances, you cannot have both.

So, are you sure that the keys must be GUIDs? Personally, I find GUIDs very hard to work with.

But if you must do any hacks necessary as to have sequential GUIDs, then my recommendation would be to write your own function which generates 36-character strings that look like GUIDs, but are sequential.

The sequential part should come from a SEQUENCE, which simply issues sequential integers. (I believe MS-SQL-Server supports them.)

You can read the IETF's UUID specification on how to construct GUIDs properly, but you do not have to follow it to the letter. For the most part, if it simply looks like a GUID, it is good enough.

If you can have a single global sequence for this, that's good. If you cannot have a single global sequence, then you need to somehow identify your sequences, and then take the identifier of each sequence into account when generating your GUIDs. (That would be the "node id" mentioned in the IETF documentation.)

I once had the unreasonable requirement that the rows that I was to transmit to a certain web service had to be identified by GUIDs, and there was too much red tape that was preventing me from contacting them to ask them "are you friggin' serious?" so I just transmitted GUIDs like the following:

|--- random part -----| |-- key ---|
314a9a1b-6295-11e5-8d2c-000000000001
314a9a1b-6295-11e5-8d2c-000000000002
314a9a1b-6295-11e5-8d2c-000000000003
314a9a1b-6295-11e5-8d2c-000000000004
314a9a1b-6295-11e5-8d2c-000000000005
...

They did not say a word.

Mike Nakis
  • 56,297
  • 11
  • 110
  • 142
  • Thanks Mike, we currently use SQL Server 2008 (sorry for not mentioning) and onestly I have to investigate about the SEQUENCE object (available from SS 2012). Thanks for confirming me that JUG did not produced sequential guids, I guess I have to do like you suggested in your edit (a random part and a sequential key). The point now is generating this sequential key globally at runtime before every insert. What can I use? – frankieta Sep 24 '15 at 11:11
  • 2
    First of all, do you have a server cluster or some other multi-node configuration as to actually have a real need for guids? Because if not, then things are simple: use a plain integer identity key column, and if someone must see it as a guid, add a computed column which presents the identity value as a guid. If you must have guids, and you do not have sequences, then you need to emulate a sequence using a table with a single integer identity column, but it will be considerably slower than a sequence. – Mike Nakis Sep 24 '15 at 11:31
  • No, we don't really have a real necessity for guids but at this time they're all over the tables (like 1.000 tables or little less) in our db. We have really many queries that include that column as filter so changing now would mean rewrite A LOT of code. I could add an idenitity column and use that as a clustered index but wouldn't this harm the performance of the queries that use the varchar id since now it is not helped by an index? – frankieta Sep 24 '15 at 12:37
  • 1
    Yes, performance will most probably be harmed. And the way you describe things, even though you do not have a real necessity for GUIDs, it appears that you must stick with GUIDs. – Mike Nakis Sep 24 '15 at 12:53
  • 1
    So, go with the option of generating your own GUIDs. – Mike Nakis Sep 24 '15 at 12:54
  • Ok, I will try to generate my own GUIDs, do you think after a random part I can use the current timestamp to have a guarantee of sequentiality? Thanks again for your time and help. – frankieta Sep 24 '15 at 13:12
  • You can, but you do not need to. Sequentiality is guaranteed by using the number issued by the SEQUENCE (or sequence-implemented-by-an-identity-column-in-another-table since your version of MS-SQL-Server does not support sequences.) – Mike Nakis Sep 24 '15 at 13:17
  • 1
    @frankieta Of course, there's nothing to stop you having a nonclustered unique index on the guid as well as a clustered index on an identity. That should still give decent performance on your current queries. – Matt Gibson Sep 24 '15 at 13:38
  • Yes I know, is just that I don't have extensive time to test a great number of changes on the schema so I wanted to keep things the least different possible from the current (broken) implementation. Thanks Matt – frankieta Sep 24 '15 at 13:47