What data type is optimal for the clustered index of a table published using transactional replication?

Question

We have an application that stores data in a SQL Server database (currently we support SQL Server 2005 and higher). Our DB has more than 400 tables, and its structure is not ideal. The biggest problem is that many tables use GUIDs (NEWID()) as clustered primary keys. When I asked our main database architect why, he said it was because of replication: our DB has to support transactional replication. Initially, all primary keys were INT IDENTITY(1,1) CLUSTERED, but when replication support was added, these fields were replaced by UNIQUEIDENTIFIER DEFAULT NEWID(). He said, “Otherwise it was a nightmare to deal with replication.” NEWSEQUENTIALID() was not an option at that time, because SQL Server 7/2000 did not support it. So now we have tables with the following structure:

CREATE TABLE Table1(
    Table1_PID uniqueidentifier DEFAULT NEWID() NOT NULL,
    Field1 varchar(50) NULL,
    FieldN varchar(50) NULL,
    CONSTRAINT PK_Table1 PRIMARY KEY CLUSTERED (Table1_PID)
)
GO

CREATE TABLE Table2(
    Table2_PID uniqueidentifier DEFAULT NEWID() NOT NULL,
    Table1_PID uniqueidentifier NULL,
    Field1 varchar(50) NULL,
    FieldN varchar(50) NULL,
    CONSTRAINT PK_Table2 PRIMARY KEY CLUSTERED (Table2_PID),
    CONSTRAINT FK_Table2_Table1 FOREIGN KEY (Table1_PID) REFERENCES Table1 (Table1_PID)
)
GO

All the tables actually have a lot of fields (up to 35) and up to 15 non-clustered indexes.

I know that a non-sequential GUID, whether generated on the client (using .NET) or by the NEWID() SQL function (as in our case), is a horribly bad choice for a clustered index, for two reasons:

fragmentation
size
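The fragmentation half of that claim can be measured directly. The following is a minimal sketch, assuming the table is named Table1 as in the example above and using the sys.dm_db_index_physical_stats function that is available in SQL Server 2005 and later:

```sql
-- Report fragmentation for every index on Table1 (table name taken from the example above).
-- Variables are used for the arguments so the call also works under older compatibility levels.
DECLARE @db int, @obj int;
SET @db  = DB_ID();
SET @obj = OBJECT_ID(N'Table1');

SELECT  i.name AS index_name,
        ps.index_type_desc,
        ps.avg_fragmentation_in_percent,
        ps.page_count
FROM    sys.dm_db_index_physical_stats(@db, @obj, NULL, NULL, 'LIMITED') AS ps
        INNER JOIN sys.indexes AS i
            ON i.object_id = ps.object_id
           AND i.index_id  = ps.index_id
ORDER BY ps.avg_fragmentation_in_percent DESC;
```

Running it before and after a batch of inserts gives a rough picture of how quickly a NEWID() clustered key fragments.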

I also know that a good clustering key is:

unique,
narrow,
static,
ever-increasing,
non-nullable,
and fixed-width

For more details on the reasons behind this, check out the following great video: http://technet.microsoft.com/en-us/sqlserver/gg508879.aspx.

So, INT IDENTITY really is the best choice. BIGINT IDENTITY is also good, but typically an INT with 2+ billion rows should be sufficient for the vast majority of tables.

When our customers began suffering from fragmentation, it was decided to make primary keys NON-clustered. As a result, those tables remained without a clustered index. In other words, those tables were turned into HEAPS. I personally don’t like this solution because I am sure that heap tables are not part of a good database design. Please, check this SQL Server Best Practices Article: http://technet.microsoft.com/en-us/library/cc917672.aspx.
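To see which tables ended up as heaps after that change, a query along these lines could be used. It is only a sketch; it relies on the fact that a heap is represented in sys.indexes by a row with type = 0.

```sql
-- List user tables that have no clustered index (heaps).
-- In sys.indexes a heap appears as a row with type = 0 (index_id = 0).
SELECT  s.name AS table_schema,
        t.name AS table_name
FROM    sys.tables AS t
        INNER JOIN sys.schemas AS s ON s.schema_id = t.schema_id
        INNER JOIN sys.indexes AS i ON i.object_id = t.object_id
WHERE   i.type = 0
ORDER BY s.name, t.name;
```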

We are currently considering two options to improve the database structure:

The first option is to replace DEFAULT NEWID() by DEFAULT NEWSEQUENTIALID() for the Primary clustered key:

CREATE TABLE Table1_GUID (
  Table1_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL,
  Field1 varchar(50) NULL,
  FieldN varchar(50) NULL,
  CONSTRAINT PK_Table1 PRIMARY KEY CLUSTERED (Table1_PID)
)
GO
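For a table that already exists, the first option amounts to swapping the default constraint. The sketch below assumes the original Table1 definition shown earlier, where the DEFAULT NEWID() was declared without a name, so the system-generated constraint name has to be looked up first. DF_Table1_PID is a hypothetical name chosen for this example.

```sql
-- Find the system-generated name of the existing default on Table1.Table1_PID.
DECLARE @df sysname, @sql nvarchar(max);

SELECT  @df = dc.name
FROM    sys.default_constraints AS dc
        INNER JOIN sys.columns AS c
            ON c.object_id = dc.parent_object_id
           AND c.column_id = dc.parent_column_id
WHERE   dc.parent_object_id = OBJECT_ID(N'Table1')
  AND   c.name = N'Table1_PID';

-- Drop the old DEFAULT NEWID() constraint if one was found.
IF @df IS NOT NULL
BEGIN
    SET @sql = N'ALTER TABLE Table1 DROP CONSTRAINT ' + QUOTENAME(@df) + N';';
    EXEC sp_executesql @sql;
END

-- Add the sequential default under an explicit (hypothetical) name.
ALTER TABLE Table1
    ADD CONSTRAINT DF_Table1_PID DEFAULT NEWSEQUENTIALID() FOR Table1_PID;
GO
```

Existing rows keep their random values, so the index would still need a rebuild to remove the fragmentation that is already there; only new rows receive sequential GUIDs.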

The second option is to add an INT IDENTITY column to each table and make it the UNIQUE CLUSTERED index, leaving the primary key nonclustered. Table1 would then look like this:

CREATE TABLE Table1_INT (
  Table1_ID int IDENTITY(1,1) NOT NULL,
  Table1_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL,
  Field1 varchar(50) NULL,
  FieldN varchar(50) NULL,
  CONSTRAINT PK_Table1 PRIMARY KEY NONCLUSTERED (Table1_PID),
  CONSTRAINT UK_Table1 UNIQUE CLUSTERED (Table1_ID)
)
GO

Table1_PID will be used for replication (that’s why we left it as the PK), while Table1_ID will not be replicated at all.
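One way to keep Table1_ID out of a transactional publication is to drop the column from the article (vertical filtering) with sp_articlecolumn. This is a rough sketch only: the publication name MyPublication is hypothetical, the article is assumed to be named after the table, and depending on the state of the publication additional parameters (for example @force_invalidate_snapshot or @force_reinit_subscription) may be required. How the subscriber table gets its own Table1_ID column is outside the scope of this sketch.

```sql
-- Run at the Publisher, in the publication database.
EXEC sp_articlecolumn
    @publication = N'MyPublication',   -- hypothetical publication name
    @article     = N'Table1_INT',      -- assumed to match the table name
    @column      = N'Table1_ID',
    @operation   = N'drop';            -- remove the column from the article
GO
```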

Long story short, after we ran benchmarks to see which approach is better, we found that neither solution is good:

The first approach (Table1_GUID) revealed the following shortcoming: although sequential GUIDs are definitely a lot better than random GUIDs, they are still four times larger than an INT (16 vs. 4 bytes). This matters in our case because we have lots of rows in our tables (up to 60 million) and lots of non-clustered indexes on those tables (up to 15). The clustering key is added to each and every non-clustered index, which multiplies the cost of 16 vs. 4 bytes. More bytes means more pages on disk and in SQL Server's memory, and thus more disk I/O and more work for SQL Server.

To be more precise, after I inserted 25 million rows of real data into each table and then created 15 non-clustered indexes on each table, I saw a big difference in the space used by the tables:

EXEC sp_spaceused 'Table1_GUID' -- 14.85 GB
EXEC sp_spaceused 'Table1_INT' -- 11.68 GB
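A back-of-the-envelope estimate is consistent with a gap of this order. The sketch below counts only the clustering key bytes carried by the 15 non-clustered indexes, using the row and index counts from the test. It ignores page and row overhead, and it ignores that Table1_INT also stores an extra 4-byte column and keeps the GUID in its nonclustered primary key, so the measured difference would be expected to come out smaller than the estimate.

```sql
DECLARE @rows bigint, @nc_indexes int, @guid_bytes int, @int_bytes int;
SET @rows       = 25000000;  -- rows inserted in the test
SET @nc_indexes = 15;        -- non-clustered indexes per table
SET @guid_bytes = 16;        -- uniqueidentifier clustering key
SET @int_bytes  = 4;         -- int clustering key

-- Extra clustering-key bytes stored across all non-clustered indexes.
SELECT  @rows * @nc_indexes * (@guid_bytes - @int_bytes) AS extra_bytes,
        CAST(@rows * @nc_indexes * (@guid_bytes - @int_bytes) / 1073741824.0
             AS decimal(10, 2)) AS extra_gb;
```

The estimate comes to 4,500,000,000 bytes, or roughly 4.2 GB, against a measured difference of 14.85 - 11.68 = 3.17 GB.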

Furthermore, the test showed that INSERTs into Table1_GUID were a bit slower than to Table1_INT.

The second approach (Table1_INT) revealed that for most SELECT queries joining two tables on Table1_INT.Table1_PID = Table2_INT.Table1_PID, the execution plan became worse because an additional Key Lookup operator appeared.

Now the question: I believe there should be a better solution to our problem. If you could recommend something or point me to a good resource, I would greatly appreciate it. Thank you in advance.

Updated:

Let me give you an example of a SELECT statement where an additional Key Lookup operator appears:

--Create 2 tables with int IDENTITY(1,1) as CLUSTERED KEY.
--These tables have one-to-many relationship.
CREATE TABLE Table1_INT (
    Table1_ID int IDENTITY(1,1) NOT NULL,
    Table1_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL,
    Field1 varchar(50) NULL,
    FieldN varchar(50) NULL,
    CONSTRAINT PK_Table1_INT PRIMARY KEY NONCLUSTERED (Table1_PID),
    CONSTRAINT UK_Table1_INT UNIQUE CLUSTERED (Table1_ID)
)
GO

CREATE TABLE Table2_INT(
    Table2_ID int IDENTITY(1,1) NOT NULL,
    Table2_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL,
    Table1_PID uniqueidentifier NULL,
    Field1 varchar(50) NULL,
    FieldN varchar(50) NULL,
    CONSTRAINT PK_Table2_INT PRIMARY KEY NONCLUSTERED (Table2_PID),
    CONSTRAINT UK_Table2_INT UNIQUE CLUSTERED (Table2_ID),
    CONSTRAINT FK_Table2_Table1_INT FOREIGN KEY (Table1_PID) REFERENCES Table1_INT (Table1_PID)
)
GO

And create two other tables for comparison:

--Create the same 2 tables, BUT with uniqueidentifier NEWSEQUENTIALID() as CLUSTERED KEY.
CREATE TABLE Table1_GUID (
    Table1_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL,
    Field1 varchar(50) NULL,
    FieldN varchar(50) NULL,
    CONSTRAINT PK_Table1_GUID PRIMARY KEY CLUSTERED (Table1_PID)
)
GO

CREATE TABLE Table2_GUID(
    Table2_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL,
    Table1_PID uniqueidentifier NULL,
    Field1 varchar(50) NULL,
    FieldN varchar(50) NULL,
    CONSTRAINT PK_Table2_GUID PRIMARY KEY CLUSTERED (Table2_PID),
    CONSTRAINT FK_Table2_Table1_GUID FOREIGN KEY (Table1_PID) REFERENCES Table1_GUID (Table1_PID)
)
GO

Now run the following select statements and look at the execution plan to compare:

SELECT T1.Field1, T2.FieldN
FROM Table1_INT T1 
    INNER JOIN Table2_INT T2 
        ON T1.Table1_PID = T2.Table1_PID;

SELECT T1.Field1, T2.FieldN
FROM Table1_GUID T1 
    INNER JOIN Table2_GUID T2 
        ON T1.Table1_PID = T2.Table1_PID;

Execution plan
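A Key Lookup appears when a nonclustered index locates the row but does not contain every column the query needs. One common way to remove it, not something proposed in the post itself, is a covering index that carries the extra column with INCLUDE (available since SQL Server 2005). The index names below are hypothetical, and whether the optimizer actually picks these indexes depends on the data and on the rest of the query.

```sql
-- Covers the Table1_INT side of the join: join column as the key,
-- selected column as an included (leaf-level only) column.
CREATE NONCLUSTERED INDEX IX_Table1_INT_PID_Field1
    ON Table1_INT (Table1_PID)
    INCLUDE (Field1);

-- Same idea for the Table2_INT side of the join.
CREATE NONCLUSTERED INDEX IX_Table2_INT_Table1PID_FieldN
    ON Table2_INT (Table1_PID)
    INCLUDE (FieldN);
GO
```

Each covering index adds storage and write overhead of its own, which is a real cost on tables that already carry up to 15 non-clustered indexes.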

What is your *replication* topology? One publisher, one subscriber? Many subscribers? Many publishers? Do subscribers ever update the data? Do subscribers' updates propagate back to the publisher? - Remus Rusanu, May 18 '13 at 17:23
Well, the replication topology is not simple. As I mentioned above, we have a lot of tables. These tables are divided into, let’s say, 2 groups. One group is published on the main server (Publisher) and has several subscribers (local servers with pull subscriptions). The second group of tables is published on the local servers and has one subscriber (push subscription): the main server. I am sorry, I have to split my message into several parts due to the length constraint on comments... (see my next comment). - Alex, May 18 '13 at 19:50
The second group probably requires more explanation. A subset of tables from the second group is replicated from one local server to another local server, BUT we don’t have direct replication between local servers. Instead, the data is first replicated from one local server to the main server, and then a trigger on the main server inserts the data into another table, which is published on the main server and replicated to the other local server. There is special logic in the trigger and in the stored procedures used for replication which makes sure that the data goes in the right direction. To be continued... - Alex, May 18 '13 at 19:50
Regarding updates, the data can be changed (inserted, updated) on the publisher(s) as well as on the subscriber(s). And as far as I know, subscribers’ updates never propagate back to the publisher(s). - Alex, May 18 '13 at 19:51

score 2 · Answer 1 · answered May 18 '13 at 17:16

I personally use INT IDENTITY for most of my primary and clustering keys.

You need to keep the primary key separate in your mind: it is a logical construct that uniquely identifies your rows, so it has to be unique, stable, and NOT NULL. A GUID works well for a primary key, too, since it's guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since some replication types (merge replication, for instance) need a uniquely identifying GUID column anyway.

The clustering key in SQL Server is a physical construct used for the physical ordering of the data, and it is a lot more difficult to get right. The Queen of Indexing on SQL Server, Kimberly Tripp, requires a good clustering key to be unique, stable, as narrow as possible, and ideally ever-increasing (which an INT IDENTITY is).

See her articles on indexing, and also see Jimmy Nilsson's The Cost of GUIDs as Primary Key.

A GUID is a really bad choice for a clustering key, since it's wide and totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key column(s) are stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep the key small: a GUID is 16 bytes vs. 4 bytes for an INT, and with several non-clustered indexes and several million rows, this makes a HUGE difference.
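The two sizes can be confirmed with DATALENGTH; a quick check:

```sql
-- Storage size in bytes of a uniqueidentifier value versus an int value.
SELECT  DATALENGTH(NEWID())        AS guid_bytes,  -- 16
        DATALENGTH(CAST(1 AS int)) AS int_bytes;   -- 4
```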

In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.

I fully agree with you and K. Tripp that INT IDENTITY is the best candidate for a UNIQUE CLUSTERED index. I would like to add two more links to your collection: http://www.sqlskills.com/BLOGS/PAUL/post/Clustered-or-nonclustered-index-on-a-random-GUID.aspx and http://sqlserverperformance.wordpress.com/2010/03/22/why-uniqueidentifier-is-a-bad-choice-for-a-clustered-index-in-sql-server/ However, in my case the situation is complicated by the fact that the tables must be replicated. And if I apply the second approach (Table1_INT), the performance of SELECT statements degrades, which is my concern. - Alex, May 18 '13 at 21:26
