
I recently became involved with a new software project which uses SQL Server 2000 for its data storage.

In reviewing the project, I discovered that one of the main tables uses a clustered index on its primary key which consists of four columns:

Sequence  numeric(18, 0)
Date      datetime
Client    varchar(9)
Hash      tinyint
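
For concreteness, a minimal sketch of the table as described (the table and constraint names are hypothetical; in SQL Server, a primary key is clustered by default):

    CREATE TABLE dbo.MainTable (
        Sequence numeric(18, 0) NOT NULL,  -- random value generated externally by clients (per the comments)
        Date     datetime       NOT NULL,
        Client   varchar(9)     NOT NULL,
        Hash     tinyint        NOT NULL,
        CONSTRAINT PK_MainTable
            PRIMARY KEY CLUSTERED (Sequence, Date, Client, Hash)
    );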

This table experiences a lot of inserts in the course of normal operation.

Now, I'm a C++ developer, not a DB Admin, but my first impression of this table design was that having these fields as a clustered index would be very detrimental to insert performance, since the data would have to be physically reordered on each insert.

In addition, I can't really see any benefit to this since one would have to be querying all of these fields frequently to justify the clustered index, right?

So basically I need some ammunition for when I go to the powers that be to convince them that the table design should be changed.

Avalanchis
  • What kind of values are stored in the "Sequence" column? (Cardinality/uniqueness, ascending or random over time, etc.) – Philip Kelley Jul 20 '10 at 18:39
  • The values in the sequence column are essentially random numbers generated by clients external to the DB which are supposed to be unique per each client. – Avalanchis Jul 20 '10 at 18:41
  • If you are going to go looking for performance problems, you are better off looking in the profiling data than in the data structures and existing indexes. Almost every DB optimization tip/truism I've seen has the caveat "Your mileage may vary, do actual performance tests before adopting this approach." – JohnFx Jul 20 '10 at 18:55
  • @JohnFX Truisms aren't globally true. There are plenty of cases where one can look at a table/index design, and without doing performance testing at all instantly know that there's a far superior solution that will give better performance. Which is better: an entire database based on `numeric(18, 0)` ID columns or on `bigint` ID columns? I don't need to test to know that using 8 bytes instead of 9 is a serious improvement. – ErikE Jan 27 '11 at 00:58
  • It is about triage. Sure you can find micro-optimizations like shaving a byte off a field, but why waste your time on it? Look for problems first, then solutions. Not the other way around. – JohnFx Jan 27 '11 at 04:37
  • @JohnFx Why in tarnation would you call changing a column in the clustered index from 9 bytes to 8 a micro-optimization? You don't think an 11% shrinkage in the space used by that column (in the CI and every NCI on the table) is a substantial improvement—and a super easy one, besides? P.S. If you don't tag me (as in @Erik) I won't know you've replied... – ErikE May 04 '11 at 17:25
  • @JohnFx How about a column in the clustered index that's char(100) but the largest value in it is 8 characters? Would you call it a micro-optimization to change it to var/char(8)? No? Then that's something that you can see just from the table design, without needing to do performance testing to "find out" that it's slower when the CI has 82 unnecessary bytes in it... – ErikE May 04 '11 at 17:27
  • @ErikE I would say that you're making a big assumption by saying there's "82 unnecessary bytes" just because the longest **current** value is only 8 characters long. What if there is a high probability that longer values will be stored later? Or even just a possibility of it, but a business requirement that the DB accommodate lengths up to 100 characters? It's important to understand *why* something is setup the way it is, before assuming you can change it. Going back to the OP, I agree that `bigint` is better than `numeric(18,0)`, although I'd question a lot of other things here before that. – Christopher Cashell Jan 30 '12 at 21:55
  • @Chris Those are valid concerns. Understanding the scope and business intent of data is of course of primary importance--you have to know what the data *means* before you can decide how to accommodate it. Please don't imagine I think otherwise! Also, "the order that one questions things in" has a lot to do with experience. This is also different from the order one *should* question things in--a highly philosophical question. One person could say go after the biggest material improvement first, while another says get the easy (though perhaps smaller) improvements out of the way first. – ErikE Jan 31 '12 at 19:20
  • @Chris I like to clear the obvious and easy gunk out of the way first: it clears my mind to allow the bigger things to really settle in. I find this a wildly effective strategy, and over and over I get better results than others who can't or don't analyze as deeply as I can and do. Last, given that the scenario about char(100) is **my** hypothetical, it's quite ironic you tell me I'm making a big assumption. In *my hypothetical* where a column is char(100), there really are only 8 characters needed. So now that I've spelled it out for you, do you agree reclaiming 82 bytes per row is a big win? – ErikE Jan 31 '12 at 19:24
  • @ErikE Yep, that covers it pretty well. If you have a char(100) column and you are sure you're only going to need 8 characters, that's a worthwhile change. As long as scope and requirements are considered, your points are all entirely valid. – Christopher Cashell Mar 14 '12 at 03:34

5 Answers


The clustered index should contain the column(s) queried on most often, to give the greatest chance of seeks, or of making a nonclustered index cover all the columns in the query.

The primary key and the clustered index do not have to be the same. They are both candidate keys, and tables often have more than one such key.

You said

In addition, I can't really see any benefit to this since one would have to be querying all of these fields frequently to justify the clustered index, right?

That's not true. A seek can be had just by using the first column or two of the clustered index. It may be a range seek, but it's still a seek. You don't have to specify all the columns of it in order to get that benefit. But the order of the columns does matter a lot. If you're predominantly querying on Client, then the Sequence column is a bad choice as the first in the clustered index. The choice of the second column should be the item that is most queried in conjunction with the first (not by itself). If you find that a second column is queried by itself almost as often as the first column, then a nonclustered index will help.
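For example, hypothetical queries against the sketch in the question that touch only the leading CI column(s) can still seek:

    -- Seeks on the clustered index even though only the first
    -- key column (Sequence) is specified
    SELECT Date, Client, Hash
    FROM dbo.MainTable
    WHERE Sequence = 123456789012;

    -- A range seek: equality on the leading column plus a range on the second
    SELECT Client, Hash
    FROM dbo.MainTable
    WHERE Sequence = 123456789012
      AND Date >= '20100101' AND Date < '20100201';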

As others have said, reducing the number of columns/bytes in the clustered index as much as possible is important.

It's too bad that the Sequence is a random value instead of an incrementing one, but there may be no way around that. The answer isn't to throw in an identity column unless your application can start using it as the primary query condition on this table (unlikely). Now, since you're stuck with this random Sequence column (presuming it IS the most often queried), let's look at another of your statements:

having these fields as a clustered index would be very detrimental to insert performance, since the data would have to be physically reordered on each insert.

That's not entirely true.

The physical location on the disk is not really what we're talking about here, but it does come into play in terms of fragmentation, which is a performance implication.

The rows inside each 8k page are not ordered. It's just that all the rows in each page are less than the next page and more than the previous one. The problem occurs when you insert a row and the page is full: you get a page split. The engine has to copy all the rows after the inserted row to a new page, and this can be expensive. With a random key you're going to get a lot of page splits. You can ameliorate the problem by using a lower fillfactor when rebuilding the index. You'd have to play with it to get the right number, but 70% or 60% might serve you better than 90%.
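On SQL Server 2000, the fill factor is applied when the index is rebuilt; a minimal sketch using the hypothetical names from the question:

    -- Rebuild the clustered index leaving 30% free space in each leaf
    -- page, so randomly placed inserts have room before a split occurs
    DBCC DBREINDEX ('dbo.MainTable', 'PK_MainTable', 70);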

I believe that having datetime as the second CI column could be beneficial. You'd still be dealing with pages needing to be split between two different Sequence values, but it's not nearly as bad as if the second CI column were also random: then you'd be guaranteed a page split on every insert, whereas with an ascending second column you can get lucky when the row can simply be added to a page, because the next Sequence number starts on the next page.

Shortening the data types of all columns, and reducing the number of columns, in a table as well as in its nonclustered indexes can boost performance too, since more rows per page means fewer page reads per request. This is especially true if the engine is forced to do a table scan. Moving a bunch of rarely-queried columns to a separate 1-1 table could do wonders for some of your queries.

Last, there are some design tweaks that could help as well (in my opinion):

  • Change the Sequence column to a bigint to save a byte for every row (8 bytes instead of 9 for the numeric).
  • Use a lookup table for Client with a 4-byte int identity column instead of a varchar(9), as sketched below. This saves 5 bytes per row. If possible, use a smallint (-32768 to 32767) which is 2 bytes, an even greater savings of 7 bytes per row.
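
A sketch of that lookup table (names hypothetical, assuming the smallint variant):

    -- Each distinct client string is stored once; the 2-byte key then
    -- replaces the 9-byte varchar in the main table and all its indexes
    CREATE TABLE dbo.ClientLookup (
        ClientId   smallint   IDENTITY(1,1) NOT NULL PRIMARY KEY,
        ClientCode varchar(9) NOT NULL,
        CONSTRAINT UQ_ClientLookup_ClientCode UNIQUE (ClientCode)
    );

The main table's Client column then becomes a smallint foreign key into this table.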

Summary: The CI should start with the column most queried on. Remove any columns from the CI that you can. Shorten columns (bytes) as much as you can. Use a lower fillfactor to mitigate the page splits caused by the random Sequence column (if it has to stay first because of being queried the most).

Oh, and get your online defragging going. If the table can't be changed, at least it can be reorganized frequently to keep it in best possible shape. Don't neglect statistics, either, so the engine can pick appropriate execution plans.
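On SQL Server 2000, the online defragmentation and statistics maintenance might look like this (hypothetical names again):

    -- Compacts and reorders leaf-level pages in place, without the
    -- long blocking locks of a full rebuild (0 = current database)
    DBCC INDEXDEFRAG (0, 'dbo.MainTable', 'PK_MainTable');

    -- Refresh statistics so the optimizer keeps choosing good plans
    UPDATE STATISTICS dbo.MainTable WITH FULLSCAN;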

UPDATE

Another strategy to consider is if the composite key used in the table can be converted to an int, and a lookup table of the values is created. Let's say some combination of less than all 4 columns is repeated in over 100 rows, for example, Sequence + Client + Hash but only with varying Date values. Then an insert to a separate SequenceClientHash table with an identity column could make sense, because then you could look up the artificial key once and use it over and over again. This would also get your CI to add new rows only on the last page (yay) and substantially reduce the size of the CI as repeated in all nonclustered indexes (yippee). But this would only make sense in certain narrow usage patterns.
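A hedged sketch of that idea (names hypothetical):

    -- One row per distinct (Sequence, Client, Hash) combination; the
    -- int identity becomes the narrow, ever-increasing artificial key
    CREATE TABLE dbo.SequenceClientHash (
        SchId    int           IDENTITY(1,1) NOT NULL PRIMARY KEY,
        Sequence numeric(18,0) NOT NULL,
        Client   varchar(9)    NOT NULL,
        Hash     tinyint       NOT NULL,
        CONSTRAINT UQ_SequenceClientHash UNIQUE (Sequence, Client, Hash)
    );

The main table would then cluster on something like (SchId, Date), so new rows for a known combination land at the end of that combination's range, and every nonclustered index carries a much narrower key than the original four-column composite.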

Now, marc_s suggested just adding an additional int identity column as the clustered index. It is possible that this could help by making all the nonclustered indexes get more rows per page, but it all depends on exactly where you want the performance to be, because this would guarantee that every single query on the table would have to use a bookmark lookup and you could never get a table seek.

About "tons of page splits and bad index fragmentation": as I already said this can be ameliorated somewhat with a lower fill factor. Also, frequent online index reorganization (not the same as rebuilding) can help reduce the effect of this.

Ultimately, it all comes down to the exact system and its unique pattern of data access combined with decisions about which parts you want optimized. For some systems, having a slower insert isn't bad as long as selects are always fast. For others, having consistent but slightly slower select times is more important than having slightly faster but inconsistent select times. For others, the data isn't really read until it's pushed to a data warehouse anyway so the inserts need to be as fast as possible. And adding into the mix is the fact that performance isn't just about user wait time or even query response time but also about server resources especially in the case of massive parallelism, so that total throughput (say, in client responses per time unit) matters more than any other factor.

ErikE

Clustered indexes (CI) work best over ever-increasing, narrow, rarely changing values. You'll want your CI to cover the column(s) that get hit the most often in queries with >=, <=, or BETWEEN statements.

I'm not sure how your data normally gets hit. Most often you'll see a CI on an IDENTITY column or another narrow column (because this column will also be returned "tacked on" to all non-clustered indexes, and we don't want a ton of data added on to every fetch if it isn't needed). It's possible the data might be getting queried most often on date, and that may be a good choice, but all four columns are likely not the right choice (I stress likely, because I don't know the set-up; there may not be anything wrong with it). There are some pointers here: http://msdn.microsoft.com/en-us/library/aa933131%28SQL.80%29.aspx

Mike M.

There are a few things you are misunderstanding about how SQL Server creates and uses indexes.

A clustered index doesn't necessarily keep the table physically ordered on disk, at least not in real time. It is just a logical ordering.

I wouldn't expect a major performance hit from this structure, and removing the clustered index before you have actually identified a performance issue related to it is clearly premature optimization.

Also, an index can be useful (especially one with several fields in it) even for searches that don't sort on or query all of the columns it includes.

Obviously, there should be a justification for creating a multi-part clustered index, just like any index, so it makes sense to ask for that if you think it was added capriciously.

Bottom line: Don't optimize the indexes for insert performance until you have actually detected a performance problem with inserts. It usually isn't worth it.

JohnFx
  • Actually - the data is (to a certain degree) physically ordered - and having such a compound clustering key **will** lead to significantly more (computationally expensive) page splits and high index fragmentation. – marc_s Jul 20 '10 at 19:06
  • Perhaps so, but my main point still stands. Optimization should be done based on actual performance data, not theory. – JohnFx Jul 20 '10 at 19:14
  • Sure - always measure and check your improvements. But this is quite an awful clustering index, really..... any change will be an improvement, for sure! – marc_s Jul 20 '10 at 19:33
  • @marc_s: Agreed - but I have a gut feeling based on the names of those fields that they might be used together for frequent sorting. Just a guess, though. Point taken. – JohnFx Jul 20 '10 at 20:53
  • I renamed the fields a bit to protect the innocent. My gut feeling is that they just selected these four columns for the PK because they wanted to ensure the records were unique based on them, and didn't consider the ramifications of making them part of a clustered index. – Avalanchis Jul 20 '10 at 21:20
  • @Avalanchis - you could probably make a case that another field in the table would make a BETTER clustered index more easily than arguing that this one should not be. Try that angle. – JohnFx Jul 20 '10 at 22:13

If you have only that single clustered index on your table, that might not be too bad. However, the clustering index is also used for looking up the real data page for any hit in a non-clustered index - therefore, the clustering key (all of its columns) is also part of each and every non-clustered index you might have on your table.

So if you have a few nonclustered indices on your table, you're definitely a) wasting a lot of space (and not just on disk - also in your server's RAM!), and b) hurting your performance.

A good clustered index ought to be:

  • small (best bet: a 4-byte INT) - yours is pretty bad with up to 28 bytes per entry
  • unique
  • stable (never change)
  • ever-increasing

I would bet your current setup violates at least two if not more of those requirements. Not following these recommendations will lead to waste of space, and as you rightfully say, lots of page and index fragmentation and page splits (having to "rearrange" the data when an insert happens somewhere in the middle of the clustered index).

Quite honestly: just add a surrogate ID INT IDENTITY(1,1) to your table and make that the primary clustered key - you should see quite a nice boost in performance, just from that, if you have lots of INSERT (and UPDATE) operations going on!
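
A sketch of that change against the hypothetical table from the question (you would likely keep the original four columns unique, to preserve the business rule):

    -- Add a narrow, ever-increasing surrogate key
    ALTER TABLE dbo.MainTable ADD ID int IDENTITY(1,1) NOT NULL;

    -- Move the clustered primary key onto the surrogate
    ALTER TABLE dbo.MainTable DROP CONSTRAINT PK_MainTable;
    ALTER TABLE dbo.MainTable ADD CONSTRAINT PK_MainTable PRIMARY KEY CLUSTERED (ID);

    -- Preserve the original uniqueness rule as a nonclustered constraint
    ALTER TABLE dbo.MainTable ADD CONSTRAINT UQ_MainTable
        UNIQUE NONCLUSTERED (Sequence, Date, Client, Hash);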

See some more background info on what makes a good clustering key, and what is important about them, here:

marc_s
  • "Quite honestly: just add a surrogate ID INT IDENTITY(1,1) to your table and make that the primary clustered key" Um, I beg to differ. This may help nonclustered indexes pack more rows per page, but is not a good idea if almost any queries contain the Sequence as part of the WHERE clause, because you'll be permanently disabling all table seeks! Also, including columns in the CI can help sometimes if it makes the nonclustered index a covering query so that the CI doesn't have to be hit at all. – ErikE Jul 21 '10 at 00:22
  • @Emtucifor: if the queries really do include the Sequence, it would be a better idea to have those columns involved in a separate, non-clustered index that can cover those most frequent queries. With this CI setup, you'll have tons of page splits and bad index fragmentation - which you can get rid of by using a surrogate INT – marc_s Jul 21 '10 at 05:04
  • Okay, I suppose it's possible, but I think major testing is in order with the full panoply of production queries and data in place. I'm not convinced that forcing every single query to a bookmark lookup is a sure way to fix the problems caused by the random key (and the corresponding page splits). – ErikE Jul 22 '10 at 01:09

I ultimately agree with Erik's last paragraph:

"Ultimately, it all comes down to the exact system and its unique pattern of data access combined with decisions about which parts you want optimized..."

This is the basic thing I force people to learn: there's no universal solution.

You have to know your data and the actions performed against it. You have to know how frequent the different types of actions are, their impact, and their expected execution times (you don't have to hard-tune some rarely executed query, and impact everything else, if the end user agrees the query execution time is not so important; say, waiting a few minutes for some report once per week is okay). Of course, as Erik said

"performance isn't just about user wait time or even query response time but also about server resources"

If such a query affects overall server performance, it should be considered a serious candidate for optimization, even if its execution time is fine. I've seen some very fast queries that used a huge amount of CPU on multiprocessor servers, while slightly slower solutions were incomparably "lighter" from a resource-utilization point of view. In such cases I almost always go for the slower one.

Once you know what your goal is, you can decide how many indexes you need and which one should be clustered. Unique constraints, filtered indexes, and indexes with included columns are quite powerful tuning tools. Choosing the proper columns is important, but often choosing the proper order of columns is even more important. And in the end, don't kill insert/update performance with tons of indexes if the table is frequently modified.
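
For illustration, hypothetical examples of two of those tools; note that both postdate the SQL Server 2000 in the question (INCLUDE requires 2005+, filtered indexes 2008+):

    -- Covering index: the key columns serve the predicate, the included
    -- column rides along at the leaf level without widening the key
    CREATE NONCLUSTERED INDEX IX_MainTable_Client_Date
        ON dbo.MainTable (Client, Date)
        INCLUDE (Hash);

    -- Filtered index: indexes only the slice of rows that gets queried
    CREATE NONCLUSTERED INDEX IX_MainTable_Recent
        ON dbo.MainTable (Date)
        WHERE Date >= '20120101';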

Niikola
  • Just to clarify: I wasn't so much focusing on a particular query's use of a large quantity of system resources. What I meant was that in a high transaction per second system, while individual response time is important, it's still less important than total resource utilization, if such utilization delays other requests. It's better for 200 clients to get their answers in 0.5 seconds than for 50 to get their answer in 0.001 seconds and the other 150 to wait 1 second. Total throughput (with some governance on max wait time) may be most important. – ErikE Apr 19 '11 at 19:35