5

Say I have a table with a large number of rows and one of the columns which I want to index can have one of 20 values. If I were to put an index on the column would it be large?

If so, why? If I were to partition the data into 20 tables, one for each value of the column, the index size would be trivial but the indexing effect would be the same.
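
For concreteness, here is a rough sketch of the sort of schema I mean (table and column names are made up):

```sql
-- Hypothetical table: many rows, one low-cardinality column.
CREATE TABLE readings (
    id       BIGINT PRIMARY KEY,
    category INT NOT NULL,        -- only ~20 distinct values ever appear here
    payload  VARCHAR(100)
);

-- Option A: index the low-cardinality column directly.
CREATE INDEX ix_readings_category ON readings (category);

-- Option B: split the rows into readings_1 .. readings_20, one table per
-- category value, and drop both the index and the column entirely.
```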

Mr. Flibble
  • 26,564
  • 23
  • 69
  • 100
  • The indexing effect would be the same, but what about when you want a second index? – WW. Mar 26 '09 at 01:27

9 Answers

7

It's not the indexes that will suck. It's putting indexes on the wrong columns that will suck.

Seriously though, why would you need a table with a single column? What would the meaning of that data be? What purpose would it serve?

And 20 tables? I suggest you read up on database design first, or otherwise explain to us the context of your question.

Jon Limjap
  • 94,284
  • 15
  • 101
  • 152
  • I've seen a database with a separate table for each attribute of the actual entities. Why: they want version history and time-travel for every attribute. Imagine that database with 300 tables, where most fields are of the type "DateTime"... – thijs Mar 25 '09 at 15:50
  • @thijs but you would still require two columns, one as the key and one as the attribute – Nathan Koop Mar 25 '09 at 16:03
  • 1
    I phrased that badly. There is one column which I want to index, not one column in total. I'll edit my question with more details of the table structure. – Mr. Flibble Mar 26 '09 at 10:08
  • Single column tables are extraordinarily useful. Consider a system with 1,000,000 customers, of which about 250 are on credit hold at any given time. What would you prefer — an OnCreditHold column in Customers, or an OnHoldCustomers table with only a CustomerID column and 250 rows? – Larry Lustig Mar 03 '10 at 16:36
7

Indexes (or indices) don't suck. A lot of very smart people have spent a truly remarkable amount of time over the last several decades ensuring that this is so.

Your schema, however, lacking the same amount of expertise and effort, may suck very badly indeed.

Partitioning, in the case described, is equivalent to applying a clustered index. If the table is sorted some other way (or is in arbitrary order) then the index necessarily has to occupy much more space. Depending on the platform, a non-clustered index may shrink as the sortedness of the rows with respect to the indexed value increases.

YMMV.
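
To illustrate, a minimal SQL Server-flavoured sketch (the table and its columns are invented, not from the question):

```sql
-- Clustering on (category, id) physically groups the rows by category,
-- which is roughly what the 20-table split would achieve.
CREATE TABLE readings (
    id       BIGINT       NOT NULL,
    category INT          NOT NULL,  -- one of ~20 values
    payload  VARCHAR(100) NULL
);

-- The clustered index *is* the table's storage order; there is no
-- separate full copy of every row's category value to pay for.
CREATE CLUSTERED INDEX cx_readings ON readings (category, id);
```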

Mike Woodhouse
  • 51,832
  • 12
  • 88
  • 127
  • Good! I suspected this partitioning was like using a clustered index. This leads me to the question: is there any value in partitioning the table myself over using a clustered index? I think the performance hit on inserts would be minimal if I only need to add a bit of code to choose the correct table to insert into. – Mr. Flibble Mar 26 '09 at 12:12
  • Would there be a greater performance hit if I used a clustered index? Does data have to get shifted about a lot on each insert where there is a clustered index, or is it smarter than that? – Mr. Flibble Mar 26 '09 at 12:14
  • A table with a clustered index is (by definition) sorted on the indexed columns. So inserting across all values is probably going to cost. It could actually be worse with a partitioned table, though - you'd have to suck it and see. Don't forget to try a non-clustered index in the comparison, either! – Mike Woodhouse Mar 26 '09 at 13:53
3

No, indexes don't suck, but you have to pay attention to how you use them, or they can backfire on the performance of your queries.

First: Schema / design
Why would you create a table with only one column? That's probably taking normalization one step too far. Database design is one of the most important things to consider when optimizing performance.

Second: Indexes
In a nutshell, an index helps the database perform a binary search for your records. Without an index on a column (or set of columns) the database will often fall back to a table scan. A table scan is very expensive because it involves visiting each and every record.

It doesn't really matter THAT much for an index search how many records there are in the table. Because the search walks a (balanced) binary tree, doubling the number of records only adds one extra search step; a million rows take about 20 steps, two million about 21.

Determine the primary key of your table; SQL Server will automatically place a clustered index on those column(s). Clustered indexes perform really well. In addition, you can place non-clustered indexes on columns that are used often in SELECT, JOIN, WHERE, GROUP BY and ORDER BY clauses. Do remember that indexes overlap: a non-clustered index already carries the clustered key internally as its row locator, so never include the clustered index columns in a non-clustered index again.

Also interesting is the fill factor on your indexes. Do you want to optimize your table for reads (high fill factor: less storage, less IO) or for writes (low fill factor: more storage, but fewer page splits on insert)?
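
By way of illustration, a T-SQL sketch of both points (the Orders table and its columns are assumptions, not from the question):

```sql
-- The primary key gets a clustered index by default in SQL Server.
CREATE TABLE Orders (
    OrderID    INT IDENTITY PRIMARY KEY,  -- clustered index lands here
    CustomerID INT      NOT NULL,
    Status     TINYINT  NOT NULL,
    OrderDate  DATETIME NOT NULL
);

-- A non-clustered index on a column used often in WHERE/JOIN clauses.
-- It already carries OrderID internally as its row locator, so there is
-- no need to add the clustered key to it explicitly.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON Orders (CustomerID)
    WITH (FILLFACTOR = 90);  -- leave 10% free per page for future inserts
```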

Third: Partitioning
One of the reasons to use partitioning is to optimize your data access. Let's say you have 1 million records, of which 500,000 are no longer relevant but are kept for archiving purposes. In this case you could partition the table and store the 500,000 old records on slow storage and the other 500,000 on fast storage.
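
A hedged sketch of what that could look like in SQL Server 2005+ (the names and the boundary date are made up, and both filegroups must already exist):

```sql
-- Split rows on a date boundary: old rows go to a filegroup on slow
-- disks, current rows to a filegroup on fast disks.
CREATE PARTITION FUNCTION pfByDate (DATETIME)
    AS RANGE LEFT FOR VALUES ('2008-01-01');

CREATE PARTITION SCHEME psByDate
    AS PARTITION pfByDate TO (SlowStorageFG, FastStorageFG);

CREATE TABLE Readings (
    ReadingID   BIGINT   NOT NULL,
    ReadingDate DATETIME NOT NULL,
    Value       FLOAT    NULL
) ON psByDate (ReadingDate);
```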

To measure is to know
The best way to get insight into what happens is to measure what happens to your CPU and IO. Microsoft SQL Server has tools like the Profiler and the execution plans in Management Studio that will tell you the duration of your query, the number of reads/writes and the CPU usage. The execution plan will also tell you which indexes are being used, if any; you might be surprised to see a table scan where you didn't expect one.
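
For example, you can get the read counts and timings per statement like this (the query and table are hypothetical):

```sql
-- Report logical/physical reads and CPU/elapsed time per statement
-- in the Messages tab of Management Studio.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT CustomerID, COUNT(*)
FROM Orders
WHERE Status = 3
GROUP BY CustomerID;
```

Turning on the actual execution plan (Ctrl+M in Management Studio) before running the query then shows whether an index was used or the statement fell back to a table scan.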

Yvo
  • 18,681
  • 11
  • 71
  • 90
3

The short answer: do indexes suck? Yes and no.

The longer answer: they don't suck if used properly. Maybe you should start reading up on how indexes work, why they can help and why they sometimes don't.

A good starting point: http://www.sqlservercentral.com/articles/Indexing/

thijs
  • 3,445
  • 1
  • 27
  • 46
2

Say I have a table with a large number of rows and one column which I want to index can have one of 20 values. If I were to put an index on the column would it be large?

The index size will be proportional to the number of your rows and the length of the indexed values.

The index keeps not only the indexed value, but also some kind of pointer to the row (ROWID in Oracle, CTID in PostgreSQL, the primary key in InnoDB, etc.).

If you have 10,000 rows and a single distinct value, you will still have 10,000 records in your index.
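
You can verify this yourself; for example, in PostgreSQL (table and index names are just for illustration):

```sql
-- Even with a single distinct value, the index holds one entry per row.
CREATE TABLE t (val INT);
INSERT INTO t SELECT 1 FROM generate_series(1, 10000);
CREATE INDEX ix_t_val ON t (val);

SELECT pg_relation_size('ix_t_val');  -- grows with the row count,
                                      -- not with the number of distinct values
```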

If so, why? If I were to partition the data into 20 tables, one for each value of the column, the index size would be trivial but the indexing effect would be the same.

In this case, you would end up with 20 indexes whose total size is about the same as your original one.

In fact, this technique is sometimes used: it's called a partitioned index. It has its advantages and drawbacks.
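
In Oracle, for instance, a locally partitioned index gives you roughly the effect of the 20-table split without giving up the single logical table (all names are hypothetical):

```sql
-- One list partition per value; the LOCAL index is split the same way,
-- so each index partition covers only that partition's rows.
CREATE TABLE readings (
    id       NUMBER,
    category VARCHAR2(10),
    payload  VARCHAR2(100)
)
PARTITION BY LIST (category) (
    PARTITION p_a     VALUES ('A'),
    PARTITION p_b     VALUES ('B'),
    PARTITION p_other VALUES (DEFAULT)   -- catch-all for the rest
);

CREATE INDEX ix_readings_category ON readings (category) LOCAL;
```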

Quassnoi
  • 413,100
  • 91
  • 616
  • 614
  • In Oracle, the COMPRESS option on index creation can reduce the need to have multiple copies of the same indexed value represented in the index. However, you still need all the rowids. – WW. Mar 26 '09 at 01:27
  • My point is that if I partition into 20 tables then I wouldn't need any index on the column, as within each table every row would have the same value. – Mr. Flibble Mar 26 '09 at 12:02
  • If you partition into 20 tables you don't even need the column – Quassnoi Mar 26 '09 at 14:55
2

Standard b-tree indexes are best suited to fairly selective columns, which this example would not be. You don't say which DBMS you are using; Oracle has another type of index called a bitmap index which is better suited to low-selectivity columns in OLAP environments (these indexes are expensive to maintain, which makes them unsuitable for OLTP environments).
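
For reference, creating one is a one-liner (assuming an Oracle table with a low-cardinality status column):

```sql
-- One compact bitmap per distinct value, so ~20 values over millions of
-- rows stay small. Best kept to read-mostly (OLAP) tables.
CREATE BITMAP INDEX ix_orders_status ON orders (status);
```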

The optimiser will decide, based on stats, whether it thinks the index will help get the data in the fastest time; if it won't, the optimiser won't use it.

Partitioning is another strategy. In Oracle you can define a table as partitioned on some set of columns, and the optimiser can automatically perform "partition elimination" like you suggest.
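
Assuming a table list-partitioned on the 20-value column (much like the Oracle sketch in the earlier answer), the optimiser would touch only one of the 20 partitions here:

```sql
-- Only the partition holding category 'A' is scanned; the other 19
-- are eliminated. The Pstart/Pstop columns of the plan confirm it.
EXPLAIN PLAN FOR
    SELECT COUNT(*) FROM readings WHERE category = 'A';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```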

Tony Andrews
  • 129,880
  • 21
  • 220
  • 259
  • FYI: Table partitioning (spread data over files) based on the contents of columns is also possible in MSSQL 2005 and up – thijs Mar 25 '09 at 15:54
1

Sorry, I'm not quite sure what you mean by "large".

  • If your index is clustered, all the data for each record will be on the same leaf page, thereby creating the most efficient index available to your table as long as you write your queries against it properly.

  • If your index is non-clustered, then only the index-related data will be on your leaf pages. Then, depending on such things as how many other indexes you have, coupled with details like your fill factor, your index may or may not be efficient. In general, if you don't have a ton of indexes on your table, you should be safe.

  • The efficiency of your index will also be determined by the data type of the 20 values going into the column. If those are pre-defined values, their details should probably be in a lookup table with a simple primary key datatype (like Int/Number). Then add that column to your table as a foreign key, with an index on it (see the sketch below).
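
A sketch of that lookup-table pattern in generic SQL (all names invented):

```sql
-- Pre-defined values live in a small lookup table...
CREATE TABLE Statuses (
    StatusID   INT PRIMARY KEY,
    StatusName VARCHAR(50) NOT NULL
);

-- ...and the big table stores only the small integer key.
CREATE TABLE BigOrders (
    OrderID  BIGINT PRIMARY KEY,
    StatusID INT NOT NULL REFERENCES Statuses (StatusID)
);

CREATE INDEX IX_BigOrders_StatusID ON BigOrders (StatusID);
```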

Ultimately, you could have a perfect index on a column, but its best use will be determined for the most part by the queries you write. So if your queries make use of the indexes, you're golden.

TheHolyTerrah
  • 2,859
  • 3
  • 43
  • 50
  • The table has 600 million rows. There are about 5 columns, all but one of which are used for select filtering, and one which is the data column. But for the sake of this question we could say there are 3 columns: Col1, Col2, Col3. Say Col1 is the PK, Col2 has 20 possible values and Col3 is the data column. – Mr. Flibble Mar 26 '09 at 12:07
  • It seems to me that there is something wrong if the index on Col2 is massive, as I can roll my own index by partitioning into 20 tables, 1 per Col2 value. – Mr. Flibble Mar 26 '09 at 12:08
  • 1
    At 600M rows, I hope you're talking about an OLAP table, not an OLTP table. That's a lot of rows to be managing! Now you're getting into serious warehouse DB architecture theory that would have to take into consideration many other factors of your database. I'd love to hear of your final decision. – TheHolyTerrah Mar 27 '09 at 20:23
0

Indexes are purely for performance. If an index doesn't boost performance for the queries you're interested in, then it sucks.

As for disk usage, you have to weigh your concerns. Different SQL providers build indexes differently, but as a client, you generally trust that they do the best that can be done. In the case you're describing, a clustered index may be optimal for both size and performance.

harpo
  • 41,820
  • 13
  • 96
  • 131
  • " If an index doesn't boost performance for the queries you're interested in, then it sucks." I beg to differ. I do agree, if the index serves no purpose, it is just extra overhead. But the purpose may be much wider than the query or queries you are currenlty examining. – HLGEM Mar 25 '09 at 18:13
  • You're right... I overstated a bit. After posting I thought, it might be designed for future data scenarios. – harpo Mar 25 '09 at 18:43
0

It would be large enough to hold those values for all the rows, in a sorted order.

Say you have 20 different strings of 4 characters and 1 million rows: it would take at least 4 million bytes (or 8 million with 16-bit Unicode) to hold those values.

Lasse V. Karlsen
  • 380,855
  • 102
  • 628
  • 825
  • Well, not necessarily. If all the rows on a page had the same column value, for example, a smart indexing engine might be able to use less space by recording that fact instead. IMHO of course, I could easily be wrong... – Mike Woodhouse Mar 25 '09 at 21:25