1

Consider a fact table of the form:

CREATE TABLE Fact1
(
    Dim1 int NOT NULL,
    Dim2 int NOT NULL,
    Dim3 int NOT NULL,
    Data1 int NOT NULL,
    Data2 int NOT NULL
    ...
)

Fact1 has a single column index on each of the dimensions. Dim1 is assumed to be the time dimension with a granularity down to range of hours (e.g. between 2 PM and 6 PM on March 12 2011). Would it be useful to include Dim2 and Dim3 as covering columns within Dim1? Or likewise on any of them?

More generally, would it ever be useful to include the other dimension table FK columns as a covering column on an index for a given dimension?

Note: For the fact table, we are assuming there is no need to uniquely identify a given fact. Hence, the lack of a primary key or surrogate key. The uniqueness is guaranteed by (Dim1, Dim2, Dim3) always being a unique tuple.

Mike Bailey
  • 12,479
  • 14
  • 66
  • 123
  • 1
    Covering indices are used for when you want to get the data just from the index without going to the table itself. How often are you querying this table for data only on Dim2 and only on Dim3, and conversely, when you are querying for Dim1 and also want data from Dim2 and Dim3? – N West Jun 15 '12 at 18:43
  • @NWest: There is no "most common query" really. There are several that are either solely on any one of the dimensions, or any combination of them. – Mike Bailey Jun 15 '12 at 18:46
  • 1
    Wait, `Dim1` is a _range_ of time? Why? And what are `Dim2` and `Dim3` then? Really, whether you want an index or not is going to be based on profiling and query patterns. And you _have_ a unique key - it's (`Dim1`, `Dim2`, `Dim3`) - you may want to make this a primary key (although, I'm not a DBA, so...) – Clockwork-Muse Jun 15 '12 at 18:47
  • 1
    @X-Zero usually in a warehouse environment integrity is enforced by the ETL process, not the DBMS. – N West Jun 15 '12 at 18:48
  • @X-Zero: `Dim1` is a range of time because the quantity being modeled is something over a period of time. It's senseless to consider anything finer than that because anything finer than a few hours is useless. Also @NWest got down the integrity part. – Mike Bailey Jun 15 '12 at 18:49
  • @MikeBantegui If you're using a DBMS that supports skip scans on their indexes (like oracle), then it's probably useful to include Dim2 and Dim3 in your B-Tree index. I almost always advocate for Bitmap indexes on dimensions however, since your DBMS can then use bitmap logical operations to select the appropriate rows. – N West Jun 15 '12 at 18:50
  • If you're on SQL and you're building covering indexes, why not just index-organize the fact table (clustered index) by the most used Dim first? I'm assuming you're building covering indexes to allow quick access for COUNT() queries? – N West Jun 15 '12 at 18:55
  • @NWest: It's hard to determine how to build that clustered index at this moment. There's a very even mix of queries which use one, two, or all of the dimensions and in different ways. I asked my question as general as possible because there's no one specific query that I need to optimize for. Rather, I have to optimize for a very broad range of queries, all of which are more or less equally likely. – Mike Bailey Jun 15 '12 at 19:00

1 Answers1

4

I'm going to try to answer the more general question - "Would it ever be useful to include the other dimension table FK columns as a covering column on an index for a given dimension?"

Yes. If you have a significant number of queries which do things such as COUNT(), where a covering index allows you to scan a smaller data set, then adding those other dimensions may be valuable.

SELECT Dim1, Dim2, count(*)
from Fact1
group by Dim1, Dim2

With an index on only Dim1 or only Dim2, you end up having to do a FTS to do this count. This may be perfectly fine. Full scans are not always bad. However, if you want to speed up these sorts of queries (say the fact table is very wide), then adding a B-tree index on Dim1, Dim2 would allow the DBMS to go to the index to count, instead of having to go to the table to count. Note that it still will do a full scan of the index which may be only marginally faster than a full table scan.

In general, I doubt you would see that much of a performance gain since you are still scanning all the rows of the index anyway, and unless the index was significantly smaller than the table you're probably not going to get a big improvement.

Since it's a fact table, the only queries where covering indexes on dimensions will help is when it's only the dimensions themselves are being queried. Anything that uses the facts will require an index scan, then a lookup in the table for the actual data.

I would probably just build your B-tree indexes on the dims for queries that use the keys (and joins) and then add additional ones as needed when the system has been running for awhile and common queries have been identified.

The other case that I can think of where a "covering" index such as this may help speed up queries is when you have queries that are focusing on the a specific dimension combination, and you only want those specific rows.

SELECT Dim1, Dim2, Data1, Data2
  FROM Fact1 
 WHERE Dim1 = @A and Dim2 = @B;

You may see a very slight performance gain if you have a b-tree index on Dim1, Dim2 rather than just Dim1, since you scan the index for all of the items in the WHERE clause, and then get your fact data.

N West
  • 6,768
  • 25
  • 40
  • Thanks. That was more or less what I was looking for. There are some specifics but I wanted to know the general reason why or why not you may using the covering index for this particular schema type. – Mike Bailey Jun 15 '12 at 19:14
  • Added one other case when this "covering" index may help speed up queries (though it doesn't really act as a "covering" index in this case. – N West Jun 16 '12 at 16:26
  • Note that I have evolved in my thinking and would not necessarily recommend a Kimball-style approach to anything. However, as you will hear that from the majority of data warehouse people, I'll leave the answer as is. – N West Feb 03 '16 at 14:25