0

In star-schema design, dimension tables for fact table(s) are almost a must. In many business cases, the modeler can assume with acceptable confidence that certain dimension values will not change without control. For example the Gender is sometimes used as a dimension where in fact it could as well be a column in the fact table.

I am not sure if people always think about this. It is easier to just add a dimension and not think about it. However, as one of the Kimball rules, there should not bee too many dimensions for a fact table (I think the number he suggested is about 20).

I took gender as an example, but there are many other such as state name, city name, credit card type, etc.

My Question is:

What rule should one use to decide whether to embed/Collapse the value in the fact table vs. having a separate dimension for it? Some of the possible answers are: 1. When it does not change (e.g. gender). 2. When it has few possible values and have short length?

What else?

EDIT

Even though I considered the question answered, I still went to further research. There is a case when you have may have to use a dimension. The case is here: "Fact dimensions are frequently used to support drill through actions because the drill through action in SQL Server Analysis Services (SSAS) requires that you select the attributes from a dimension. So if you users want to see certain fields when they do a drill through, you must have those fields in a dimension."

The above was quoted from here Degenerate Dimensions

I think the subject needs further analysis for the interested person(s).

NoChance
  • 5,632
  • 4
  • 31
  • 45

1 Answers1

1

It sounds like you're describing what Kimball calls "degenerate dimensions" - where you store the dimension value directly in the fact table. Kimball's own take on these is that you use them when you just have the one column for that dimension. It's often something that's very low cardinality with the fact. A good example is something like purchase order - you would have a purchase order fact table, and then have a column called PurchaseOrderReference which is actually a degenerate dimension, rather than having a purchase order dimension that's one-to-one with the fact.

However, the reason to keep the number of dimensions fairly low is to stop the fact table from becoming too wide - and if you start putting your dimension columns into your facts, they will become much wider, and performance will suffer. For instance, if you're adding large string columns or certain number formats, these could be bigger than the dimension key columns - and if you are adding several columns where you could get away with a single dimension key, the impact is much greater.

The solution for this is the denormalisation of dimensions. You mention state and city - these could well go into one location dimension together, then you just have one dimension key from your fact to the location dimension, instead of two (one to city and one to state). Your example of gender could potentially be a candidate for a degenerate dimension, but only if gender is the only attribute about people that you store. If you have a person dimension, or even have several attributes about people (gender, height, hair colour for instance), these would be better off in a dimension together.

You are correct that a lot of people don't design this way - I think this is because many people are used to more normalised databases where separate city and state tables would be normal (and correct), and so they incorrectly create many dimensions with few attributes where a few wider dimensions would be better.

Jo Douglass
  • 2,055
  • 1
  • 19
  • 30
  • Thank you for your detailed answer and I agree with it. Please clarify the part "...and if you start putting your dimension could...will suffer". I think what you mean here is that the length of text could be > length of the FK, which is an interesting point, I had not considered before! Thanks again. – NoChance Jan 10 '17 at 17:26
  • 1
    @NoChance Clarification added - I was thinking particularly about cases where several columns could be replaced by a single dimension key, but yes, even single columns could perform worse depending on data type and size. Kimball does suggest using degenerate dimensions where the dimension would otherwise have one attribute, but he makes an exception for large things like "notes" fields - he suggests chucking those into their own dimension for performance reasons, rather than leaving them on the fact. – Jo Douglass Jan 10 '17 at 19:05
  • Thanks for your explanations. I also remember that at least one OLAP query tool depended on the physical dimension table existence to help the user construct a report. If one would collapse a dimension, such tools may not recognize the column and this could make reporting using such tools suffer. I wanted to know about this point, but I guess that is out of the scope of the current question. – NoChance Jan 11 '17 at 09:10
  • I don't see Gender as a potential candidate for a degenerate dimension- Kimball is clear that it is something that looks like a dimension key, like your purchase order number example. Kimball does not suggest using degenerate dimensions where the dimension would otherwise have one attribute, its for situations where there is no content other than its primary key. Gender is content rather than a key. See https://en.wikipedia.org/wiki/Degenerate_dimension for how the 'one column in a fact instead' is misleading. – Rich Feb 15 '17 at 12:13
  • @Rich I wouldn't choose to model Gender as a degenerate dimension, as noted in my answer - this is generally an attribute on another dimension. In the very isolated example where you only have Gender and no other related attributes to put in a dimension with it though, it would then meet the criteria of having no other content. You just have the name - 'Female', 'Male', etc. If you were to have a Gender dimension (which again, I do not suggest is a good idea), the name of the gender would be the natural key - a natural key is not always an integer. – Jo Douglass Feb 15 '17 at 12:21
  • @Rich When I say "no other columns," I mean "no columns, other than the natural key." I don't see how this differs from your definition of "no content other than its primary key" - unless you are saying primary key when you mean surrogate key? – Jo Douglass Feb 15 '17 at 12:26
  • Although its true that a natural key is not always an integer, and obviously the gender would also be unique in a table that only had gender, Kimball was referring to keys that are IDs, that don't go along with content. What real word entity does a gender represent, that its a key of? Its a descriptive attribute.For attributes that don't have a dimensional home like this Kimball would instead suggest a junk dimension collecting various (otherwise homeless) attributes and flags. – Rich Feb 15 '17 at 13:23
  • @Rich The entity/concept it represents *is* gender - compare to having a City dimension. It represents the concept of a city, and the city name would be its natural key. In the same way, the gender name would be the natural key of a Gender dimension. However, I feel like we're getting away from the point of the post, and long comment threads are discouraged here. This is a very minor point in my answer, where I was actually looking to *discourage* the reader from defining Gender as a dimension - degenerate or otherwise. I would be happy to discuss further in a chat though, if you'd like? – Jo Douglass Feb 15 '17 at 14:01
  • It was good that you discouraged the questioner from defining Gender as a dimension, but by referring it to it as a question of degenerate dimensions was not correct if you intended to have the same meaning as Kimball. – Rich Feb 15 '17 at 20:06