0

First project using star schema, still in planning stage. We would appreciate any thoughts and advice on the following problem.

We have a dimension table for "product features used", and the set of features grows and changes over time. Because of the dynamic set of features, we think the features cannot be columns but instead must be rows.

We have a fact table for "user events", and we need to know which product features were used within each event.

So it seems we need to have a primary key on the fact table, which is used as a foreign key within the dimension table (exactly the opposite direction from a conventional star schema). We have several different dimension tables with similar dynamics and therefore a similar need for a foreign key into the fact table.

On the other hand, most of our dimension tables are more conventional and the fact table can just store a foreign key into these conventional dimension tables. We don't like that this means that some joins (many-to-one) will use the dimension table's primary key, but other joins (one-to-many) will use the fact table's primary key. We have considered using the fact table key as a foreign key in all the dimension tables, just for consistency, although the storage requirements increase.

Is there a better way to implement the keys for the "dynamic" dimension tables?

Here's an example that's not exactly what we're doing but similar:

  • Suppose our app searches for restaurants.

  • Optional features that a user may specify include price range, minimum star rating, or cuisine. The set of optional features changes over time (for example we may get rid of the option to specify cuisine, and add an option for most popular). For each search that is recorded in the database, the set of features used is fixed.

  • Each search will be a row in the fact table.
  • We are currently thinking that we should have a primary key in the fact table, and it should be used as a foreign key in the "features" dimension table. So we'd have:

    fact_table(search_id, user_id, metric1, metric2)
    feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
    user_dimension_table(user_id, user_attribute1, user_attribute2)

  • Alternatively, for consistent joins and ignoring storage requirements for the sake of argument, we could use the fact table's primary key as a foreign key in all the dimension tables:

    fact_table(search_id, metric1, metric2) /* no more user_id */
    feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
    user_dimension_table(user_id, search_id, user_attribute1, user_attribute2)

  • What are the pitfalls with these key schemas? What would be better ways to do it?

user1020872
  • 111
  • 1
  • 9
  • If people can address either of the two sub-questions (1. what's wrong with our idea, 2. what's a better way), that would still be very helpful. – user1020872 Feb 12 '16 at 16:44

1 Answers1

2

You need a Bridge table, it is the recommended solution for many-to-many relationships between fact and dimension.

http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/multivalued-dimension-bridge-table/

Edit after example added to question:

OK, maybe it is not a bridge, the example changes my view.

A fundamental requirement of dimensional modelling is to correctly identify the grain of your fact table. A common example is invoice and line-item, where the grain is usually line-item.

Hypothetical examples are often difficult because you can never be sure that the example mirrors the real use case, but I think that your scenario might be search-and-criteria, and that your grain should be at the criteria level.

For example, your fact table might look like this:

fact_search (date_id,time_id,search_id,criteria_id,criteria_value)

Thinking about the types of query I might want to do against search data, this design is my best choice. The only issue I see is with the data type of criteria_value, it would have to be a choice/text value, and would definitely be non-additive.

Ron Dunn
  • 2,971
  • 20
  • 27
  • Wouldn't a bridge table involve more joins and be slower than just using a foreign key in the dimension table? The dynamic dimension table would be many-to-one with the fact table. What exactly are the pitfalls of using a foreign key in the dimension table? Thanks! – user1020872 Feb 12 '16 at 11:47
  • You're prematurely optimising, which a wise man once called the root of all evil :) If you use this method, your analytics will just work. EDIT: (that was a little glib) can you edit your question to provide some examples of data? It will help to explain the scenario. – Ron Dunn Feb 12 '16 at 11:50
  • So what you'd propose is a fact table that has several rows for each search (for example, if the user opted for three features, there'd be three rows for that search in the fact table)? Then to count searches, we'd have to count distinct in the fact table? Not sure we see how that's better than our original proposal of a foreign key in each dimension table. We're still confused. – user1020872 Feb 12 '16 at 16:42
  • It would get rid of the foreign key in the dimension table, but how is this schema practically/functionally better than our original schema? – user1020872 Feb 12 '16 at 16:50
  • If you do this, you are not building a star schema with facts and dimensions. There are rules in the methodology, designed for usability and accuracy, and your first suggestion breaks the rules. A dimension-fact relationship is 1:M, unless M:M through a bridge table. Off the top of my head, I can't tell you exactly what will go wrong, but I can guarantee you that _something_ will go wrong in the ETL and/or query usage. My best recommendation is that you list all the user queries, and test them against both designs. – Ron Dunn Feb 12 '16 at 22:36
  • Thanks for your advice. We are likewise very suspicious of straying from the rules. We're still trying to get some understanding of the problems we might create and how best to handle these "dynamic" tables. Thanks again! – user1020872 Feb 13 '16 at 06:38
  • @RonDunn even though this is fairly old now, I found it's similar to a challenge I've been facing, I couldn't make the exact analogy to my current problem. From what I understand, the bridge table would remove the uniqueness of my fact (if the uniqueness is only guaranteed by the dimensions), which makes me wonder if this is the way to go. I posted a question here (https://stackoverflow.com/questions/72560317/dealing-with-dynamic-dimensions-for-metrics-data-analytics), any inputs are appreciated, thanks! – lowercase00 Jun 13 '22 at 21:08