0

Im learning dimensional modeling and Im trying to create a model. I was thinking about a social media platform which rates hotels. The platform has following data:

  • hotel information: name and address
  • a user can rate hotels (1-5 points)
  • a user can write comments
  • platform stores the date of the comments
  • hotel can answer via comment and it stores the date of it
  • the platform stores the total number of each rating level (i.e.: all rates with 1 point, all rates with 2 point etc.)
  • platform stores information of the user: sex, name, total number of votes he/she made and address

First, I tried to define which information belongs to a dimension or fact table
(here I also checked which one is additive/semi additive/non-additive)

I realized my example is kind of difficult, because it’s hard to decide if it belongs to a fact table or dimension.

I would like to hear some advice. Would someone agree with my model?

This is how I would model it:

  • Hotel information -> hotel dimension
  • User rating -> additive fact – because I can aggregate them with all dimensions
  • User comment -> semi additive? – because I can aggregate them with the date dimension (I don’t know if my argument is correct, but I know I would have new comments every day, which is for me a reason to store it in a fact table
  • Answer as comment -> same handling like with the user comments
  • Date of comment-> dimension
  • Total Number of all votes (1/2/3/4/5) -> semi-additive facts – makes no sense to aggregate them, since its already total but I would get the average
  • User information sex and name, address -> user-dimension
  • User Information: total number of votes -> could be dimension or fact. It depends how often it changes. If it changes often, I store it in a fact. If its not that often, then dimension

I still have question, hope someone can help me:

My Question: should I create two date dimensions, or can I store both information in one date dimension?

2nd Question: each user and hotel just have one address. Are there arguments, to separate the address dimension in a own hierarchy? Can I create a 1:1 relationship to a user dimension and address dimension?

qwerty_so
  • 35,448
  • 8
  • 62
  • 86
yellow days
  • 1,053
  • 2
  • 9
  • 11

1 Answers1

1

For your model, it looks well considered, but here are some thoughts:

User comment (and answers to comments): they are an event to be captured (with new ones each day, as you mention) so are factual, with dimensionality of the commenter, type of comment, date, and the measure is at least a 'count' which is additive. But you don't want to store big text in a fact so you would put that in a dimension by itself which is 1:1 with the fact, for situations where you need to query on the comment itself.

Total Number of all votes (1/2/3/4/5) are, as you say, already aggregates, mostly for performance. Totals should be easy from the raw data itself so maybe not worthwhile to store them at all. You might also consider updating the hotel dimension with columns (hotel A has 5 '1' votes and 4 '2' votes) that you'd update as you go on, for easy filtering and categorisation.

User Information: total number of votes: it is factual information about a user (dimension) and it depends on whether you always just want to 'find it out' about a person or whether you are likely to use it to filter other information (i.e. show me all reviews for users who have made 10-20 votes). In that case you might store the total in the user dimension (and/or a banding, like 'number of reviews range' with 10-20, 20-30). You can update dimensions often if you need to, but you're right, it could still just live as a fact only.

As for date dimensions, if the 'grain' is 'day' then you only need one dimension, that you refer to from multiple facts.

As for addresses, you're right that there are arguments on both sides! Many people separate addresses into their own dimension, referred to from the other dimensions that use them. Kimball suggests you can do that behind the scenes if necessary, but prefers for each dimension to have its own set of address columns(but modelled as consistently as possible).

Rich
  • 2,207
  • 1
  • 23
  • 27
  • thanks for the detailed explanation. Your approach is very interesting! I still have questions about your modeling: 1. User Comment: Why not add large texts to the fact table? Is it because of the performance reasons? (for example: fact table load longer?) Or are there other reasons? Because the text length would be long in both tables (fact and dim) – yellow days Mar 02 '18 at 11:28
  • 2. Total Number of all votes: with taking into consideration, do you mean such a dimension table? DIM Hotel with following columns: name | address | 5_Points | 4_Points | 3_Points etc.? Do you mean by updating that the numbers will be replaced or adding new ones, if a number changes? If the value changes often, should not it be added to a fact table? – yellow days Mar 02 '18 at 11:32
  • SO isn't a good place for a long conversation about multiple aspects. – Rich Mar 02 '18 at 11:32
  • so to keep this short, 1) yes text fact should be avoided for performance of the fact. Fact tables should be kept to surrogate keys and measures. And 2) Yes, that was the kind of thing (but depends on your needs). Yes, the numbers will be replaced. You should have fact tables, but depending on your needs, you could also/instead have stats on dimensions for querying too. As always the answer is 'it depends' and normally this depends on how best to meet querying requirements. I can recommend learning from Kimball resources. – Rich Mar 02 '18 at 11:39