0

I am currently designing a star schema for a reporting database where an online product's performance is measured. The challenge is, that I receive information which is in principle measuring the same facts (visits, purchases) and has the same dimensions (user gender, user age, day) but with varying granularity depending on the source, for example, given a total of 10 visits:

  • Source A returns a single line per day for the performance in the format:
    1. Visits, Purchases, Gender, Age Range, Day (total visits = 15)
  • Source B returns two lines for a single day as it does not allow the combination of gender and age:
    1. Visits, Purchases, Gender, Day (total visits = 10)
    2. Visits, Purchases, Age Range, Day (total visits = 10)

The issues is, if I store them in the same fact table, I will have incorrect values when applying aggregate functions:

Day Visits Age Gender Source
19/04/2022 5 18-24 Male A
19/04/2022 10 18-24 Female A
19/04/2022 2 NULL Male B
19/04/2022 8 NULL Female B
19/04/2022 10 18-24 NULL B

(The sum of the visits column would count 20 for source B even though we only have 10 visits for this source, they just appear double due to the different data structure)

Is there a best practice for cases where dimensions and facts are generally the same, but the raw data granularity is different?

  • So for source B, does 1+2 = the true total visits, or is it 2 x visits?. Either of those are pretty wierd outcomes. If 1+2 = total visits, then the most straightforward solution is to assign a value of "unknown" to the missing dimension. This rolls up correctly and truly represents the data. – Nick.Mc Apr 20 '22 at 00:12
  • For source B, the true total visits are 10 in my example (assuming this is all the data we have and only 18-24 year olds have visited). Source B does not give us true "raw" data but pre-aggregated, so all the gender-related data will sum up to 10 and all the age-related data will sum up to 10, the total number of visits as well. – Muellllkreis Apr 20 '22 at 07:42
  • I've updated the question to clarify this. – Nick.Mc Apr 20 '22 at 08:09

1 Answers1

1

Is there a best practice for cases where dimensions and facts are generally the same, but the raw data granularity is different?

You typically can only present the combined data at a grain that's compatible with all the sources, so (Day), (Age,Day), or (Gender,Day).

Alternatively you could "allocate" the Source B data, say applying the gender split for the day to each age group. The totals would work, but the drilldown wouldn't be meaningful.

David Browne - Microsoft
  • 80,331
  • 6
  • 39
  • 67
  • Could you elaborate a bit on the second part of your answer, I am not sure I understand. You suggest to, for example, break down the last line in my example to two lines: 18-24, Male, 2 Visits and 18-24, Female, 8 Visits? – Muellllkreis Apr 20 '22 at 07:46
  • Yes. That’s the idea. If there was another age group each would be 1 Male, 4 Female. – David Browne - Microsoft Apr 20 '22 at 11:38
  • 1
    Okay, thank you. I think in terms of my question (re "best practices") the first part of your answer seems to be the most relevant, i.e. "only store data at the same grain in one table if you want to be coherent". – Muellllkreis Apr 21 '22 at 07:33