Design a dimension with multiple data sources

Question

I am designing a few dimensions with multiple data sources and wonder what other people have done to align the multiple business keys per data source.

My Example: I have 2 data sources - the Ordering System and the Execution System. The Ordering system has details about payment and what should happen; the Execution System has details on what actually happened (how long it took etc, who executed on the order). Data from both systems is need to created a single fact.

In both the Ordering and Execution system they is a Location table. The business keys from both systems are mapped via an esb . There are attributes in both systems that make up the complete picture about a single location. Billing information is in the Ordering system, latitude and longitude are in the Execution system. And Location Name exists in both systems.

How do you design a SCD accomodate changes from both systems to the dimension?

We follow a fairly strict Kimball methodology - fyi, but I am open to looking at everyone's solutions.

Do you have a dimension record per source system, or do you merge the locations beforehand and only load one location? — Nick.Mc, Jul 26 '16 at 23:54
in staging I have two dimensions record for one location - one from each source system. Physically it is one location - I am not sure of the best practice how to handle it in the DW. Is it one dimension record with 2 surrogate business keys? Is it one record with a xref table with the surrogate business keys listed there? Or is it two dimension records? Or another way..? — tember, Jul 27 '16 at 17:35
I can't edit the comment... everywhere I said "surrogate business key" it should just say "business key" — tember, Jul 27 '16 at 17:58

score 1 · Accepted Answer · answered Jul 27 '16 at 22:49

Not necessarily an answer but here are my thoughts:

You've already covered the real options in your comment. Either:

A. Merge it beforehand

You need some merge functionality in staging which matches the two (or more) records, creates a new common merge key and uses that in the dimension. This requires some form of lookup or reference to be stored in addition to normal DW data

OR

B. Merge it in the dimension

Put both records in the dimension and allow the reporting tool to 'merge' it by, for example, grouping by location name. This means you don't need prior merging logic you just dump it in the dimension

However you have two constraints that I feel makes the choice between A & B clearer

Firstly, you need an SCD (Type 2 I assume). This means Option B could get very complicated as when there is a change in one source record you have to go find the the other record and change it as well - very unpleasant for option B. You still need some kind of pre-stored key to link them, which means option B is no longer simple

Secondly, given that you have two sources for one attribute (Location Name), you need some kind of staging logic to pick a single name when these don't match

So given these two circumstances, I suggest that option A would be best - build some pre-merging logic, as the complexity of your requirements warrants it.

You'd think this would be a common issue but I've never found a good online reference explaining how someone solved this before.

Thanks for the answer. I will handle it in staging, but I still don't know what value would be best as the business key in the dimension table. And, I would still love to see some examples of this being implemented - or what is considered a best practice. Because, I agree, it should be a common issue and there should be doc on how to best handle it.... — tember, Aug 03 '16 at 20:58
I'd love to see an example too. Really what you're doing is merging, and Kimball suggests generating a 'durable key'. The bottom section in this article is probably of interest to you: http://www.kimballgroup.com/2012/07/design-tip-147-durable-super-natural-keys/ I don't know if Kimball has all the answers but he's the only one posing possible solutions. — Nick.Mc, Aug 03 '16 at 23:45

score 0 · Answer 2 · edited Aug 01 '18 at 10:13

My thinking is actually very trivial. First you need to be able to conclude what is your master dataset on Geo+Location and granularity.

My method will be:

DIM loading

Say below is my target

Dim_Location = {Business_key, Longitude, Latitude, Location Name}

Dictionary

Business_key = Always maps to master record from source system (in this case it is the execution system). Imagine now the unique key from business is combined (longitude, latitude) for this table.

Location Name = Again, since we assume the "Execution system" is master for our data then it will host from Source="Execution System".

The above table is now loaded for Fact lookup.

Fact Loading

You have already integrated record between execution system and billing system. It's a straight forward lookup and load in staging since it exists with necessary combination of geo_location.

Challenging scenarios

What if execution system has a late arriving records on orders? What if same geo_location points to multiple location names? Not possible but worth profiling the data for errors.

Design a dimension with multiple data sources

2 Answers2

Linked