
I am building a data warehouse for my company. I recently realized that there are some holes (potentially very dangerous) in my SCD Type 2 dimension implementation, so I have to review it.

In an SCD Type 2 dimension table, the "fromdate" of the current record is the date it arrived in the data warehouse, or the date it replaced an older record; the "todate" is usually null, or the date a new record with the same natural key came in and replaced the old record.

Currently, when loading facts, I look up the surrogate key for each fact using the natural key plus the condition iscurrent = true or todate is null.
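For reference, here is a minimal sketch of that current-record lookup (Python with sqlite3; the table and column names are illustrative, not from my actual schema). It shows why the lookup ignores when the transaction actually happened:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    sk INTEGER PRIMARY KEY,   -- surrogate key
    natural_key TEXT,
    fromdate TEXT,
    todate TEXT,              -- NULL for the current record
    iscurrent INTEGER
);
INSERT INTO dim_customer VALUES
    (1, 'C001', '2019-10-01', '2019-10-30', 0),  -- old version
    (2, 'C001', '2019-10-30', NULL,        1);   -- current version
""")

# Current-record lookup: always resolves to the latest version,
# no matter when the fact's transaction actually occurred.
sk = conn.execute(
    "SELECT sk FROM dim_customer WHERE natural_key = ? AND iscurrent = 1",
    ("C001",),
).fetchone()[0]
print(sk)  # -> 2, even for a transaction dated 2019-10-15
```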

I just realized that this doesn't guarantee the correctness of the surrogate key for a fact. For example:

  1. What if the change happened at 11:00 AM? Half of the transactions that occurred during that day are related to the old dimension record, and half to the new one. But when the data comes to the data warehouse, all of that day's transactions will be treated as related to the new dimension record, which is not correct.

  2. If we use the datetime of the transactions to get the surrogate key more precisely, then when loading fact records, every transaction that occurred before the day the dimension arrived in the data warehouse will fail to find any dimension surrogate key. For example: I created the dimension table yesterday, so every start-date in that SCD Type 2 table has a minimum value of yesterday, while nearly all of the old transactions (which haven't been loaded yet) happened before that day. They will have no surrogate key: a paradox.

  3. I even tried to make the start-date of a row more precise by passing through the create date of that dimension row from the OLTP system. But I still cannot find a fully correct way to do it: for one thing, the datetimes in the data warehouse and the OLTP system differ (they might belong to different GMT+X offsets)...

  4. And many other problems...
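Problem 1 above can be made concrete with a small runnable illustration (all data assumed for the example): if the dimension's validity range is stored at date grain, every transaction on the change day resolves to the new record, including the ones that happened before the change:

```python
from datetime import datetime, date

# Assumed history: the change really happened at 11:00 on 2019-11-01,
# but only the date is recorded in the dimension.
versions = [
    {"sk": 1, "fromdate": date(2019, 10, 1), "todate": date(2019, 11, 1)},
    {"sk": 2, "fromdate": date(2019, 11, 1), "todate": date(2099, 1, 1)},
]

def lookup(tx_ts: datetime) -> int:
    """Return the surrogate key whose [fromdate, todate) range covers the
    transaction date. Date grain discards the time of day."""
    d = tx_ts.date()
    for v in versions:
        if v["fromdate"] <= d < v["todate"]:
            return v["sk"]
    raise LookupError(d)

# 09:00 happened before the 11:00 change, yet still maps to the new record:
print(lookup(datetime(2019, 11, 1, 9, 0)))   # -> 2 (wrong; should be 1)
print(lookup(datetime(2019, 11, 1, 14, 0)))  # -> 2 (correct)
```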

I understand that if we want to track history with perfect accuracy, the only way is to implement it in the OLTP system by writing the related entity directly onto the transaction records; the data warehouse cannot do it alone. But I still feel that there are too many holes in the SCD Type 2 concept, or that I didn't implement my SCD Type 2 system correctly. So please tell me whether the above problems are normal, or point out the mistakes in my understanding.

Nick.Mc
Pblade

1 Answer

  1. If time matters, use a datetime, not a date. But first consider whether time matters.

  2. Again, solved by use of a datetime.

  3. Decide what timezone your data warehouse is in:

    • UTC
    • Source System
    • Local system
    • A timezone aware data type
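Whichever option you pick, the key step is normalizing at load time. A sketch of the UTC option (the timezone name and the assumption that the OLTP system stores naive local timestamps are both illustrative; `zoneinfo` requires Python 3.9+):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib, Python 3.9+

# Assumed: the OLTP system records naive local timestamps in GMT+7,
# while the warehouse stores everything in UTC.
oltp_tz = ZoneInfo("Asia/Bangkok")  # GMT+7, illustrative choice

def to_warehouse_utc(naive_local: datetime) -> datetime:
    """Attach the source system's timezone, then convert to UTC."""
    return naive_local.replace(tzinfo=oltp_tz).astimezone(timezone.utc)

ts = to_warehouse_utc(datetime(2019, 11, 1, 9, 0))
print(ts.isoformat())  # -> 2019-11-01T02:00:00+00:00
```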

Just a note: I suggest you use 2099-01-01 rather than NULL as the end date for the current record. Then you can easily use BETWEEN when searching for a matching dimension member.

  4. You'll need to be more specific.

Edit:

One observation based on the comments so far: don't use Is_Current to look up the surrogate key; use the business key in the transaction and the transaction datetime between the dimension start and end dates.

This means you can reload data from three months ago and it will pick up the correct dimension member (not the current one)

This reinforces my other comment: do not use NULL for the active record's end date. Instead use a datetime far in the future, so you can always use BETWEEN these dates and get a result.
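Putting the two suggestions together (business key plus transaction datetime against the validity range, far-future end date instead of NULL), a sketch with illustrative names, again as executable SQL via sqlite3. Note it uses a half-open range (`>= startdate AND < enddate`) rather than BETWEEN, a deliberate choice so a transaction exactly at a changeover instant matches only one version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    sk INTEGER PRIMARY KEY,
    natural_key TEXT,
    startdate TEXT,
    enddate TEXT   -- '2099-01-01' for the active record, never NULL
);
INSERT INTO dim_customer VALUES
    (1, 'C001', '2019-08-01 00:00:00', '2019-11-01 11:00:00'),
    (2, 'C001', '2019-11-01 11:00:00', '2099-01-01 00:00:00');
""")

def surrogate_key(natural_key, tx_datetime):
    """Resolve the dimension version that was in effect at tx_datetime.
    ISO-formatted datetime strings compare correctly as text."""
    row = conn.execute(
        """SELECT sk FROM dim_customer
           WHERE natural_key = ?
             AND ? >= startdate AND ? < enddate""",
        (natural_key, tx_datetime, tx_datetime),
    ).fetchone()
    return row[0] if row else None

# Reloading three-month-old facts finds the historical version:
print(surrogate_key("C001", "2019-09-15 10:30:00"))  # -> 1
# A fact after the change finds the new version:
print(surrogate_key("C001", "2019-11-01 14:00:00"))  # -> 2
```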

  • Date or datetime doesn't matter, because the start date of a new row in the dimension table depends on the data warehouse, while the date of a transaction depends on the OLTP system. E.g.: if the sync interval of the data warehouse is 2 days, then the start date of the new dimension row will be day 2, but in fact it was actually created on day 1, and many related transactions already happened on day 1. So if you use the time of the transaction to track, all transactions on day 1 will lose their dimension. If you use the current record, half of the transactions will be wrong. – Pblade Nov 01 '19 at 02:27
  • _Firstly_: don't use is_current to look up the surrogate key. Use the business key in the transaction you are loading, and the datetime of the transaction _between_ the SCD start and end date. This means you can load transactions from any time and it will pick the correct surrogate key. – Nick.Mc Nov 01 '19 at 06:46
  • Secondly: unless you have the 'true' datetime that the dimension member was "effective from" in the source system, then there's not much you can do. If you read the source system every five days, and there is no indicator in the source system as to when the dimension (say: location) became active then there isn't anything you can do. Worse is if a record changed twice in those five days and you missed the first change altogether. – Nick.Mc Nov 01 '19 at 06:48
  • So you mean I should use the created/updated date of that row in the OLTP system to populate the startdate column of the SCD Type 2 table, right? And if I can't get that information, I should give up on making an SCD Type 2 table, right? The thing is that the Kimball Group guide I read suggested using the date the row came into our DW (Design Tip 107 - convert(char(10), getdate()-1, 101)); even SSIS's SCD Type 2 function uses this date to build the SCD Type 2 table. But this approach can never achieve true correctness, because of the sync interval of the data warehouse. – Pblade Nov 01 '19 at 07:22
  • Don’t give up. Just recognize this issue and communicate to stakeholders as a risk. I strongly suggest that you do not use the SSIS SCD component. – Nick.Mc Nov 01 '19 at 07:28
  • You should always refresh from source as frequently as possible. I've never heard of a system that refreshed every two days. One day maximum. – Nick.Mc Nov 01 '19 at 07:29
  • Then in the end, what datetime would you suggest I put in the startdate of the DWH SCD Type 2 tables: the OLTP created/updated datetime, or the date the row arrives in the data warehouse? If you suggest the date from the OLTP, and the OLTP doesn't have it, is the data warehouse date still legitimate to use? If it's valid, then I should adjust the sync interval to overcome the problem, right? I am afraid that a short sync interval may burden the OLTP system, which is why I want a longer one. – Pblade Nov 01 '19 at 08:23
  • You answered your own question there. Use the source system if possible. If that doesn't exist, use the ingestion datetime. Try and reduce latency from the source system – Nick.Mc Nov 01 '19 at 10:08
  • There are many ways to reduce load on the source system, e.g. use incremental loads, or use a read-only replica. It's actually better to have regular managed scheduled extracts from source systems than unmanaged expensive ad hoc queries against them. – Nick.Mc Nov 01 '19 at 13:37