0

Where I work we initialize ("INIT" => truncate & load) a Fact\ Dimension table on the rare occasions when the need arise.

Such "INIT" requires all object referencing the initialized object to be initialized subsequently in order to preserve RI (Referential Integrity).

Example – we found a bug in an attribute rich (~25) SCD Dim_Employees which will change the effective dates of each record. This requires all referencing objects to recalculate their Foreign Keys.

Do you have the same situation and if so how do you manage it ?

1 Answers1

0

We have redimensioned our data multiple times as we change from SCD2/SCD1 to SCD6 or found things wrong with one of the data streams, like you mention.

It's not too hard to remap your data, you just need to make before you truncate you data in your INIT step to clone it, (or clone within your time travel window) then you join your fact table to the old dimension on the old dimension key, and then join to you new dimension via the old dimension foreign key and time, now you know the old key to new key mapping. Now that becomes the source of your update, and if you do it one operate with all other ETL operations paused, you have maintained your data integrity.

If you cannot pause the live ETL processes, you will a multi step update where you have your new divisional table and old one live, and do your normal mapping to both, and fix your reads to coalesce the results, and then once your new facts are correctly mapping to new dimension, turn around and back-fill all the old facts with ND keys, and then you have no ND gaps, then you can stop the OD/ND coalesce, and then stop the OD mapping and drop the OD column...

Simeon Pilgrim
  • 22,906
  • 3
  • 32
  • 45
  • Thanks Simeon ! I'm going over your proposed solution and it seems that performing an INIT on a key Dimension like DIM_CUSTOMER requires manually handling at least several dozens FACT tables. This is a huge risk and time consuming. Do you have some kind of logic built into existing ETL maps to address this situation ? – Expialidoshes Mar 25 '19 at 14:25
  • 1
    Remapping is expensive, we go the first route, our ETL orchestration tooling (in house) spins up a extra-large warehouse for these tasks (or larger is we deem it needed) and that update task blocks normal data ingress. Thus we introduce some lag/latency for some data, but have a simpler process, and our development team release window is the down time for our manager customer geographies. – Simeon Pilgrim Mar 26 '19 at 20:25