When hashing in staging, you can prepare your data to be split into Hubs, Satellites and Links. When loading data into the staging area, you only need to calculate the hashes once (e.g. CustomerHashKey, CustomerHashDiff, CustomerContractLinkHashKey). Besides those hashes, a Sequence number, a Load Date and a Source reference, no additional metadata is put into the staging area. This makes it clearer which metadata is added at staging.
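A minimal sketch of such a staging load in PostgreSQL-flavored SQL; the source and staging table names, the column names and the hashing conventions (MD5, ';' as delimiter, upper-casing and trimming) are illustrative assumptions, not a prescribed standard:

```sql
-- Hypothetical staging load: every hash is computed exactly once, right here.
-- Table/column names (Source.Customer, stg.Customer, CustomerNumber, ...) are assumed.
INSERT INTO stg.Customer (
    CustomerHashKey,
    CustomerContractLinkHashKey,
    CustomerHashDiff,
    SequenceNumber,
    LoadDate,
    RecordSource,
    CustomerNumber,
    ContractNumber,
    CustomerName,
    CustomerAddress
)
SELECT
    -- Hub hash key: hash of the business key
    UPPER(MD5(UPPER(TRIM(COALESCE(src.CustomerNumber, ''))))),
    -- Link hash key: hash of the concatenated business keys
    UPPER(MD5(UPPER(TRIM(COALESCE(src.CustomerNumber, ''))) || ';' ||
              UPPER(TRIM(COALESCE(src.ContractNumber, ''))))),
    -- Hash diff: hash of all descriptive attributes of the Satellite
    UPPER(MD5(UPPER(TRIM(COALESCE(src.CustomerName, ''))) || ';' ||
              UPPER(TRIM(COALESCE(src.CustomerAddress, ''))))),
    ROW_NUMBER() OVER (ORDER BY src.CustomerNumber),  -- Sequence number
    CURRENT_TIMESTAMP,                                -- Load Date
    'CRM.Customer',                                   -- Source reference
    src.CustomerNumber,
    src.ContractNumber,
    src.CustomerName,
    src.CustomerAddress
FROM Source.Customer AS src;
```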
In the next step, the loads from staging into the data warehouse simply SELECT the source data and put it into the desired structures (e.g. HubCustomer, SatCustomer, LinkCustomerContract). Since the hashes have already been generated once, no additional processing is needed during these loads. There is also only one place where hashes are generated - the staging processes - and all downstream processes can reuse them.
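Under the same assumed schema, a Hub load then becomes a plain SELECT from staging with no hashing at all; the Link and Satellite loads look analogous:

```sql
-- Hypothetical Hub load: the staged CustomerHashKey is reused as-is.
INSERT INTO dv.HubCustomer (CustomerHashKey, LoadDate, RecordSource, CustomerNumber)
SELECT DISTINCT
    stg.CustomerHashKey,
    stg.LoadDate,
    stg.RecordSource,
    stg.CustomerNumber
FROM stg.Customer AS stg
WHERE NOT EXISTS (               -- insert only business keys not loaded before
    SELECT 1
    FROM dv.HubCustomer AS hub
    WHERE hub.CustomerHashKey = stg.CustomerHashKey
);
```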
If you generated the hashes on the way from staging into the data warehouse instead, you might end up generating them multiple times (Hub, Satellite (+diff), Link), which can become a performance problem for Big Data use cases or machines with weak CPUs. Think of the HashDiff on Satellites: if the hashing happens in the staging area, no additional processing is needed when inserting into the data warehouse - something that could get expensive quickly when a table has a lot of columns.
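To make the HashDiff point concrete, here is a sketch of a Satellite delta load under the same assumed schema: change detection is a single comparison of the staged HashDiff against the latest stored one, with no re-hashing and no wide column-by-column comparison.

```sql
-- Hypothetical Satellite load: compare precomputed HashDiffs instead of re-hashing.
INSERT INTO dv.SatCustomer (CustomerHashKey, LoadDate, RecordSource,
                            CustomerHashDiff, CustomerName, CustomerAddress)
SELECT
    stg.CustomerHashKey,
    stg.LoadDate,
    stg.RecordSource,
    stg.CustomerHashDiff,
    stg.CustomerName,
    stg.CustomerAddress
FROM stg.Customer AS stg
LEFT JOIN dv.SatCustomer AS sat
       ON  sat.CustomerHashKey = stg.CustomerHashKey
       AND sat.LoadDate = (SELECT MAX(s2.LoadDate)      -- latest record per customer
                           FROM dv.SatCustomer AS s2
                           WHERE s2.CustomerHashKey = stg.CustomerHashKey)
WHERE sat.CustomerHashKey IS NULL                       -- new customer
   OR sat.CustomerHashDiff <> stg.CustomerHashDiff;     -- or changed attributes
```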
So I would always hash in the staging area if possible.