I'm looking into using the datavault 2.0 methodology. I understand the reasons for hashing and trying to apply it. I'd like to apply this in the "staging" phase of the datavault rather than loading it into the DV.
If a table has a business key in it, then it's easy to simply apply it to that table (probably becomes a hub). But there are tables like "orderdetail" (which probably become links) which have multiple references to other elements through a surrogate key.
Should the staging table contain both the surrogate sequence for each foreign key along with a hash for the referenced entity BK?
Example: If I have an order table with a customerId surrogate sequence, but the customer table has a CUST-000xxx reference that is used as the BK, should I perform a 'join' between order and customer to resolve the "CUST-000xxx" so I can hash it and include it in the order staging table?
I was thinking that this could potentially be resolved when loading the data in the DV from the staging area, but the customer reference may not exist in the staging area at that particular moment in time, because the order may simply be a new order for an existing customer that didn't change.
DV 2.0 specifies that all of this business with hashes is done to increase performance and simply load the data in parallel without expensive lookups in the DV itself. Hence the question how this is usually resolved.
Example added down here:
order - orderid - customerid - order_ref - salespersonid
customer - customerid - customer_ref
person - personid - full_name - login
In order to populate order, should I do a join in the source database like so:
SELECT
hash_func(o.order_ref) as hash_key_order,
hash_func(c.customer_ref) as hash_key_customer,
hash_func(p.login) as hash_key_person,
o.orderid,
c.customerid,
p.login
FROM
order o inner join customer c on o.customerid = c.customerid
inner join person p on o.salespersonid = p.personid
or is the resolution for foreign keys resolved in the datavault, so the query is simpler like:
SELECT
hash_func(o.order_ref) as hash_key_order,
o.orderid,
c.customerid,
p.personid
FROM
order o
This is not clear to me. What I understand is that expensive lookups are avoided through hashing, so not generating the hash on staging for foreign keys do not contribute to performance otherwise?