Data vault: Hash keys in staging table - advanced

Question

I'm looking into using the datavault 2.0 methodology. I understand the reasons for hashing and trying to apply it. I'd like to apply this in the "staging" phase of the datavault rather than loading it into the DV.

If a table has a business key in it, then it's easy to simply apply it to that table (probably becomes a hub). But there are tables like "orderdetail" (which probably become links) which have multiple references to other elements through a surrogate key.

Should the staging table contain both the surrogate sequence for each foreign key along with a hash for the referenced entity BK?

Example: If I have an order table with a customerId surrogate sequence, but the customer table has a CUST-000xxx reference that is used as the BK, should I perform a 'join' between order and customer to resolve the "CUST-000xxx" so I can hash it and include it in the order staging table?

I was thinking that this could potentially be resolved when loading the data in the DV from the staging area, but the customer reference may not exist in the staging area at that particular moment in time, because the order may simply be a new order for an existing customer that didn't change.

DV 2.0 specifies that all of this business with hashes is done to increase performance and simply load the data in parallel without expensive lookups in the DV itself. Hence the question how this is usually resolved.

Example added down here:

order - orderid - customerid - order_ref - salespersonid

customer - customerid - customer_ref

person - personid - full_name - login

In order to populate order, should I do a join in the source database like so:

SELECT
   hash_func(o.order_ref) as hash_key_order,
   hash_func(c.customer_ref) as hash_key_customer,
   hash_func(p.login) as hash_key_person,
   o.orderid, 
   c.customerid,
   p.login
FROM
   order o inner join customer c on o.customerid = c.customerid
   inner join person p on o.salespersonid = p.personid

or is the resolution for foreign keys resolved in the datavault, so the query is simpler like:

SELECT
   hash_func(o.order_ref) as hash_key_order,
   o.orderid, 
   c.customerid,
   p.personid
FROM
   order o

This is not clear to me. What I understand is that expensive lookups are avoided through hashing, so not generating the hash on staging for foreign keys do not contribute to performance otherwise?

Seems to me that this is more of a business rule problem than loading. If this is the case, how is the problem with the empty reference solved right now in the real business? How do they handle empty keys with new orders or unchanged customer. Seems to me that only a part of Data Vault is being applied. Maybe expand the question with actual row data to make it more clear. — tobi6, Feb 14 '18 at 15:17

score 1 · Answer 1 · answered Feb 16 '18 at 17:11

1

I am not really sure why there is a complex SELECT statement to populate order. Also I think there might be some mixup regarding the paradigmas of Data Vault. What you want to do with Data Vault is to read all data from the source systems.

This means first you would load the table Order into the core DWH which seems to be modelled with Data Vault. Then you would do the same with Customer, Person and so on. Up until all data which you need later for your statment to work is in the core DWH.

Every entity would have it's own hash key depending on the entity. E.g. for the Order table this might be the id.

Now, when everything is loaded into Data Vault, you can re-create your business rule on top of the data. Meaning, if you use surrogate keys, you might need to recreate them. If the surrogate keys are created beforehand in the database and they are of business value, take them.

But this depends on the use of the ids. As I commented before, you first need to know how the business handles the cases you provided. Then you load every data on it's own into the Data Vault. And then you go on an recreate the statement you have added as an example.

So:

Copy the data
Recreate the business rules (what happens if there is an order without a customer)
Create views / persistant tables
Use the data

answered Feb 16 '18 at 17:11

tobi6

8,033
6
26
41

Thank you. (I re-edited the 2nd query that was incorrect). What puzzles me in the dv 2.0 technique is that hash keys are stated as a way to improve performance, but in the loading chapter everything is still looked up by BK anyway. The example in the book just isn't very good because it's derived from a flat wide table and therefore doesn't concern itself with foreign keys (and joins resulting from that?) which you'd have in typical 3NF data. I think that's a shortcoming of the book really. – radialmind Feb 18 '18 at 09:33
If you mean the book by Linstedt, Olschimke then I think the business keys are printed there only for the examples. Looking at the statements, every entity is being loaded itself. So none of your statements are correct in a DV2.0 sense, but you should load every entity on its own. Foreign keys will be solved in Links. – tobi6 Feb 19 '18 at 09:27
Sure, but I think that's a huge design gap? It's easy to load hubs and tables that have very clear business keys, but links don't have business keys associated with them like hubs in the datavault methodology? I.e., the book shows hash keys on link tables only, no surrogate sequences or natural keys to look up from those tables. I wrote more about three possibilities to generate hashes here: https://gtoonstra.github.io/etl-with-airflow/datavault2.html#data-vault-loading-flow . Let me know your thoughts... – radialmind Feb 20 '18 at 19:47
1

I am still not sure we are on the same level here. Links do have business keys associated - the hash keys. If you need natural keys, you either join the hub (which is a good idea since the natural key in the hub should be indexed) or add a column with the natural key. Surrogate keys can - again - be build like in your business. Like in the business, the surrogate keys are build after a rule. This rule can be implemented on top of the data. And you can build a surrogate link with that rule. – tobi6 Feb 21 '18 at 08:24

score 0 · Answer 2 · answered Feb 26 '18 at 05:20

0

The problem is that you are not exporting your BK. You are exporting the surrogate keys. Change o.orderid to o.order_ref etc in your queries and everything should fit into place. Unfortunately people don't understand the reasons behind ID's. They are an internal database element used for performance and management purposes, and have NOTHING to do with the BUSINESS.

pcd

answered Feb 26 '18 at 05:20

pcd

31
1

Well said. I am still not sure if the idea of the OP is only half of what the Data Vault methodology recommends. – tobi6 Feb 28 '18 at 14:51
1

It is frustrating to see database developers / modellers start with using an ID identity column in every table without having any idea why, then to cap it off they fail to please a unique constraint on the natural key of the table (and the other canddate keys). Why do people insist on performing a 'select * from ' and then thinking that they can use the ID's as the hub key. Please take time to understand the methodology, take a DV modelling course, either one, it does not matter, but develop a proper understanding of how Data Valt works.
– pcd Mar 02 '18 at 00:39

Data vault: Hash keys in staging table - advanced

2 Answers2