0

I am reading about DW modeling and started wondering why surrogate keys are used at all?

I understand that sometimes business keys are not integers nthat makes the life (as well as joiing and indexing) harder. However, what I do not understand is why to solve kinda limitation of the DW or RDBMS by adding and extra column for managing unique identifiers?

Would it not be more appropriate that this kind of functionality would be transparent to DW/RDBMS users and the entry will get internal identifier from the system automatically? For example creating an SHA-1 digest of the entire row or a subset of it (those fields that can be represented in some kind of a textual format).

aviad
  • 8,229
  • 9
  • 50
  • 98
  • Surrogate keys are typically auto-incremented integers which are 32-bits long. Since you have to duplicate the row identifier as a foreign key where applicable, this generally saves a lot of space over duplicating long strings or multiple columns. – apokryfos Jul 21 '16 at 12:31
  • Understood. What I do not get is why the DW users have to worry about that and why it is not integrated functionality of DW or RDBMS software? – aviad Jul 21 '16 at 12:34
  • My guess is the RDBMS does not want to impose the extra space taken by surrogate keys in the cases where it is not necessary. What the RDBMS offers is a mechanism for the DB developer to define such keys where needed. – apokryfos Jul 21 '16 at 12:37
  • When using the Data Vault 2.0 methodology for the DWH, you actually create a hash digest, using the idea you provided. – tobi6 Jul 22 '16 at 19:18

1 Answers1

1

The reason to use surrogate keys is because you have control over the data warehouse but most likely do not have control over the source systems. Assumptions you make today about the stability of the natural keys can cause you problems in the future.

Issues you may run into by not using your own surrogate key:

  1. Large or complex natural key in source - As you already mentioned, the source system could be using a natural key that will not perform as well as a simple integer
  2. Natural key may be reused in source - I ran into an issue once where the source system would recycle keys starting from 1 again once the maximum value an integer can hold was reached (for the application this made sense). The data warehouse had to recognize that the repeated keys were brand new records.
  3. Mergers - Imagine two companies merging together. Each company has an Employee table with an auto incrementing integer used as the key. Each company will have an Employee #1. The DW warehouse will need a surrogate key to distinguish the two people who share the same ID.
Cory
  • 12,404
  • 7
  • 33
  • 28
  • Good points! However, all your points boil to what I have (partially) mentioned: make no assumptions about the keys used in source system and maintain the unique identifiers at the DB SW level. Would not that be the right way to handle the issue vs reinventing the wheel every sinle time? – aviad Aug 10 '16 at 05:46
  • Yes, make no assumptions about the keys in the source system and maintain unique identifiers at the DB level most likely using an auto incrementing integer field. I'm not sure what you mean by "reinventing the wheel every single time". – Cory Aug 10 '16 at 16:25
  • reinvent the wheel=letting DB user implement the unique ID to a record, instead of standardizing it in DB software. – aviad Aug 14 '16 at 21:35
  • Don't forget Type2 Slowly Changing Dimensions - if you are going the natural key route you automatically have +1 field in your Natural Keys. – Joe Aug 18 '16 at 03:35