1

Are Surrogate vs Natural Primary Keys generally debated in the world of data warehouses? To be clear - the natural keys would be there regardless. And by surrogate keys, I mean keys that don't exist in the source system, but are created as part of the ETL of the datawarehouse.

Is it debated whether to rely on the source systems natural keys as primary keys, or to assign surrogate keys as part of ETL?

My (limited) understanding has always been that in operational systems - it could go either way depending on the situation/person, but that in a data warehouse setting - surrogate keys were the non-debated norm for the primary keys.

Accurate, or is it more debated than that?

bbb0777
  • 165
  • 14
  • "Surrogate" is used to mean a lot of things--please explain. Some people use "surrogage" for any value picked by the DBMS since an application implemented with a DB started, but some people restrict it to a value in a DB that is not seen by users of the implemented application. The former folks would call a driver's id a surrogate, but the latter folks would call it a non-surrogate & natural key. Seems like you mean the latter, considering a warehouse users are not application users. PS What has your research shown? See [ask] & the voting arrow mouseover texts. – philipxy Dec 04 '19 at 05:30
  • In this case, I meant surrogate to mean - a key that didn't come from the operational system at all, but was instead created in the data warehouse itself. Fair enough thanks - my research had just been reading (Kimball & Star Schema), which puts surrogates as a definitive thing to do in a warehouse. A co-worker however, hates using surrogates & stated it was debated. I Googled it after that, but the conversations seemed more in general, not specific to data warehouse, so I wasn't clear. – bbb0777 Dec 05 '19 at 00:32
  • Please clarify via edits, not comments. – philipxy Dec 05 '19 at 03:20

1 Answers1

2

Natural keys are virtually essential for almost any practical data warehouse solution (business key or domain key is really a much better term than natural key). The question is whether and when to use surrogate keys as well as, not instead of some other key. Managing surrogate keys can add a lot of complexity and some significant overhead so the best answer is "it depends...".

If your warehouse is based on a distributed write-once technology like HDFS, then surrogates would probably make no sense. If you are using some historical data capture mechanism like Microsoft's temporal tables or Oracle's flashback then you'll probably find no need for surrogates. If you are taking a temporal modelling approach based on 5NF or 6NF then you usually won't need surrogates either but you might want to use them for certain tables.

If you are following a template like Data Vault or Kimball's methods then maybe you'll want to use surrogates because that's what it says in somebody's book.

nvogel
  • 24,981
  • 1
  • 44
  • 82
  • True, thanks, I wasn't clear. The natural (or business) keys would be there no matter what. My question was if it was debated whether to rely on those keys, or to assign additional surrogate keys (to use as primary keys) as part of the ETL. – bbb0777 Dec 05 '19 at 00:35