0

I would like to create a surrogate key for a hive table, but one that could be replicated every time the data was put in the table. Other tables would reference this table through the surrogate key, and the table could be regenerated to add more rows, and that association wouldn't be broken. My thought is to basically have a composite key of all columns in the table.

Is it reasonable to concatenate all of my columns and take the md5 hash of that string to use as an easy look-up to that row?

The problems that I see with this solution are:

  • If the data changes in the rows, the association will still be broken
  • There is no real guarantee that the hash values are unique (though with my numbers, collisions are very unlikely)

notes on the data:

  • The data is partitioned by day, and there are around 100k rows for each day.
  • There are cases that two rows have the exact same data and it's fine if they end up with the same key.
aaron
  • 854
  • 2
  • 12
  • 23

1 Answers1

1

You have answered your own question:

There is no real guarantee that the hash values are unique (though with my numbers, collisions are very unlikely)

Keys need to be unique, that's their purpose. If you give me a records key (be it surrogate or natural) I can find that record. Hashes are not going to be unique.

You need to go back and ask yourself WHY you want this surrogate key. If its just for a unique identifier then use your DB's unique identifier|sequence type and be done with it.

If there is a business requirement (The need to replicate the SK <- why?) then go back to that reason and try and come up with a more direct|simple solution for it.

(We tried hashes for type2 change detection - it did not work and we went back to column by column comparisons)

This concerns me:

There are cases that two rows have the exact same data and it's fine if they end up with the same key

If you have 2 records in your database that are exactly the same then you are missing data: even a sequence or timestamp, something that can be used to differentiate your records. If you don't have a natural key, you are probably missing something.

Joe
  • 1,327
  • 1
  • 10
  • 19