
I have a very large table containing duplicate name and address information. This table feeds a process that performs a task and appends results back to the table. I'd like to reduce the volume of what's fed into this process by creating a hash key from the name and address information. That way I can feed one record per hash key, reducing my input by 75%. I also need this key to be persistent over time.
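For concreteness, this is roughly what I have in mind; the table and column names below are placeholders, and I'm assuming SQL Server 2012+ for `HASHBYTES` with `SHA2_256` and `CONCAT`:

```sql
-- Placeholder schema: dbo.CustomerFeed(SurrogateID, Name, Address, ...).
-- Hash the name/address fields, then feed one representative row per hash.
WITH Hashed AS (
    SELECT *,
           -- The delimiter keeps ('ab','c') and ('a','bc') from hashing alike.
           HASHBYTES('SHA2_256', CONCAT(Name, N'|', Address)) AS AddressHash
    FROM dbo.CustomerFeed
),
Ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY AddressHash
                              ORDER BY SurrogateID) AS rn
    FROM Hashed
)
SELECT *
FROM Ranked
WHERE rn = 1;  -- one record per distinct name/address
```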

However, since this hash key would serve as the key on which I'd join my results table, I need it to be unique. I can create the hash key as a persisted computed column and give it a unique constraint, but I'm concerned about the admittedly tiny chance of collision. If two different name and address strings ever produced the same hash output, I would still need a unique key for each of them.
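Concretely, I'm picturing something like this (placeholder names again; `HASHBYTES` is deterministic, so the computed column can be PERSISTED, and with `SHA2_256` it returns a `VARBINARY(32)` that the results table's unique constraint would sit on):

```sql
-- Persisted computed hash column on the source table (placeholder names).
ALTER TABLE dbo.CustomerFeed
    ADD AddressHash AS
        HASHBYTES('SHA2_256', CONCAT(Name, N'|', Address)) PERSISTED;

-- The results table gets one row per hash, so its key must be unique.
-- Assumes dbo.ProcessResults already has an AddressHash VARBINARY(32) column.
ALTER TABLE dbo.ProcessResults
    ADD CONSTRAINT UQ_ProcessResults_AddressHash UNIQUE (AddressHash);
```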

Even if it's incredibly unlikely that this would ever happen, I'm not happy knowing that if it did, I wouldn't have a plan.

I've also considered using the table's surrogate ID and assigning every record within a name-and-address group the group's MIN(surrogateID). However, if the record corresponding to the MIN(surrogateID) for a given group is deleted, the group's ID changes.
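Sketched out, that approach is just this (placeholder names), and the key is only as stable as the row it was borrowed from:

```sql
-- One key per name/address group, borrowed from the smallest surrogate ID.
-- If the row holding MIN(SurrogateID) is later deleted, the next
-- recomputation silently hands the group a different key.
SELECT Name,
       Address,
       MIN(SurrogateID) AS GroupKey
FROM dbo.CustomerFeed
GROUP BY Name, Address;
```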

I could create a lookup table of distinct names and addresses, and give each a simple integer ID. But I'd prefer to avoid the storage cost.
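For reference, the lookup-table version I'm trying to avoid would look something like this (placeholder names and sizes; I've kept the column lengths small enough to stay under the 900-byte index key limit for the unique constraint):

```sql
-- One row per distinct name/address pair, with a stable integer key.
CREATE TABLE dbo.NameAddress (
    NameAddressID INT IDENTITY(1,1) PRIMARY KEY,
    Name          NVARCHAR(100) NOT NULL,
    Address       NVARCHAR(300) NOT NULL,
    CONSTRAINT UQ_NameAddress UNIQUE (Name, Address)
);

-- Insert only pairs not seen before, so existing IDs never change,
-- even if rows later disappear from the feed table.
INSERT INTO dbo.NameAddress (Name, Address)
SELECT DISTINCT f.Name, f.Address
FROM dbo.CustomerFeed AS f
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.NameAddress AS na
                  WHERE na.Name = f.Name
                    AND na.Address = f.Address);
```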

Are there any other options I might not be considering?

tad
  • If you created the separate lookup table of distinct names and addresses, would you not then remove those columns from the current table? Where would the additional storage cost come from, then? Besides, either way, I'd usually say that additional storage is sometimes a preferred outcome if it makes the approach easier (i.e., less prone to issues). – Craig Mar 23 '21 at 02:43
  • I agree; it sounds like the lookup table would actually reduce storage, since there would then be one row per address. Maybe there's a part of this we don't know about, like an external process that gives you all the addresses every time. Also, I'm pretty sure the chance of a NEWID() collision is mind-bogglingly small even with something like a billion rows, and I'd bet you don't have that many addresses. – Kevin UI Mar 23 '21 at 14:51
  • I receive this table from my customer in this structure. – tad Mar 23 '21 at 15:07

0 Answers