I have a very large table that contains duplicate name and address information. This table feeds a process that performs a task and appends its results back to the table. I'd like to reduce the volume of what's fed into this process by creating a hash key on the name and address information; that way I can feed one record per hash key, reducing my input by 75%. I also need this key to be persistent over time.
However, since this hash key would serve as the key on which I'd join my results table, I need it to be unique. I can create the hash key as a persisted column and give it a unique constraint, but I'm concerned about the admittedly tiny chance of a collision. If two different name-and-address strings ever produced the same hash output, I would still need a distinct key for each of them.
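To make that concrete (assuming SQL Server; the table and column names below are just placeholders), this is roughly what I'm picturing: a persisted computed column built with HASHBYTES over the normalized name and address, with the one-row-per-hash results table carrying the key constraint:

```sql
-- Rough sketch only; table/column names are placeholders.
-- Persisted, deterministic hash of the normalized name + address.
ALTER TABLE dbo.SourceTable
    ADD NameAddressHash AS
        CAST(HASHBYTES('SHA2_256',
                 CONCAT(UPPER(LTRIM(RTRIM(FullName))), N'|',
                        UPPER(LTRIM(RTRIM(FullAddress)))))
             AS binary(32))
        PERSISTED;

-- The results side holds one row per hash, so the unique/primary key
-- constraint lives there rather than on the (duplicated) source rows.
CREATE TABLE dbo.ProcessResults
(
    NameAddressHash binary(32) NOT NULL PRIMARY KEY,
    ResultValue     nvarchar(200) NULL
);
```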
Even if it's incredibly unlikely that this would ever happen, I'm not happy knowing that if it did, I wouldn't have a plan.
I've also considered using the table's surrogate ID and assigning every record within a given name-and-address group the MIN(SurrogateID) for that group, as sketched below. However, if the record holding the MIN(SurrogateID) for a group were deleted, that group's ID would change.
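For what it's worth, the grouping I had in mind looks something like this (same placeholder names as above):

```sql
-- Derive a group key as the smallest surrogate ID in each name/address group.
SELECT  t.SurrogateID,
        g.GroupID
FROM    dbo.SourceTable AS t
JOIN   (SELECT FullName, FullAddress, MIN(SurrogateID) AS GroupID
        FROM   dbo.SourceTable
        GROUP BY FullName, FullAddress) AS g
        ON  g.FullName = t.FullName
        AND g.FullAddress = t.FullAddress;
```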
I could also create a lookup table of distinct names and addresses and give each combination a simple integer ID, but I'd prefer to avoid the storage cost.
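That lookup-table option would be something along these lines (again, placeholder names and sizes):

```sql
-- One row per distinct name/address pair, keyed by a stable integer ID.
CREATE TABLE dbo.NameAddressKey
(
    NameAddressID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    FullName      nvarchar(100) NOT NULL,
    FullAddress   nvarchar(300) NOT NULL,
    CONSTRAINT UQ_NameAddressKey UNIQUE (FullName, FullAddress)
);
```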
Are there any other options I might not be considering?