0

I need to anonymize personal data in our MySQL database. The problem is that I still need to be able to link two persons together after they have been anonymized.

I thought this could be done by hashing their social security number or e-mail address, which leads to my question:

When hashing two equal strings (s1 and s2) I get two hash values (h1 and h2). How sure can I be that:

1) the hash values are equal (h1 = h2)

2) no unequal string (s3 ≠ s1) will produce the same hash value

Muleskinner
  • A hashing function can in theory have a collision. Towards this point, using the SSN of each user as a unique identifier already seems sufficient to me. Why do you need to hash in order to compare users? – Tim Biegeleisen Dec 07 '17 at 09:05
  • Your requirements are the requirements of a cryptographic hash function. If two equal strings didn't output the same hash value, you'd have a broken hash function. In the same vein, if two unequal strings made the same hash (called a collision), again you have a broken hash function. – Robbie Dec 07 '17 at 09:08
  • @Robbie Most of the time I think you'd assume a collision is at least remotely possible and just plan for such a condition. – Tim Biegeleisen Dec 07 '17 at 09:11
  • @TimBiegeleisen The same user (same SSN) can have multiple entries in the same db table. For statistical reasons I need to be able to connect those even after the SSN has been anonymized - that's why I thought about hashing the SSN – Muleskinner Dec 07 '17 at 09:54

2 Answers

2

1) Same strings will always produce equal hash values
2) Different strings can in theory produce the same hash, and the odds grow if the hash is short compared to the data volume. With the usual digest lengths (32 or 40 hex characters, as produced by MD5 and SHA-1), accidental collisions won't be a practical problem.
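
To illustrate point 1, here is a minimal Python sketch. It assumes SHA-256 from the standard `hashlib` module (a longer digest than the 32 or 40 characters mentioned above), and the helper name `pseudonymize`, the sample addresses, and the trim-and-lowercase normalization are all hypothetical choices for the example, not something taken from the question.

```python
import hashlib


def pseudonymize(identifier: str) -> str:
    """Return the hex SHA-256 digest of a normalized identifier.

    Hypothetical helper: the normalization below is an assumption.
    """
    # Assumption: "A@x.com " and "a@x.com" should count as the same person,
    # so trim whitespace and lowercase before hashing.
    normalized = identifier.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


h1 = pseudonymize("jane.doe@example.com")
h2 = pseudonymize("jane.doe@example.com")
h3 = pseudonymize("john.doe@example.com")

print(h1 == h2)  # True: the same input always gives the same digest
print(h1 == h3)  # False for these two inputs
print(len(h1))   # 64 hex characters for SHA-256
```

Rows that carry the same identifier end up with the same digest, so they can still be grouped or joined on the hashed column. MySQL's `SHA2(str, 256)` should give the same hex string for the same bytes, but that depends on the column's character set, so it is worth checking on real data.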

Edgars T.
1

1) (h1 = h2) is always true for equal strings (s1 and s2) by definition, when using a correct hash function.

2) Two different strings can have the same hash value. This is called a "collision". The probability depends on the hash function used and the length of the resulting hash. For MD5, for example, there are websites and tables for finding collisions, which is quite interesting.
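
How the probability depends on the hash length can be estimated with the birthday approximation, p ≈ 1 - exp(-n(n-1) / (2 · 2^b)) for n distinct inputs and a b-bit hash. The sketch below assumes the hash behaves like a uniform random function and only covers accidental collisions, not deliberately crafted ones; the function name and the record count are made up for illustration.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that any two of n_items distinct inputs collide.

    Birthday approximation, assuming uniformly distributed hash values.
    """
    pairs = n_items * (n_items - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-pairs / 2 ** hash_bits)


records = 10_000_000  # hypothetical number of distinct persons
for name, bits in [("32-bit hash", 32), ("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)]:
    print(f"{name:12s} {collision_probability(records, bits):.3e}")
```

With a 32-bit hash a collision among ten million inputs is practically certain, while for 128 bits and more the estimate is vanishingly small.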

I'm not sure what you mean by linking persons together or what your requirements are, so I cannot help you with that. But you could link two persons together with their ids.

Patrick Adler