Data ID pseudonymization

Question

I need to pseudonymize IDs in a dataset in order to comply with GDPR. The IDs in question are integers from 0 to 10^7. I am looking for an elegant way to achieve this. The process must be repeatable and easily transferable, so I would like to avoid adding random seeds. I would also like to avoid a lookup table. In the end, I am looking for an elegant function that transforms these numbers in a non-trivial way, so that a person who manages to identify one or two (ID, pseudonymized ID) pairs from the data cannot guess the function.

So far, I have come up with splitting the number in two, adding a different constant to each of the two parts, followed by a modulo operation and recombining the two parts into one new ID. I am hoping that someone here will suggest a better approach.

Edit: The IDs in the database are not static; some are removed while others are added.

I assume you'd like to find a permutation P from the set A to A, where A = {0, 1, 2, ..., 10^7 - 1}, such that P^(-1) is easy for your code to compute but hard for others to compute. This is a cryptographic problem. You should read about [Feistel ciphers](https://en.wikipedia.org/wiki/Feistel_cipher), particularly [Unbalanced Feistel ciphers](https://en.wikipedia.org/wiki/Feistel_cipher#Unbalanced_Feistel_cipher), and [Format-preserving encryption](https://en.wikipedia.org/wiki/Format-preserving_encryption). Your current design *sounds* like it already incorporates some of these concepts. — President James K. Polk, Jun 18 '23 at 20:46

score 0 · Answer 1 · answered Jun 18 '23 at 21:27

I’m not sure there’s any feasible way to do this without having some sort of secret information kept separately.

Here’s why. Your IDs range from 0 to 9,999,999, inclusive. If you use any fixed hashing algorithm to map those IDs to hashes, it would be very computationally easy for an attacker to simply compute the hash of each of the ten million possible IDs, then cross-compare those hash outputs against the hashes of the real IDs to determine what those IDs are. (You could conceivably make this harder for an attacker to do by making your hash function very, very hard to compute, but then it’s not very useful for your own purposes.)
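To make the attack concrete, here is a minimal sketch of the precomputation described above. It assumes, purely for illustration, that the fixed mapping is SHA-256 of the decimal ID; `unkeyed_pseudonym` and `build_reverse_table` are hypothetical names, and the ID space is shrunk so the demo runs quickly.

```python
import hashlib


def unkeyed_pseudonym(original_id: int) -> str:
    """Hypothetical keyless mapping: SHA-256 of the decimal ID (an assumption)."""
    return hashlib.sha256(str(original_id).encode("ascii")).hexdigest()


def build_reverse_table(id_space: int) -> dict:
    """Attacker's precomputation: hash every possible ID exactly once."""
    return {unkeyed_pseudonym(i): i for i in range(id_space)}


# Reduced ID space to keep the demo fast; the same loop over 10**7 IDs
# is still well within reach of an ordinary machine.
ID_SPACE = 100_000
reverse_table = build_reverse_table(ID_SPACE)

leaked = unkeyed_pseudonym(54_321)   # a pseudonym found in the published data
recovered = reverse_table[leaked]    # a single dictionary lookup reverses it
```

The attacker never needs to "guess the function" from known pairs here; knowing the algorithm and the small input range is enough.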

On the other hand, if there is some sort of secret information you can keep from an attacker, then you have more options available. For example, if you’re allowed a secret key, you could use that key as part of the hash so that an attacker couldn’t compute all possible hashes as described above without having the key. (@President James K. Polk’s comment mentions format-preserving encryption, which is typically done by having a secret key.)

score 0 · Answer 2 · answered Jun 19 '23 at 11:27

The short answer is encryption. Encrypt the ID numbers with a fixed key. That will give you a different set of numbers having a one-to-one relationship with the original numbers.

If you want to keep the encrypted number within the same, or a similar, range then you may need to use a technique from Format Preserving Encryption, such as Cycle Walking.

score 0 · Answer 3 · answered Jun 19 '23 at 12:38

It's not clear whether your anonymizing function should be reversible, but presumably it must be injective (one-to-one), i.e. every original ID must map to a different anonymous ID, or at least the chance of a collision must be effectively zero.

If it has to be reversible, i.e. given an anonymous ID, you must be able to restore the original ID, then you need to encrypt the ID.

However, since you mention GDPR, you probably need a non-reversible function, i.e. a one-way function such that given an anonymous ID there is no computationally feasible way to restore the original ID. Then you need a secure hash function.

In either case, you need to have a secret. The algorithm must be secret or include a secret of sufficiently large range of variation, otherwise an attacker could fairly simply create the lookup-table. Best practice is not to use secret algorithms, but only secret keys as input to non-secret algorithms.

Finally, there is the question of format preservation, i.e. should the anonymized IDs also be in the range [0..10^7[? I would suggest that you don't have that requirement, especially not if the original ID numbers are a monotonically increasing sequence. If the original ID numbers are random numbers in that range, it's less important.

If you don't have any particular reason for requiring format preservation, I'd suggest using HMAC-SHA-256. It's easy to implement, and hard to get really wrong. The algorithm is well known and supported in a million libraries and frameworks, and the only secret you need is a random key of 32 to 64 bytes. It is not actually guaranteed to be injective, but in practice collisions are vanishingly unlikely. If you must have an absolute guarantee, then it gets a little more complicated. It depends on what the actual scenario is.
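A minimal sketch of the HMAC-SHA-256 approach using Python's standard library follows. The function name `pseudonymize_id` and the choice to encode the ID as a decimal ASCII string are assumptions for the example; what matters is that the encoding stays the same everywhere the function is used.

```python
import hashlib
import hmac
import secrets


def pseudonymize_id(key: bytes, original_id: int) -> str:
    """Return the HMAC-SHA-256 of the ID as a 64-character hex string."""
    message = str(original_id).encode("ascii")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


# Generate the key once, store it securely and separately from the data,
# and reuse it on every run so the mapping stays repeatable.
key = secrets.token_bytes(32)

pseudo_a = pseudonymize_id(key, 42)
pseudo_b = pseudonymize_id(key, 42)   # same key and ID give the same pseudonym
```

The output is a 256-bit value rather than a number below 10^7, so this only fits if format preservation is not required. Without the key, an attacker cannot rebuild the table of all ten million pseudonyms.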
