3

I have ~2TB of CSVs where the first two columns contain two ID numbers. These need to be anonymized so the data can be used in academic research. The anonymization can be (but does not have to be) irreversible. These are NOT medical records, so I do not need the fanciest cryptographic algorithm.

The Question:

Standard hashing algorithms produce really long strings, but I will have to do a lot of ID-matching (i.e., 'for the subset of rows containing ID XXX, do...') to process the anonymized data, so this is not ideal. Is there a better way?

For example, if I know there are ~10 million unique account numbers, is there a standard way of using the set of integers [1:10 million] as replacement/anonymized IDs?

The computational constraint is that the data will likely be anonymized on a 32-core server with ~500GB of RAM.

cataclysmic
  • (a*x + b) % m, with m about 10 million, a odd and relatively prime wrt m; and keep a and b "secret" (see the sketch below these comments). – wildplasser Dec 25 '15 at 11:34
  • Is there a format in the account numbers (or each key)? – itsols Dec 25 '15 at 11:42
  • not if `gcd(a,m) == 1` ("relatively prime"). Try it out with {a,m} := (small) prime numbers. (For the OP, m must of course be >= max(original number).) – wildplasser Dec 25 '15 at 14:52
  • There is a format, but I won't get to see it -- I have to create the anonymization strategy without knowing the format a priori. – cataclysmic Dec 28 '15 at 12:35
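A minimal sketch of wildplasser's affine suggestion, in Python. The constants below are illustrative assumptions, not recommendations; the map is a bijection on [0, m) only while gcd(a, m) == 1, and anyone who learns a and b can reverse it:

# Affine map (a*x + b) % m from the comments above.
M = 10_000_000   # must be >= the largest original ID
A = 48_271       # odd and coprime to M (gcd(A, M) == 1); keep A and B secret
B = 7_654_321

def anonymize(x: int) -> int:
    return (A * x + B) % M

def deanonymize(y: int) -> int:
    # Reversal uses the modular inverse of A modulo M (Python 3.8+).
    return (pow(A, -1, M) * (y - B)) % M

assert deanonymize(anonymize(1_234_567)) == 1_234_567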

2 Answers

0

It seems you don't care about the IDs being reversible, but if it helps, you can try one of the format-preserving encryption schemes. They are pretty much designed for this use case.
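
For concreteness, here is a toy sketch of the Feistel-plus-cycle-walking construction that format-preserving encryption is built on. The key, half-width, and round count are illustrative assumptions; for anything real, use a vetted scheme such as FF1 rather than this hand-rolled version:

import hashlib

KEY = b"replace-with-a-secret-key"  # assumption: secret key material
DOMAIN = 10_000_000                 # size of the ID space
HALF_BITS = 12                      # 2**24 is the smallest balanced width covering DOMAIN
MASK = (1 << HALF_BITS) - 1

def _round_fn(half: int, rnd: int) -> int:
    # Keyed round function built from SHA-256 (toy construction).
    data = KEY + bytes([rnd]) + half.to_bytes(2, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big") & MASK

def _feistel(x: int, rounds: int = 4) -> int:
    # Balanced Feistel network: a keyed bijection on [0, 2**24).
    left, right = x >> HALF_BITS, x & MASK
    for rnd in range(rounds):
        left, right = right, left ^ _round_fn(right, rnd)
    return (left << HALF_BITS) | right

def anonymize(x: int) -> int:
    # Cycle-walk: re-encrypt until the result lands back inside the domain,
    # which keeps the overall map a bijection on [0, DOMAIN).
    y = _feistel(x)
    while y >= DOMAIN:
        y = _feistel(y)
    return y

Since 2**24 is only ~1.7x the domain, the cycle-walking loop averages fewer than two iterations per ID.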

Otherwise, if the hashes are too long, you can always just truncate them. Even if you replace each digit of the original ID with a hex digit from the hash, collisions are unlikely. You could first read through the file and check for collisions, though.

PS. If you end up hashing, make sure you prepend a salt of a reasonable size. Hashes of IDs in the range [1:10M] would be trivial to brute-force otherwise.
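
Putting both points together, a minimal sketch (assuming SHA-256 and one large random salt shared across the whole dataset):

import hashlib
import secrets

# One big random salt for the whole run. Keep it if you need to re-run the
# job consistently; discard it afterwards for an irreversible mapping.
SALT = secrets.token_bytes(32)

def anonymize(account_id: str, length: int = 12) -> str:
    # Salted SHA-256, truncated to `length` hex digits. 16**12 possible
    # values make collisions among ~10 million IDs unlikely, but verify.
    return hashlib.sha256(SALT + account_id.encode()).hexdigest()[:length]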

viraptor
  • If you use different salts for everything, you need to store them all. If you use the same salt for all, then it’s not really salt, is it? – Tom Zych Dec 25 '15 at 11:42
  • Technically no, it's not - maybe I shouldn't have called it a salt. If you want to make sure the same original ids correspond to the same hashes, you'd have to store them during processing. But I'm not sure how critical that is in this case. Just generating a huge random salt for all should provide what's needed (prevent bruteforcing). – viraptor Dec 25 '15 at 11:49
  • Salts in password stores are different so that both the search space is extended and common passwords don't hash to the same value. Here we only need the extended space. – viraptor Dec 25 '15 at 11:58
0

I will assume that you want to make a single pass, one CSV with ID numbers as input, another CSV with anonymized numbers as output. I will also assume the number of unique IDs is somewhere on the order of 10 million or less.

I think it would be best to use a totally arbitrary one-to-one function from the set of ID numbers (N) to the set of de-identified numbers (D). This would be more secure: if you used some sort of hash function and an adversary learned what the hash was, the numbers in N could be recovered without much trouble by a dictionary attack. Instead, I suggest a simple lookup table: ID 1234567 maps to de-identified number 4672592, and so on. The correspondence would be stored in another file, and an adversary without that file would not be able to do much.

With 10 million or fewer unique IDs, on a machine such as you describe, this is not a big problem. A sketch in Python:

import csv
import random

# Pre-shuffle the pool of replacement numbers; popping from the end of the
# shuffled list is then an O(1) random draw without replacement.
unused_numbers = list(range(10_000_000))
random.shuffle(unused_numbers)

mapping = {}

# Placeholder filenames; substitute your real input and output paths.
with open("input.csv", newline="") as fin, \
     open("anonymized.csv", "w", newline="") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for record in reader:
        for i in (0, 1):  # the two ID columns
            n = record[i]
            if n not in mapping:
                mapping[n] = unused_numbers.pop()
            record[i] = mapping[n]
        writer.writerow(record)

# Persist the correspondence as the lookup table file.
with open("mapping.csv", "w", newline="") as f:
    csv.writer(f).writerows(mapping.items())
Tom Zych
  • Can someone explain why this is getting downvoted? I was originally thinking of doing something along the same lines...the only downside I can see is that the object 'mapping' grows to be very big, which means the 'if N in mapping:...do' line may get awfully slow towards the end of the anonymization. – cataclysmic Dec 28 '15 at 12:37
  • @cataclysmic: It should not get slow. `mapping` is a `dict`, essentially a hash table, and access should be more like O(1) than O(N). At least, as long as there’s enough RAM to avoid swapping, and you should have plenty of that with the machine you mentioned. – Tom Zych Dec 28 '15 at 22:52