I have ~2TB of CSVs in which the first two columns contain ID numbers. These need to be anonymized so the data can be used in academic research. The anonymization can be (but does not have to be) irreversible. These are NOT medical records, so I do not need the fanciest cryptographic algorithm.
The Question:
Standard hashing algorithms produce really long strings, but I will have to do a lot of ID matching (i.e. "for the subset of rows in the data containing ID XXX, do ...") to process the anonymized data, so long hash strings are not ideal. Is there a better way?
For example, if I know there are ~10 million unique account numbers, is there a standard way of using the set of integers [1:10million] as replacement/anonymized IDs?
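To make the idea concrete, here is a rough sketch in Python of the kind of mapping I have in mind. It is only an illustration, not a tested pipeline: the function names `build_id_map` and `anonymize_rows` are made up for this example, and it assumes that both ID columns share one ID space and that the set of distinct IDs can be collected in memory first.

```python
import csv
import io
import random


def build_id_map(ids, seed=None):
    """Map each distinct ID to a distinct integer in 1..N, in shuffled order.

    Shuffling means the replacement integer does not reveal the sort order
    of the original IDs. Assumption: all distinct IDs fit in memory.
    """
    unique_ids = sorted(set(ids))
    replacements = list(range(1, len(unique_ids) + 1))
    random.Random(seed).shuffle(replacements)
    return dict(zip(unique_ids, replacements))


def anonymize_rows(rows, id_map, id_columns=(0, 1)):
    """Yield copies of rows with the ID columns replaced via id_map."""
    for row in rows:
        row = list(row)
        for col in id_columns:
            row[col] = str(id_map[row[col]])
        yield row


# Tiny made-up sample standing in for one of the CSV files.
sample = "A17,B42,3.5\nB42,C03,1.2\nA17,C03,9.9\n"
rows = list(csv.reader(io.StringIO(sample)))

# First pass: collect IDs from the first two columns and build the map.
id_map = build_id_map(value for row in rows for value in row[:2])

# Second pass: rewrite the rows with the integer IDs.
for row in anonymize_rows(rows, id_map):
    print(",".join(row))
```

With ~10 million unique IDs the dictionary itself should be small compared to the memory on the machine described below, and the mapping is irreversible once the dictionary is discarded (or reversible if it is kept somewhere safe). What I am unsure about is whether this two-pass approach is the standard way to do it at this scale.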
The computational constraint is that data will likely be anonymized on a 32-core ~500GB server machine.