Anonymizing data / replacing names

Question

Normally I anonymize my data with hashlib, applying the hash through .apply(hash).

Now I'm trying a new approach. Imagine I have the following DataFrame called df:

df = pd.DataFrame({'contributor':['eric', 'frank', 'john', 'frank', 'barbara'],
                   'amount payed':[10,28,49,77,31]})

  contributor  amount payed
0        eric            10
1       frank            28
2        john            49
3       frank            77
4     barbara            31

I want to anonymize it by turning the names into person1, person2, etc., like this:

output = pd.DataFrame({'contributor':['person1', 'person2', 'person3', 'person2', 'person4'],
                       'amount payed':[10,28,49,77,31]})

  contributor  amount payed
0     person1            10
1     person2            28
2     person3            49
3     person2            77
4     person4            31

My first thought was to summarize the name column so that each name is attached to a unique index, which I can then use as the number after 'person'.

score 8 · Accepted Answer · answered Mar 16 '18 at 13:03

I think a faster solution is to use factorize to get an integer code for each unique value, add 1, convert the result to a Series of strings, and prepend the string 'Person':

df['contributor'] = 'Person' + pd.Series(pd.factorize(df['contributor'])[0] + 1).astype(str)
print (df)
  contributor  amount payed
0     Person1            10
1     Person2            28
2     Person3            49
3     Person2            77
4     Person4            31
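To see what the one-liner does step by step, here is a minimal sketch. The names `codes`, `uniques`, `anonymized`, and `lookup` are illustrative and not part of the answer. `factorize` returns one integer code per row, numbered in order of first appearance, together with the unique values those codes refer to.

```python
import pandas as pd

df = pd.DataFrame({'contributor': ['eric', 'frank', 'john', 'frank', 'barbara'],
                   'amount payed': [10, 28, 49, 77, 31]})

# factorize returns (codes, uniques); codes[i] is the position of row i's value in uniques
codes, uniques = pd.factorize(df['contributor'])
print(codes.tolist())   # [0, 1, 2, 1, 3]
print(list(uniques))    # ['eric', 'frank', 'john', 'barbara']

# Passing index=df.index keeps the new labels aligned with df
# even when df does not have the default 0..n-1 index
anonymized = 'Person' + pd.Series(codes + 1, index=df.index).astype(str)

# Optional: a private lookup table, in case the mapping has to be reversed later
lookup = {'Person' + str(i + 1): name for i, name in enumerate(uniques)}
```

The one-liner in the answer builds the Series with a default index, so it lines up with df only when df also has the default index. That holds for the example data.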

jezrael

This is actually a really useful and fast method. Thank you for introducing me to factorize, I've never used it before! – Erfan Mar 16 '18 at 14:20

Beautiful! Thanks a lot! – Michael Dorner Nov 26 '20 at 09:14

For stand-alone cases `factorize` works well. But for cases where the anonymized values need to maintain referential integrity with a column in some other data frame (basically to retain a db-level referential relationship), a `hash`-based approach will be safer. [reference-safe-anonym-util-gist](https://gist.github.com/joshuamosesb/b68c3fd9ef84c33a6f6ff9330ecde35e) – Joshua Baboo Oct 08 '21 at 10:32
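A rough sketch of the hash-based idea from the comment above, not taken from the linked gist. The `SALT` value, the `pseudonymize` helper, the `addresses` frame, and its city values are all made up for illustration. The same name always maps to the same label, so two frames can still be joined after anonymizing.

```python
import hashlib
import pandas as pd

SALT = 'replace-with-a-secret-value'  # hypothetical secret; keep it out of shared code

def pseudonymize(name: str) -> str:
    """Map a name to a stable label derived from a salted SHA-256 digest."""
    digest = hashlib.sha256((SALT + name).encode('utf-8')).hexdigest()
    # Truncated for readability; a shorter prefix raises the (small) collision risk
    return 'person_' + digest[:10]

payments = pd.DataFrame({'contributor': ['eric', 'frank', 'frank'],
                         'amount payed': [10, 28, 77]})
addresses = pd.DataFrame({'contributor': ['frank', 'eric'],
                          'city': ['Utrecht', 'Leiden']})  # made-up example data

# Apply the same function to both frames so the key column still matches
payments['contributor'] = payments['contributor'].map(pseudonymize)
addresses['contributor'] = addresses['contributor'].map(pseudonymize)

joined = payments.merge(addresses, on='contributor', how='left')
```

Note that Python's built-in `hash()` is salted per interpreter process for strings by default, so it does not give stable labels across runs. A `hashlib` digest does.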

score 1 · Answer 2 · answered Dec 13 '18 at 18:51

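# factorize assigns each distinct name an integer code, starting at 0
labels, uniques = pd.factorize(df['contributor'])
# Prefix each code to get 'person_0', 'person_1', ...
labels = ['person_' + str(l) for l in labels]
df['contributor_anonymized'] = labels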

L. Astola


score 0 · Answer 3 · answered Mar 15 '18 at 22:14

Maybe try to create a data frame called "index" for this operation and keep unique name values inside it?

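Then build the masks from each unique name's row index and merge the resulting data frame index with the original data.

# Lookup table with one row per unique name
index = pd.DataFrame()
index['contributor'] = df['contributor'].unique()
# Each name's row position in the lookup table (plus 1) becomes its person number
index['mask'] = index['contributor'].apply(
    lambda x: 'person' + str(index[index['contributor'] == x].index[0] + 1))

# Merge on the shared 'contributor' column and keep only the masked name
df.merge(index, how='left')[['mask', 'amount payed']]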
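The same lookup-table idea can probably be written without a merge, using a plain dict and `Series.map`. This is only a sketch of that variant. The names `mapping` and `masked` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'contributor': ['eric', 'frank', 'john', 'frank', 'barbara'],
                   'amount payed': [10, 28, 49, 77, 31]})

# unique() keeps order of first appearance, so numbering follows the data
names = df['contributor'].unique()
mapping = {name: 'person' + str(i) for i, name in enumerate(names, start=1)}

# map() swaps each name for its label; assign() returns a copy and leaves df untouched
masked = df.assign(contributor=df['contributor'].map(mapping))
print(masked)
```

Keeping `mapping` around (privately) also makes it possible to apply the same labels to another frame later.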
