
I'd like to 'anonymize' or 'recode' a column in a pandas DataFrame. What's the most efficient way to do so? I wrote the following, but it seems likely there's a built-in function or better way.

dataset = dataset.sample(frac=1).reset_index(drop=True) # reorders dataframe randomly (helps anonymization, since order could have some meaning); drop=True discards the old index so the original order is not kept

# make dictionary of old and new values
value_replacer = 1
values_dict = {}   
for unique_val in dataset[var].unique():
    values_dict[unique_val] = value_replacer
    value_replacer += 1

# replace old values with new in a single pass (avoids collisions between old values and new codes)
dataset[var] = dataset[var].map(values_dict)
user1318135

2 Answers


An alternative, using the category dtype:

df.col.astype('category').cat.codes.add(1)
Out[697]: 
0    1
1    1
2    2
3    3
4    4
5    2
dtype: int8

MaxU's answer is faster, so prefer it if you only need the codes:

%timeit df.col.astype('category').cat.codes.add(1)  # Wen
1000 loops, best of 3: 437 µs per loop
%timeit df['col'] = pd.factorize(df['col'])[0] + 1  # MaxU
1000 loops, best of 3: 194 µs per loop
BENY
  • Also a great solution. Is there any reason to prefer either of these over the other (besides readability preference)? Seems likely that they're enacted similarly. – user1318135 Sep 11 '17 at 20:52
  • @user1318135 `category` provides you more flexibility with labels and the like. `pd.factorize` only gives you numbers. – cs95 Sep 11 '17 at 20:54
  • @user1318135 Also, if you only need the codes for the column, you should use MaxU's solution; it is at least 2 times faster than this one. – BENY Sep 11 '17 at 20:55
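
To make the difference discussed in these comments concrete, here is a small sketch. The sample Series is made up for illustration and is not from the question. As far as I know, the two approaches also number values differently: category codes follow the sorted order of the categories, while `pd.factorize` numbers values in order of first appearance.

```python
import pandas as pd

# Hypothetical sample column; values are deliberately not in alphabetical order.
s = pd.Series(["ccc", "aaa", "bbb", "aaa"])

# Category codes follow the sorted order of the categories (aaa, bbb, ccc).
cat = s.astype("category")
cat_codes = cat.cat.codes + 1          # 3, 1, 2, 1

# factorize numbers values in order of first appearance (ccc, aaa, bbb).
fact_codes = pd.factorize(s)[0] + 1    # 1, 2, 3, 2

# The categorical keeps its labels, so they can be renamed without changing the codes.
renamed = cat.cat.rename_categories({"aaa": "group_a"})
```

For anonymization this may matter: codes that follow sorted order still reveal the alphabetical ranking of the original values, whereas codes assigned after shuffling the rows do not.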

If I understand correctly, you want to factorize your values:

dataset[var] = pd.factorize(dataset[var])[0] + 1

Demo:

In [2]: df
Out[2]:
   col
0  aaa
1  aaa
2  bbb
3  ccc
4  ddd
5  bbb

In [3]: df['col'] = pd.factorize(df['col'])[0] + 1

In [4]: df
Out[4]:
   col
0    1
1    1
2    2
3    3
4    4
5    2
MaxU - stand with Ukraine
  • For others, note that [0] is necessary because factorize returns labels and associated unique values, so this is to just use the labels part of what is returned. The trailing + 1 is just to match my input question, where I started numbering from 1 instead of 0 (the default). – user1318135 Sep 11 '17 at 20:45
  • @user1318135, it's a good comment! Please feel free to edit the answer :) – MaxU - stand with Ukraine Sep 11 '17 at 20:46
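
A short sketch of what the comment above describes, reusing the values from the demo. The variable names (`codes`, `uniques`, `key`, `restored`) are my own and only for illustration.

```python
import pandas as pd

# Same values as the demo column above.
s = pd.Series(["aaa", "aaa", "bbb", "ccc", "ddd", "bbb"])

# factorize returns a pair: 0-based integer codes and the unique original values.
codes, uniques = pd.factorize(s)

# Shift the codes so numbering starts at 1, as in the question.
new_col = codes + 1

# Lookup table from new code back to original value (keep it private if anonymizing).
key = dict(enumerate(uniques, start=1))

# The original values can be recovered from the 1-based codes.
restored = uniques[new_col - 1]
```

One thing to watch for: by default, missing values get the code -1 from `pd.factorize`, so after the `+ 1` shift they would show up as 0.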