An improved version:
The previous solutions do not scale well when the dataframe is large.
Things also get complicated when you want to one-hot encode only one column and the original dataframe has more than one column.
Here is a more general and faster solution.
It is illustrated with an example df with two columns and 1 million rows:
import random
import string
import pandas as pd
# Build a 1,000,000-row example: a fruit column and a random lowercase letter column
df = pd.DataFrame(
{'1st': [random.choice(["orange", "apple", "banana"]) for _ in range(1000000)],
'2nd': [random.choice(string.ascii_lowercase) for _ in range(1000000)]}
)
The first 10 rows, from df.head(10), are:
1st 2nd
0 banana t
1 orange t
2 banana m
3 banana g
4 banana g
5 orange a
6 apple x
7 orange s
8 orange d
9 apple u
The frequency of each value, from df['2nd'].value_counts(), is:
s 39004
k 38726
n 38720
b 38699
t 38688
p 38646
u 38638
w 38611
y 38587
o 38576
q 38559
x 38558
r 38545
i 38497
h 38429
v 38385
m 38369
j 38278
f 38262
e 38241
a 38241
l 38236
g 38210
z 38202
c 38058
d 38035
Step 1: Define threshold
threshold = 38500
Step 2: Take the column(s) you want to one-hot encode, and replace the entries whose frequency is lower than the threshold with "others"
%timeit df.loc[df['2nd'].value_counts()[df['2nd']].values < threshold, '2nd'] = "others"
Time taken is 206 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 1 loop each). Note that %timeit is an IPython/Jupyter magic; drop it when running the line in a plain script.
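To make Step 2 easier to follow, here is a minimal sketch on a toy dataframe. The names small, small_threshold and row_counts are made up for illustration, and it uses Series.map with the value counts rather than indexing the counts directly; on this kind of data it should produce the same mask.

```python
import pandas as pd

# Hypothetical toy data: 'a' appears 3 times, 'b' twice, 'c' once.
small = pd.DataFrame({"2nd": ["a", "a", "a", "b", "b", "c"]})
small_threshold = 2  # illustrative threshold for this toy data only

# For every row, look up how often that row's value occurs in the column.
row_counts = small["2nd"].map(small["2nd"].value_counts())

# Rows whose value is rarer than the threshold are relabelled as "others".
small.loc[row_counts < small_threshold, "2nd"] = "others"

print(small["2nd"].tolist())
```

Only 'c' falls below the threshold here, so it is the only value that gets replaced.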
Step 3: Apply one-hot encoding as usual
df = pd.get_dummies(df, columns = ['2nd'], prefix='', prefix_sep='')
The first 10 rows after one-hot encoding, from df.head(10), become:
1st b k n o others p q r s t u w x y
0 banana 0 0 0 0 0 0 0 0 0 1 0 0 0 0
1 orange 0 0 0 0 0 0 0 0 0 1 0 0 0 0
2 banana 0 0 0 0 1 0 0 0 0 0 0 0 0 0
3 banana 0 0 0 0 1 0 0 0 0 0 0 0 0 0
4 banana 0 0 0 0 1 0 0 0 0 0 0 0 0 0
5 orange 0 0 0 0 1 0 0 0 0 0 0 0 0 0
6 apple 0 0 0 0 0 0 0 0 0 0 0 0 1 0
7 orange 0 0 0 0 0 0 0 0 1 0 0 0 0 0
8 orange 0 0 0 0 1 0 0 0 0 0 0 0 0 0
9 apple 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Step 4 (optional): If you want others to be the last column of the df, you can try:
df = df[[col for col in df.columns if col != 'others'] + ['others']]
This shifts others to the last column:
1st b k n o p q r s t u w x y others
0 banana 0 0 0 0 0 0 0 0 1 0 0 0 0 0
1 orange 0 0 0 0 0 0 0 0 1 0 0 0 0 0
2 banana 0 0 0 0 0 0 0 0 0 0 0 0 0 1
3 banana 0 0 0 0 0 0 0 0 0 0 0 0 0 1
4 banana 0 0 0 0 0 0 0 0 0 0 0 0 0 1
5 orange 0 0 0 0 0 0 0 0 0 0 0 0 0 1
6 apple 0 0 0 0 0 0 0 0 0 0 0 1 0 0
7 orange 0 0 0 0 0 0 0 1 0 0 0 0 0 0
8 orange 0 0 0 0 0 0 0 0 0 0 0 0 0 1
9 apple 0 0 0 0 0 0 0 0 0 1 0 0 0 0
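If you need this in more than one place, the four steps can be wrapped in a small function. This is only a sketch: the function name one_hot_with_others, its parameters and the demo dataframe are hypothetical, and it works on a copy so the input is left untouched. dtype=int is passed to get_dummies so the result stays 0/1 as in the tables above, since recent pandas versions return booleans by default.

```python
import pandas as pd

def one_hot_with_others(df, column, threshold, other_label="others"):
    """Hypothetical helper: one-hot encode `column`, lumping rare values together."""
    out = df.copy()  # leave the caller's dataframe unchanged

    # Step 2: relabel values that occur fewer than `threshold` times.
    counts = out[column].map(out[column].value_counts())
    out.loc[counts < threshold, column] = other_label

    # Step 3: one-hot encode only the chosen column.
    out = pd.get_dummies(out, columns=[column], prefix="", prefix_sep="", dtype=int)

    # Step 4: move the catch-all column to the end, if any value was relabelled.
    if other_label in out.columns:
        out = out[[c for c in out.columns if c != other_label] + [other_label]]
    return out

# Illustrative usage on made-up data: 'c' occurs once, below the threshold of 2.
demo = pd.DataFrame({"1st": ["x", "y", "x", "y", "x"],
                     "2nd": ["a", "a", "b", "b", "c"]})
encoded = one_hot_with_others(demo, "2nd", threshold=2)
print(encoded)
```

One thing to watch: with prefix='' the new columns are named after the raw values, so a value that matches an existing column name, or a genuine category called "others", would collide.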