LabelEncoder().fit_transform vs. pd.get_dummies for categorical coding

Question

It was recently brought to my attention that if you have a dataframe df like this:

   A      B   C
0  0   Boat  45
1  1    NaN  12
2  2    Cat   6
3  3  Moose  21
4  4   Boat  43

You can encode the categorical data automatically with pd.get_dummies:

df1 = pd.get_dummies(df)

Which yields this:

   A   C  B_Boat  B_Cat  B_Moose
0  0  45     1.0    0.0      0.0
1  1  12     0.0    0.0      0.0
2  2   6     0.0    1.0      0.0
3  3  21     0.0    0.0      1.0
4  4  43     1.0    0.0      0.0
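For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame above. The `dtype=float` argument is my assumption to match the 1.0/0.0 output shown (newer pandas versions default to boolean dummies), and `dummy_na=True` is an optional extra that is not part of the original call.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "A": [0, 1, 2, 3, 4],
    "B": ["Boat", np.nan, "Cat", "Moose", "Boat"],
    "C": [45, 12, 6, 21, 43],
})

# Only the string column B is expanded; A and C pass through unchanged.
# dtype=float is an assumption made to mirror the float output above.
df1 = pd.get_dummies(df, dtype=float)

# The NaN in row 1 becomes all zeros across the B_* columns.
# dummy_na=True adds an explicit indicator column for missing values instead.
df2 = pd.get_dummies(df, dtype=float, dummy_na=True)

print(df1)
print(df2.columns.tolist())
```

Note that row 1 is indistinguishable from "none of the categories" unless the missing-value indicator is requested.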

I typically use LabelEncoder().fit_transform for this sort of task before passing the result to pd.get_dummies, but if I can skip a few steps that'd be desirable.

Am I losing anything by simply using pd.get_dummies on my entire dataframe to encode it?

elyase · Accepted Answer · 2016-09-23T19:21:30.417

7

Yes, you can skip LabelEncoder if you only want to encode string features. On the other hand, if you have a categorical column of integers (instead of strings), pd.get_dummies will leave it as it is (see your A or C column, for example). In that case you should use OneHotEncoder. Ideally OneHotEncoder would support both integers and strings; that is being worked on at the moment.
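A minimal sketch of the integer case, assuming column A from the question is meant to be categorical (the question does not say so; that is my assumption for illustration). It shows the default pass-through, then two ways to force one-hot encoding: the `columns` argument of pd.get_dummies and scikit-learn's OneHotEncoder. Later scikit-learn releases (0.20 onward) also accept string categories directly, so the limitation described above applies to older versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"A": [0, 1, 2, 3, 4], "C": [45, 12, 6, 21, 43]})

# Default behaviour: integer columns are treated as numeric and left alone.
unchanged = pd.get_dummies(df)

# Option 1: name the integer column explicitly so pandas encodes it.
# dtype=float is an assumption, chosen to match the output style in the question.
encoded_pd = pd.get_dummies(df, columns=["A"], dtype=float)

# Option 2: OneHotEncoder expects 2-D input, hence the double brackets.
# The default result is a sparse matrix, so convert it to a dense array to inspect it.
encoder = OneHotEncoder()
encoded_sk = encoder.fit_transform(df[["A"]]).toarray()

print(unchanged.columns.tolist())
print(encoded_pd.columns.tolist())
print(encoded_sk.shape)
```

One practical difference: a fitted OneHotEncoder remembers its categories and can be reused on new data with `transform`, whereas pd.get_dummies derives the columns from whatever values happen to be present in the frame it is given.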

edited Sep 23 '16 at 19:21

answered Sep 22 '16 at 21:25


Okay thank you, this confirms my intuitions and I had already made the appropriate changes to my dataset to account for this so all's well. – Jonathan Bechtel Sep 23 '16 at 19:34
