It was recently brought to my attention that if you have a dataframe df like this:

   A      B   C
0  0   Boat  45
1  1    NaN  12
2  2    Cat   6
3  3  Moose  21
4  4   Boat  43

You can encode the categorical data automatically with pd.get_dummies:

df1 = pd.get_dummies(df)

Which yields this:

   A   C  B_Boat  B_Cat  B_Moose
0  0  45     1.0    0.0      0.0
1  1  12     0.0    0.0      0.0
2  2   6     0.0    1.0      0.0
3  3  21     0.0    0.0      1.0
4  4  43     1.0    0.0      0.0

I typically run LabelEncoder().fit_transform on columns like this before passing them to pd.get_dummies, but I would prefer to skip those steps if I can.

Am I losing anything by simply using pd.get_dummies on my entire dataframe to encode it?

root
Jonathan Bechtel

1 Answer

Yes, you can skip LabelEncoder if you only want to encode string features. On the other hand, if you have a categorical column of integers (instead of strings), pd.get_dummies will leave it as it is (see your A or C column, for example). In that case you should use OneHotEncoder. Ideally OneHotEncoder would support both integers and strings, but that is still being worked on at the moment.
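To make the difference concrete, here is a minimal sketch that rebuilds the dataframe from the question and compares the default call with one that names the integer column explicitly. The `columns` argument of pd.get_dummies is an alternative to the OneHotEncoder route suggested above, not something this answer originally proposed, and treating A as categorical is only an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Dataframe from the question: A and C are integers, B holds strings and a NaN.
df = pd.DataFrame({
    "A": [0, 1, 2, 3, 4],
    "B": ["Boat", np.nan, "Cat", "Moose", "Boat"],
    "C": [45, 12, 6, 21, 43],
})

# Default behaviour: only string/object/category columns are encoded,
# so the integer columns A and C pass through unchanged.
default_encoded = pd.get_dummies(df)

# Assumption for illustration: A is really a categorical code.
# Naming it in `columns` forces pandas to one-hot encode it as well.
forced_encoded = pd.get_dummies(df, columns=["A", "B"])

default_columns = list(default_encoded.columns)
forced_columns = list(forced_encoded.columns)

print(default_columns)
print(forced_columns)
```

In the first result, A is still a single integer column. In the second, it is replaced by one indicator column per distinct value. The NaN in row 1 of B produces all-zero indicators in both cases; pass dummy_na=True if missing values need their own column.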

elyase
  • Okay thank you, this confirms my intuitions and I had already made the appropriate changes to my dataset to account for this so all's well. – Jonathan Bechtel Sep 23 '16 at 19:34