I see that many feature engineering pipelines include a get_dummies step on the object features. For example, the sex column containing 'M' and 'F' is dummied into two columns with a one-hot representation. Why not directly encode 'M' and 'F' as 0 and 1 in the sex column? Does the dummy method have a positive impact on machine learning models, in both classification and regression? If so, why? Thanks.
-
Dummy values are called noisy labels. Yes, they are beneficial to a certain extent. – user1211 Dec 02 '16 at 09:18
-
To put it simply: separating one column into two increases the model's dimensionality by one. That does not guarantee a benefit (at least for an M/F column), but adding one more dimension to the feature set lets the model accept one more dimension of complexity, so in some cases it can improve accuracy. The downside is that the system has to maintain a larger feature set, and you also need a relatively larger training set to prevent overfitting. – SKLTFZ Dec 02 '16 at 09:24
-
Short answer: yes, of course. Many classifiers/regressors are only valid for numerical data, where a feature value of 3 counts three times as much as a value of 1 (e.g. SVMs, nearest neighbors). Others do not care much (random trees). Others can at least benefit from dummies (NNs). That magnitude assumption is of course a bad thing for categorical features, which is why dummies are created. This is very basic stuff; every ML tutorial should cover it. Build a simple linear regression example and it's easy to see. – sascha Dec 02 '16 at 10:23
2 Answers
In general, directly encoding a categorical variable with N distinct values as (0, 1, ..., N-1) and treating it as a numerical variable won't work with many algorithms, because you are giving ad hoc meaning to the different category values. The gender example happens to work since it is binary, but think of a price estimation example with car models. If there are N distinct models and you encode model A as 3 and model B as 6, this would mean, for example, that an OLS linear regression treats model B as affecting the response variable twice as much as model A. You can't simply assign such arbitrary meanings to different categorical values; the resulting model would be meaningless. To prevent this numerical ambiguity, the most common approach is to encode a categorical variable with N distinct values as N-1 binary, one-hot variables.
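Here is a minimal sketch of the point above using pandas; the car-model/price data is made up for illustration, not from the original post:

```python
import pandas as pd

# Toy data: a categorical "model" column and a numeric response
df = pd.DataFrame({
    "model": ["A", "B", "C", "A", "B"],
    "price": [10000, 20000, 15000, 11000, 19500],
})

# Naive integer encoding: imposes an arbitrary order and magnitude,
# so a linear model would treat the "distance" between codes as meaningful
df["model_ordinal"] = df["model"].astype("category").cat.codes

# One-hot (dummy) encoding: one 0/1 column per model, no implied ordering
dummies = pd.get_dummies(df["model"], prefix="model")
print(df.join(dummies))
```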

-
Thanks. In each of the N-1 binary columns, are 0 and 1 the best way to encode? – yanachen Dec 05 '16 at 02:32
To one-hot-encode a feature with N possible values you only need N-1 columns with 0/1 values. So you are right: binary sex can be encoded with a single binary feature. Using dummy coding with N features instead of N-1 shouldn't really add performance to any machine learning model, and it complicates some statistical analysis such as ANOVA. See the patsy docs on contrasts for reference.
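As a small illustration (a sketch assuming pandas, where the drop_first flag of get_dummies yields the N-1 encoding described above):

```python
import pandas as pd

s = pd.Series(["M", "F", "F", "M"], name="sex")

# Full dummy coding: N columns (here two), one per category
print(pd.get_dummies(s))

# N-1 coding: drop the first level; "F" becomes the implicit baseline
print(pd.get_dummies(s, drop_first=True))
```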
