I'm building a random forest in Python using scikit-learn, and I've applied one-hot encoding to all of the categorical variables. Question: if I apply one-hot encoding to my DV (dependent variable), do I use all of its dummy columns as the DV, or should the DV be handled differently?
Short answer: you can build the model WITHOUT one-hot encoding the dependent variable. – abhiieor Dec 03 '18 at 08:42
2 Answers
You need to encode every column whose values are not numeric, but the DV can be handled with a different encoding than one-hot, and so can the other non-numerical columns. E.g.: suppose there is a column with city names; you need to change it into numerical form, and that conversion can be done without one-hot encoding.
E.g.: there is a DV column for diabetes with the entries "yes" and "no". It can be mapped to integers directly, without one-hot encoding:
# map the string labels to integers; no one-hot needed for a binary DV
diabetes_map = {'yes': 1, 'no': 0}
df['diabetes'] = df['diabetes'].map(diabetes_map)
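An equivalent sketch using scikit-learn's LabelEncoder, which derives the mapping automatically (the toy df below is hypothetical; note that LabelEncoder assigns codes in alphabetical order, so 'no' becomes 0 and 'yes' becomes 1):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical toy frame standing in for the real dataset
df = pd.DataFrame({'diabetes': ['yes', 'no', 'yes']})

# LabelEncoder learns the classes and maps them to 0..n_classes-1
le = LabelEncoder()
df['diabetes'] = le.fit_transform(df['diabetes'])
print(df['diabetes'].tolist())  # [1, 0, 1]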

It would be better to give the reason for the downvote so that the answer can be corrected. – LOrD_ARaGOrN Jun 23 '19 at 04:19
It depends on the type of problem you have. For binary or multi-class problems, you do not need to one-hot encode the dependent variable in scikit-learn. One-hot encoding changes the shape of the output variable from a single dimension to multiple dimensions. The result is called a label-indicator matrix, where each column denotes the presence or absence of one label.
For example, doing one-hot encoding of the following:
['high', 'medium', 'low', 'high', 'low', 'high', 'medium']
will return this:
high  medium  low
  1     0      0
  0     1      0
  0     0      1
  1     0      0
  0     0      1
  1     0      0
  0     1      0
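As a minimal sketch of how such a matrix is produced, scikit-learn's LabelBinarizer does exactly this transformation. Note that it orders the columns alphabetically (high, low, medium), so the column order differs from the table above:

from sklearn.preprocessing import LabelBinarizer

labels = ['high', 'medium', 'low', 'high', 'low', 'high', 'medium']

# fit_transform returns the label-indicator matrix, one column per class
lb = LabelBinarizer()
indicator = lb.fit_transform(labels)
print(lb.classes_)  # ['high' 'low' 'medium'] (alphabetical order)
print(indicator)    # 7x3 matrix of 0/1 rows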
Not all classifiers in scikit-learn support this format (even though they support multi-class classification). Even for those that do, passing a label-indicator matrix triggers multi-label classification (in which more than one label can be present at once), which is not what you want in a multi-class problem.
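To make this concrete, here is a minimal sketch (with made-up toy data) showing that RandomForestClassifier accepts raw string labels directly, so the DV needs no encoding at all:

from sklearn.ensemble import RandomForestClassifier

# toy features and raw string labels; the DV is not encoded at all
X = [[0, 1], [1, 1], [2, 0], [3, 1], [2, 1]]
y = ['high', 'medium', 'low', 'high', 'low']

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# predictions come back as the original string labels
print(clf.predict([[1, 0]]))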
