I'm building a random forest in Python using scikit-learn, and I've applied one-hot encoding to all of the categorical variables. Question: if I apply one-hot encoding to my DV (dependent variable), do I use all of its dummy columns as the DV, or should the DV be handled differently?
Short answer: you can build the model WITHOUT one-hot encoding the dependent variable. – abhiieor Dec 03 '18 at 08:42
2 Answers
You need to encode every column whose values are not numeric, but the DV can be handled with a different encoding than one-hot, and so can the other non-numerical columns. E.g.: suppose there is a column with city names; you need to change it into numerical form, and that conversion can be done without one-hot encoding.
E.g.: there is a DV column for diabetes with the entries "yes" and "no". It can be mapped to integers directly, without one-hot encoding:
# map the string labels to integers; no one-hot needed for a binary DV
diabetes_map = {'yes': 1, 'no': 0}
df['diabetes'] = df['diabetes'].map(diabetes_map)
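An equivalent sketch using scikit-learn's LabelEncoder, which derives the mapping automatically (the toy df below is hypothetical; note that LabelEncoder assigns codes in alphabetical order, so 'no' becomes 0 and 'yes' becomes 1):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical toy frame standing in for the real dataset
df = pd.DataFrame({'diabetes': ['yes', 'no', 'yes']})

# LabelEncoder learns the classes and maps them to 0..n_classes-1
le = LabelEncoder()
df['diabetes'] = le.fit_transform(df['diabetes'])
print(df['diabetes'].tolist())  # [1, 0, 1]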

It would be better to give the reason for the downvote so that the answer can be corrected. – LOrD_ARaGOrN Jun 23 '19 at 04:19
It depends on the type of problem you have. For binary or multi-class problems, you do not need to one-hot encode the dependent variable in scikit-learn. One-hot encoding changes the shape of the output variable from a single dimension to multiple dimensions. The result is called a label-indicator matrix, where each column denotes the presence or absence of one label.
For example, doing one-hot encoding of the following:
['high', 'medium', 'low', 'high', 'low', 'high', 'medium']
will return this:
high  medium  low
  1     0      0
  0     1      0
  0     0      1
  1     0      0
  0     0      1
  1     0      0
  0     1      0
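As a minimal sketch of how such a matrix is produced, scikit-learn's LabelBinarizer does exactly this transformation. Note that it orders the columns alphabetically (high, low, medium), so the column order differs from the table above:

from sklearn.preprocessing import LabelBinarizer

labels = ['high', 'medium', 'low', 'high', 'low', 'high', 'medium']

# fit_transform returns the label-indicator matrix, one column per class
lb = LabelBinarizer()
indicator = lb.fit_transform(labels)
print(lb.classes_)  # ['high' 'low' 'medium'] (alphabetical order)
print(indicator)    # 7x3 matrix of 0/1 rows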
Not all classifiers in scikit-learn support this format (even though they support multi-class classification). Even for those that do, passing a label-indicator matrix triggers multi-label classification (in which more than one label can be present at once), which is not what you want in a multi-class problem.
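To make this concrete, here is a minimal sketch (with made-up toy data) showing that RandomForestClassifier accepts raw string labels directly, so the DV needs no encoding at all:

from sklearn.ensemble import RandomForestClassifier

# toy features and raw string labels; the DV is not encoded at all
X = [[0, 1], [1, 1], [2, 0], [3, 1], [2, 1]]
y = ['high', 'medium', 'low', 'high', 'low']

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# predictions come back as the original string labels
print(clf.predict([[1, 0]]))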
