3

I'm building a random forest in Python using scikit-learn, and I've applied one-hot encoding to all of the categorical variables. Question: if I apply one-hot encoding to my DV (dependent variable), do I use all of its dummy columns as the DV, or should the DV be handled differently?

Uncle_Timothy
  • 101
  • 1
  • 2
  • 10

2 Answers

0

You need to apply one-hot encoding to all the columns whose values are not numerical. You can handle the DV with one-hot encoding, and other non-numerical columns with some other encoding as well. E.g., suppose there is a column with city names; you need to change this into numerical form. This is called data molding. You can do this molding without one-hot encoding as well.
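For instance, a minimal sketch of one-hot encoding a hypothetical city column with pandas `get_dummies` (the column and city names are made up):

```python
import pandas as pd

# Hypothetical DataFrame with a non-numerical 'city' column
df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Delhi']})

# One-hot encode: one dummy column per distinct city
encoded = pd.get_dummies(df, columns=['city'])
print(encoded)
```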

E.g., there is a DV column for diabetes with entries "yes" and "no". Without one-hot encoding, you can map it directly:

diabetes_map = {'yes': 1, 'no': 0}
df['diabetes'] = df['diabetes'].map(diabetes_map)
LOrD_ARaGOrN
  • 3,884
  • 3
  • 27
  • 49
0

It depends on the type of problem you have. For binary or multi-class problems, you do not need to one-hot encode the dependent variable in scikit-learn. One-hot encoding would change the shape of the output variable from one dimension to multiple dimensions. This is called a label-indicator matrix, where each column denotes the presence or absence of that label.

For example, doing one-hot encoding of the following:

['high', 'medium', 'low', 'high', 'low', 'high', 'medium']

will return this:

high    medium    low
 1        0        0
 0        1        0
 0        0        1
 1        0        0
 0        0        1
 1        0        0
 0        1        0
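The same label-indicator matrix can be produced with scikit-learn's LabelBinarizer; note that it sorts the classes alphabetically, so its columns come out in the order high, low, medium rather than the order shown above:

```python
from sklearn.preprocessing import LabelBinarizer

y = ['high', 'medium', 'low', 'high', 'low', 'high', 'medium']

lb = LabelBinarizer()
Y = lb.fit_transform(y)  # shape (7, 3): one column per class

print(lb.classes_)  # classes are sorted alphabetically
print(Y)
```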

Not all classifiers in scikit-learn support this format (even though they support multi-class classification). Even in those that do, it will trigger multi-label classification (in which more than one label can be present at once), which is not what you want in a multi-class problem.
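To illustrate that the DV can stay one-dimensional, here is a sketch (the feature matrix and labels are made up) of fitting a scikit-learn classifier directly on string labels, with no one-hot encoding of the target:

```python
from sklearn.ensemble import RandomForestClassifier

# Made-up data: scikit-learn accepts string class labels as a 1-d target
X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = ['yes', 'no', 'yes', 'no']  # DV left as-is, no one-hot needed

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)
print(clf.classes_)  # the class labels the model learned
```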

Vivek Kumar
  • 35,217
  • 8
  • 109
  • 132