I am using H2O to build classification models (GBM, DRF and DL). My dataset contains a few categorical columns; if I want to use them as features, do I need to manually convert them into dummy variables? I read that GBM can dummify the categorical variables internally. Is that correct?
2 Answers
Yes. H2O is one of the few machine learning libraries that does not require the user to pre-process or one-hot encode (aka "dummy-encode") the categorical variables. As long as the column type is "factor" (aka "enum") in your data frame, H2O knows what to do automatically.
In particular, H2O allows direct use of categorical variables in tree-based methods such as Random Forest and GBM. Tree-based algorithms can use categorical data natively, and this typically leads to better performance than one-hot encoding. In GLM and Deep Learning, H2O will one-hot encode the categoricals automatically under the hood, so either way you don't need to do any pre-processing. If you want more control, you can choose the type of automatic encoding with the categorical_encoding argument.

IMHO, being able to handle categorical variables directly in a tree algorithm is a huge advantage of H2O.
If you one-hot encode a categorical variable, you have effectively taken one variable and split it into several variables whose values are mostly 0 (i.e., sparse). As Erin stated, this makes trees perform worse, because trees choose splits by "information gain": each sparse dummy feature carries less information gain per split, and so is less useful than the single original categorical feature.
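To see the sparsity concretely, here is a small pure-Python illustration (toy data, no H2O required): one 3-level column becomes three dummy columns, each of which is mostly zeros.

```python
# A toy categorical column with 3 levels and 12 rows.
colors = ["red", "blue", "green", "red", "blue", "red",
          "green", "blue", "red", "green", "blue", "red"]
levels = sorted(set(colors))  # ['blue', 'green', 'red']

# One-hot encoding: one 0/1 column per level.
encoded = {lvl: [1 if c == lvl else 0 for c in colors] for lvl in levels}

for lvl, col in encoded.items():
    print(f"{lvl}: {col.count(0)}/{len(col)} entries are zero")
# blue: 8/12 entries are zero
# green: 9/12 entries are zero
# red: 7/12 entries are zero
```

In general, with k levels each row puts a 1 in exactly one of the k dummy columns, so the dummies are zero in a fraction (k-1)/k of their entries, and the fraction grows with the cardinality of the categorical variable.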
