I am sorry if this question came up earlier, but I found nothing like it. I have a problem with predictive models. I would like to build an xgboost model and a random forest. The package I use requires that dummy variables be created for xgboost. The question is whether I should use the dummy-encoded set to build both models, even though random forests can handle categorical variables and do not require dummies? To test and compare the models, should I also change the categorical variables in the training set into dummies? In other words, do my training set and test set have to be the same for every model? Thank you very much for your help!

ann
  • You should be using the functions in R built for the purpose of constructing the correct matrix forms. I'm pretty sure this is covered in other questions that have answers. Perhaps you just need to search more completely on `model.matrix` and `factor`. – IRTFM Jan 11 '18 at 00:08
  • There is a machine learning stackexchange forum. There might be useful Q&A there, but I'm voting to close this as too broad for SO. – IRTFM Jan 11 '18 at 00:19
  • @42- Sir, visited your profile. You are just amazing. – Saurabh Agrawal Jan 11 '18 at 04:45
  • Also check the `dummyVars` function in `library(caret)`. – missuse Jan 11 '18 at 09:57

1 Answer

I guess you are using the mlr package as you have tagged your question with mlr.

Anyway, when you create dummies, you have to make sure that your training set does not contain variables that are missing from the test set (this can easily happen when you create dummies). Otherwise you will run into trouble when you try to make predictions on the test set, since the trained model assumes the test set has at least the same variables.
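
One way to avoid that mismatch is to create the dummies on the complete data set before splitting, so that the training and test sets end up with identical columns. Here is a minimal sketch, assuming the mlr package and a hypothetical data frame dat with one factor column (the names dat, x1, x2, y, train_set and test_set are just examples):

library(mlr)

# hypothetical data: one numeric feature, one factor feature, a numeric target
dat = data.frame(
  x1 = rnorm(100),
  x2 = factor(sample(c("a", "b", "c"), 100, replace = TRUE)),
  y  = rnorm(100)
)

# create the dummies once on the full data, then split into train and test
dat_dummy = createDummyFeatures(dat, target = "y")
idx = sample(nrow(dat_dummy), 70)
train_set = dat_dummy[idx, ]
test_set  = dat_dummy[-idx, ]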

Instead of creating dummies, you could also convert your categorical variables to integers (if I am not mistaken, this is what xgboost does internally anyway). This is why we force you to create dummies when you fit an xgboost model with mlr (see https://github.com/mlr-org/mlr/issues/1561).
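
For illustration, such an integer conversion could look like this (a sketch on a hypothetical factor column; note that it imposes an arbitrary ordering on the levels, which xgboost then treats as numeric):

x2 = factor(c("a", "b", "c", "a"))   # hypothetical categorical variable
as.integer(x2)                       # 1 2 3 1 -- each level mapped to an integer code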

If you don't want to create dummies, you could also do this:

library(mlr)
lrn = makeLearner("regr.xgboost")
train(lrn, bh.task) # this gives you an error: the learner does not declare support for factor features

lrn$properties = c(lrn$properties, "factors") # tell mlr that the learner accepts factors
train(lrn, bh.task) # this works as xgboost supports factors
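
If you prefer the dummy route instead of changing the learner's properties, a short sketch using mlr's createDummyFeatures helper on the same built-in task (lrn2 and bh.dummy are just example names):

lrn2 = makeLearner("regr.xgboost")       # fresh learner, without the "factors" tweak above
bh.dummy = createDummyFeatures(bh.task)  # one-hot encode the task's factor features
train(lrn2, bh.dummy)                    # trains without the error from above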
Giuseppe