Problem with predicting variables via Random Forest due to issue with categorical variable column

Question

Hi, I get the following error:

Error in predict.randomForest(classifier, newdata = grid_set) : 
  variables in the training data missing in newdata

when I run the following code:

classifier = randomForest(x = training_set[-3],
                          y = training_set$Purchased,
                          ntree = 10)
set = training_set[-3] 
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'Estimated Salary')
ygrid = predict(classifier, newdata = grid_set)

The issue is that there is a third column, a categorical variable (Purchased), that I thought I had removed with training_set[-3]. Does this not remove that column? Adding a third variable 'X3' for the Purchased column to grid_set did not solve the issue either.

I am wondering whether I simply need another way of removing the Purchased column from x in the training data, or whether I am going wrong elsewhere.
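
One thing that may be worth checking, sketched here under assumptions: `training_set[-3]` does drop the third column by position, and `predict` on a randomForest object compares the predictor names seen in training with `colnames(newdata)`. If the training column is actually named `EstimatedSalary` (no space), which is a guess because the data is not shown, then the name `'Estimated Salary'` given to grid_set would not match. The toy data frame, its column names, and the coarser grid step of 0.5 are all made up for illustration.

```r
# Toy stand-in for training_set; values and column names are assumptions.
training_set <- data.frame(
  Age = seq(-1.5, 1.5, length.out = 20),
  EstimatedSalary = rep(c(-1, 0.5, 1, -0.5), 5),
  Purchased = factor(rep(c(0, 1), 10))
)

set <- training_set[-3]   # drops the third column by position
names(set)                # "Age" "EstimatedSalary"

X1 <- seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.5)
X2 <- seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.5)
grid_set <- expand.grid(X1, X2)

# Names as in the question: the space makes the second name differ.
colnames(grid_set) <- c('Age', 'Estimated Salary')
missing_in_grid <- setdiff(names(set), colnames(grid_set))
missing_in_grid           # "EstimatedSalary"

# Copy the names from the training predictors instead of retyping them.
colnames(grid_set) <- colnames(set)

# Optional: only runs if the randomForest package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  classifier <- randomForest::randomForest(x = set,
                                           y = training_set$Purchased,
                                           ntree = 10)
  y_grid <- predict(classifier, newdata = grid_set)
}
```

Running `setdiff(names(set), colnames(grid_set))` on the real data should show which names, if any, differ.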

Can you provide a sample of your data using `dput(training_set)` and pasting the output in your question? — Ric S, Oct 12 '20 at 11:23
It might be a case of missing values. Have a look at this: https://stackoverflow.com/questions/8370455/how-to-use-random-forests-in-r-with-missing-values — Lime, Oct 12 '20 at 11:52
Alternatively, you can use the package `missRanger` to impute the missing values. — Lime, Oct 12 '20 at 16:55
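
To test the missing-values idea from the comments above, a quick base R check could look like the sketch below. The data frame is a hypothetical stand-in, since the real training_set has not been posted, and dropping incomplete rows is only one option next to imputation.

```r
# Hypothetical stand-in for training_set with a couple of NAs.
training_set <- data.frame(
  Age = c(25, NA, 47, 33),
  EstimatedSalary = c(40000, 52000, NA, 61000),
  Purchased = factor(c(0, 1, 1, 0))
)

# Count missing values per column.
na_per_column <- colSums(is.na(training_set))
na_per_column

# Keep only rows without any missing value.
complete_rows <- training_set[complete.cases(training_set), ]
nrow(complete_rows)
```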


0 Answers