I trained a random forest model on a training set using the randomForest package in R. One variable in the training set was of class character, and I converted it to numeric with as.numeric(factor()).

However, the same variable in the test set is still character. To my surprise, I can still get predictions from the trained model even though that variable is character in the test set. I also found that if I convert the variable in the test set with as.numeric(factor()) as well, the performance on the test set is different.

So does anyone know how R and the random forest model interpret and handle a character variable in the test set when the same variable in the training set is not of class character?

Thanks in advance!!

Ian

1 Answer

This is not overly surprising. Your original values are presumably character strings that can be converted to numeric, so randomForest is almost certainly doing exactly that when it predicts. The simple example below reproduces the behaviour:

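library(tibble)        # only tibble() is needed here
library(randomForest)

# Toy training data with a numeric predictor x
df <- tibble(x = 1:6, y = 1:6)

rf <- randomForest(y ~ ., data = df)

# The string "1" is coerced, so a prediction is returned
predict(rf, tibble(x = "1"))
# For comparison: the numeric value 1
predict(rf, tibble(x = 1))
# A string that is not a number, such as "b", fails
predict(rf, tibble(x = "b"))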
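
As for why the test-set performance changes once you also apply as.numeric(factor()) to the test set: one likely contributor is that factor() builds its levels from whatever values it is given, so converting the two sets independently can assign different integer codes to the same label. This is a minimal base R sketch with made-up labels (train_chr and test_chr are hypothetical, not your data), and I am assuming your test set does not contain exactly the same set of values as the training set:

```r
# Hypothetical labels: the test set is missing "high"
train_chr <- c("low", "mid", "high", "mid")
test_chr  <- c("mid", "low")

# Converting each set on its own: codes follow that set's sorted labels
train_num <- as.numeric(factor(train_chr))  # levels: high, low, mid
test_num  <- as.numeric(factor(test_chr))   # levels: low, mid

# Reusing the training levels keeps the codes aligned across both sets
train_levels    <- levels(factor(train_chr))
test_consistent <- as.numeric(factor(test_chr, levels = train_levels))

print(train_num)        # "mid" is coded 3 here
print(test_num)         # "mid" is coded 2 here
print(test_consistent)  # "mid" is coded 3 again
```

If this is what is happening in your data, passing the training levels to factor() for the test set (as in test_consistent) should make the encoded values comparable with what the model saw during training.
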
Robert Wilson