
I am training a random forest model with the randomForest package. Some of my variables are of class character. I am fairly sure that randomForest only accepts factor and numeric inputs, so I assume the character variables are being coerced to numeric automatically.

To understand how this may affect my modelling results, does anyone know the rule or algorithm R uses to coerce character values to numeric? Or is there source code I can look at?

I am using R version 4.0.1.

Thanks in advance.

An update: I checked using

getTree(mod,1,labelVar=TRUE)

I can see that if the character variables are converted to factors first, the "split point" in the output is an integer, which means the variable is treated as categorical (see: https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/getTree). If they are not converted to factors, the "split point" in the output is not an integer.

So my guess is that the values of those character variables are coerced to numeric values. But how?
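One way to take implicit coercion out of the picture is to convert character columns to factors explicitly before fitting. The following is a minimal base R sketch; the data frame `dat` and its column names are made up for illustration. Since R 4.0.0, `data.frame()` and `read.csv()` default to `stringsAsFactors = FALSE`, so character columns stay character unless they are converted.

```r
# Hypothetical example data; the column names are illustrative only.
dat <- data.frame(
  colour = c("red", "green", "blue", "green"),
  size   = c(1.2, 3.4, 2.2, 0.7),
  stringsAsFactors = FALSE
)

# Inspect the class of every column before modelling.
classes_before <- sapply(dat, class)

# Convert every character column to a factor; leave other columns untouched.
is_chr <- sapply(dat, is.character)
dat[is_chr] <- lapply(dat[is_chr], factor)

classes_after <- sapply(dat, class)
print(classes_before)
print(classes_after)
```

With the columns stored as factors, the level order is visible via `levels()` and does not depend on what the modelling function does internally.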

Ian
  • I think they are treated as dummy variables (one-hot encoding), in other words as ones and zeros. – maydin Jun 29 '20 at 10:02

1 Answer


I am not sure about random forests in R specifically, but I am fairly convinced that it only takes factors. If it does accept characters as well, it will convert them to factors, not to numeric.

There is also no general conversion from arbitrary character values to numeric in R.
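To illustrate that last point, here is a small base R sketch (the example vectors are made up): `as.numeric()` only succeeds on strings that already look like numbers and returns `NA` with a warning for everything else.

```r
# Strings that look like numbers are parsed as numbers.
numeric_strings <- c("1", "2.5", "10")
parsed <- as.numeric(numeric_strings)

# Arbitrary labels have no numeric meaning: each becomes NA (with a warning,
# suppressed here so the script runs quietly).
labels <- c("low", "high", "medium")
converted <- suppressWarnings(as.numeric(labels))

print(parsed)
print(converted)
```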

Georgery
  • I guess your conclusion that it only takes factors is not correct. See this post: https://stackoverflow.com/questions/63186926/how-randomforest-package-in-r-interprets-character-variables – Ian Jul 31 '20 at 08:00
  • I checked using `getTree(mod,1,labelVar=TRUE)`. I can see that if the character variables are converted to factors first, the "split point" in the output is an integer, which means the variable is treated as categorical (see: https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/getTree). If they are not converted to factors, the "split point" in the output is not an integer. So my guess is that the values of those character variables are coerced to numeric values. But how? – Ian Jul 31 '20 at 08:00
  • Again, a guess: factors are basically integer vectors where each integer corresponds to a level. So, my guess is that if you have a character vector `c("1", "2", "3", "1")` but the levels are `c("3", "2", "1")`, then the integers resulting from the factor vector would be `c(3, 2, 1, 3)`. Check the levels of the factor you're dealing with (`levels(your_vector)`); maybe it helps. – Georgery Jul 31 '20 at 08:13
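The factor example from the last comment can be checked directly in base R. This sketch only shows how factor levels map to integer codes; it makes no claim about what randomForest does internally.

```r
x <- c("1", "2", "3", "1")

# Explicit level order: "3" is level 1, "2" is level 2, "1" is level 3.
f_custom <- factor(x, levels = c("3", "2", "1"))
codes_custom <- as.integer(f_custom)

# Default level order: the sorted unique values, here "1", "2", "3".
f_default <- factor(x)
codes_default <- as.integer(f_default)

print(levels(f_custom))
print(codes_custom)
print(levels(f_default))
print(codes_default)
```

The same character vector yields different integer codes depending on the level order, which is why inspecting `levels()` is worthwhile.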