Random Forest for a mixture of categorical, numeric and "unwanted" variables which include missing values

Question

I am trying to use the randomForest package in R on a data set that includes categorical and numeric variables, as well as some "unwanted" columns (columns that I do not want to include as predictor variables). Moreover, some of the variables that I do want to use as predictors contain missing values. How can I handle that?

score 0 · Accepted Answer · answered Oct 20 '17 at 09:01

I assume your dataset looks something like this:

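library(randomForest)
set.seed(1)  # make the random columns reproducible

# param1 must be a factor for randomForest; since R 4.0 strings are no
# longer converted automatically, so request it explicitly.
mydf <- data.frame(target = 1:100,
                   param1 = c(rep("a", 10), rep("b", 50),
                              rep("c", 20), rep("a", 15), rep(NA, 5)),
                   param2 = runif(100, 0, 1),
                   param3 = c(runif(20, 1, 10), runif(50, 20, 30), rep(NA, 10),
                              runif(10, 0, 5), runif(10, 70, 80)),
                   stringsAsFactors = TRUE)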

To use only the desired columns:

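a. You can specify in your formula which columns you want to use in your random forest:

myrf <- randomForest(target ~ param1 + param2, mydf) # this excludes param3

Note that with the sample data this call stops with a missing-values error until the NA values in param1 are handled (see the NA section below).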

b. Alternatively, you can subset your dataset to keep only the desired columns:
```
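# Column names must be quoted strings when indexing a data frame
mydf2 <- mydf[, c("target", "param1", "param2")]
myrf <- randomForest(target ~ ., mydf2)  # "." means all remaining columns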
```
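When there are many columns, it can be easier to list the ones to drop rather than the ones to keep. The following is a minimal sketch of that idea, not part of the original answer: the `unwanted` vector is a hypothetical name, and the toy data deliberately contains no missing values so that the model fit runs as-is. The fit is guarded so the snippet still runs if randomForest is not installed.

```r
# Toy data with the same column names as the example above, but without NAs
set.seed(42)
mydf <- data.frame(
  target = 1:100,
  param1 = factor(rep(c("a", "b", "c", "a"), times = c(10, 50, 20, 20))),
  param2 = runif(100, 0, 1),
  param3 = runif(100, 1, 10)
)

# Hypothetical list of columns that should not be used as predictors
unwanted <- c("param3")

# Keep every column whose name is not in 'unwanted'
keep  <- setdiff(names(mydf), unwanted)
mydf2 <- mydf[, keep, drop = FALSE]

# Fit on the reduced data frame; "." now expands to param1 and param2 only
if (requireNamespace("randomForest", quietly = TRUE)) {
  myrf <- randomForest::randomForest(target ~ ., data = mydf2)
  print(myrf)
}
```

Using `setdiff()` on the column names keeps the original column order and avoids retyping every predictor when only a few columns need to be excluded.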
To handle NA values:

a. You may try to impute them, for example with na.roughfix() or rfImpute() from the randomForest package.

b. Or you can use another library that may handle them, such as rpart.
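The two options above could look roughly like the sketch below. It assumes the randomForest and rpart packages are installed and rebuilds the sample data with `param1` as a factor. `na.roughfix()` is only a crude imputation (median for numeric columns, most frequent level for factors), so treat it as a starting point rather than a recommended method; the variable names `mydf_imp`, `rf_imp`, `rf_omit` and `tree` are illustrative.

```r
library(randomForest)  # randomForest(), na.roughfix()
library(rpart)         # rpart()

set.seed(1)
mydf <- data.frame(
  target = 1:100,
  param1 = factor(c(rep("a", 10), rep("b", 50), rep("c", 20),
                    rep("a", 15), rep(NA, 5))),
  param2 = runif(100, 0, 1),
  param3 = c(runif(20, 1, 10), runif(50, 20, 30), rep(NA, 10),
             runif(10, 0, 5), runif(10, 70, 80))
)

# Option a: rough imputation before fitting
# (numeric NAs -> column median, factor NAs -> most frequent level)
mydf_imp <- na.roughfix(mydf)
rf_imp   <- randomForest(target ~ ., data = mydf_imp)

# Variant of a: no imputation, simply drop the rows that contain NAs
rf_omit <- randomForest(target ~ ., data = mydf, na.action = na.omit)

# Option b: rpart accepts missing predictor values directly
# (it falls back on surrogate splits)
tree <- rpart(target ~ ., data = mydf)
```

Dropping rows is the simplest route but discards 15 of the 100 sample rows here, which is why imputation or a method that tolerates NAs is often preferred when missing values are common.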

Finally, I suggest you have a look at this thread.

How to build random forests in R with missing (NA) values?
