I have a very large data frame with 790,000 rows and 140 predictors. Some of these are strongly correlated with each other and on different scales. With the randomForest package, I can grow a forest on each core using only a small sample of the data via foreach, then merge the results with the combine() function to get one big forest, like so:
rf.STR <- foreach(ntree = rep(125, 8), .combine = combine, .multicombine = TRUE,
                  .packages = 'randomForest') %dopar% {
  sample.idx <- sample.int(nrow(dat), size = sample.size, replace = TRUE)
  randomForest(x = dat[sample.idx, -1, with = FALSE],
               y = dat[sample.idx, retention],
               ntree = ntree)
}
The correlated variables on different scales lead me to want to use conditional inference forests from the party package, but there is no combine() method for cforest objects, so I'm not sure how to merge several cforest objects into one importance plot or one set of predictions.
Is there a way to train one big cforest on smaller subsets of the data, or make several small cforests and combine them into one bigger conditional forest model?
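The closest thing I've come up with is a workaround: keep the subforests in a list and average their predictions (and variable importances) myself, rather than producing a single merged model. A minimal sketch of that idea, assuming a regression response and hypothetical helper names (fit_subforests, predict_subforests, sample_size, and the response name are all mine, not from party):

library(party)

# Fit several small cforests on bootstrap subsamples of the data.
# NOTE: this does NOT merge them into one model; they stay in a list.
fit_subforests <- function(dat, response, n_forests = 8, sample_size = 10000) {
  lapply(seq_len(n_forests), function(i) {
    idx <- sample.int(nrow(dat), size = sample_size, replace = TRUE)
    cforest(reformulate(".", response), data = dat[idx, ],
            controls = cforest_unbiased(ntree = 125))
  })
}

# Average the per-forest predictions for new data (regression case).
predict_subforests <- function(forests, newdata) {
  preds <- sapply(forests, function(f) as.numeric(predict(f, newdata = newdata)))
  rowMeans(preds)
}

# Importances could be pooled the same way, e.g. by averaging
# varimp(f) across the list, though I don't know if that's statistically sound.

But this averaging is an ensemble-of-ensembles hack, not a true combine(), so I'd still prefer a proper way to do it if one exists.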