
I have a very large data frame with 790,000 rows and 140 predictors. Some of these are strongly correlated with each other and on different scales. With the randomForest package, I can grow a forest on each core using only a small sample of the data via foreach, then merge them with the combine() function to get one big forest, like so:

rf.STR = foreach(ntree = rep(125, 8), .combine = combine, .multicombine = TRUE,
                 .packages = 'randomForest') %dopar% {
  sample.idx = sample.int(nrow(dat), size = sample.size, replace = TRUE)
  randomForest(x = dat[sample.idx, -1, with = FALSE],
               y = dat[sample.idx, retention], ntree = ntree)
}

The correlated variables on different scales lead me to want to use conditional random forests from the party package, but there is no combine() method for cforest objects, so I'm not sure how to combine several cforest objects to get one importance plot or one set of predictions.

Is there a way to train one big cforest on smaller subsets of the data, or make several small cforests and combine them into one bigger conditional forest model?

Christopher Aden
  • Try the `h2o` package, which can be downloaded [here](http://www.h2o.ai). Very fast, open source, and runs in parallel. They update so fast that their documentation sometimes lags a bit in consistency. – Stereo Mar 29 '16 at 03:13
  • Thanks for the pointer! From reading the RF documentation, it doesn't seem like it has conditional trees. Since some of my predictors are strongly correlated, if I run the usual RF, the variable importance will put a bunch of highly predictive but correlated variables at the top, which is less useful than if it only selected one of them. – Christopher Aden Mar 29 '16 at 07:04
  • The implementation `party::cforest` does not support parallelization (as far as I know). In the (slower) reimplementation `partykit::cforest` we have added support for parallelization, but at the moment the `partykit` version does not yet provide all features of the old `party` implementation. Specifically, no variable importance measures are implemented at the moment. Thus, this will not be of much use to you. You could contact the `party` maintainer (Torsten Hothorn) directly to ask whether he has a recommendation for how to split up the learning of `party::cforest`. – Achim Zeileis Mar 29 '16 at 21:26
  • Thanks, Achim. That's probably the best I'm going to do on the subject. If you would submit that as a full answer, I'll accept it, until such time as partykit gets an importance() function :). – Christopher Aden Mar 29 '16 at 22:05
  • It's really so unsatisfactory as an answer that a comment is embarrassing enough :-) – Achim Zeileis Mar 29 '16 at 22:40
  • Hah! Alright, fair enough. It seems the state of conditional random forests is somewhat undeveloped compared to other random forest variations. Oh well. – Christopher Aden Mar 30 '16 at 06:03
  • @ChristopherAden, did you ever solve this problem? Running into the same issue now – Parseltongue Jun 19 '19 at 17:11
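Following up on the `partykit` pointer in the comments above, a minimal sketch of that route, assuming a recent `partykit` release where `cforest()` accepts a `cores` argument and `varimp()` is implemented for cforest objects (both were added after this exchange):

```r
library(partykit)

# Assumption: a recent partykit where cforest() grows trees in parallel
# via parallel::mclapply when `cores` > 1 (forking, so Unix-alikes only;
# on Windows this effectively runs on a single core).
cf <- cforest(Species ~ ., data = iris,
              ntree = 400,  # total trees, split across cores
              cores = 4)

# Conditional permutation importance, which down-weights predictors that
# only look important because they are correlated with a truly useful one
vi <- varimp(cf, conditional = TRUE)
sort(vi, decreasing = TRUE)
```

This avoids the manual merge entirely, at the cost of `partykit`'s slower per-tree fitting compared to `party`.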

1 Answer


make several small cforests and combine them into one bigger conditional forest model.

library(snowfall)
library(party)

# Train one big party::cforest by growing `threads` smaller forests in
# parallel (via snowfall) and appending their trees into a single ensemble.
cforestmt <- function(formula, data = list(), subset = NULL, weights = NULL,
                      controls = cforest_unbiased(), xtrafo = ptrafo,
                      ytrafo = ptrafo, scores = NULL, threads = 8) {

    # Fewer trees than threads: just fit a single-threaded cforest
    if (controls@ntree < threads) {
        return(cforest(formula, data = data, subset = subset, weights = weights,
                       controls = controls, xtrafo = xtrafo, ytrafo = ytrafo,
                       scores = scores))
    }

    # Round the per-worker forest size up so every worker grows the same
    # integer number of trees; the total may slightly exceed the requested ntree
    fsize <- controls@ntree / threads
    if (fsize - round(fsize) != 0) {
        fsize <- ceiling(fsize)
        message("Rounding forest size to ", fsize * threads)
    }
    controls@ntree <- as.integer(fsize)

    # Grow one small forest per worker in a socket cluster
    sfInit(parallel = TRUE, cpus = threads, type = "SOCK")
    sfClusterEval(library(party))
    sfExport('formula', 'data', 'subset', 'weights', 'controls',
             'xtrafo', 'ytrafo', 'scores')
    fr <- sfClusterEval(cforest(formula, data = data, subset = subset,
                                weights = weights, controls = controls,
                                xtrafo = xtrafo, ytrafo = ytrafo,
                                scores = scores))
    sfStop()

    # Append the trees and the matching per-tree slots of all small forests
    # onto the first one, yielding one forest with fsize * threads trees
    fr[[1]]@ensemble <- unlist(lapply(fr, function(y) y@ensemble), recursive = FALSE)
    fr[[1]]@where    <- unlist(lapply(fr, function(y) y@where),    recursive = FALSE)
    fr[[1]]@weights  <- unlist(lapply(fr, function(y) y@weights),  recursive = FALSE)

    # The first forest now holds the combined ensemble
    return(fr[[1]])
}
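A usage sketch, assuming `cforestmt()` above has been sourced and the `party` and `snowfall` packages are installed (`iris` here is only an illustrative dataset; the control settings are arbitrary):

```r
library(party)

set.seed(42)
# Grow 4 forests of 100 trees each in parallel, merged into one 400-tree forest
fit <- cforestmt(Species ~ ., data = iris,
                 controls = cforest_unbiased(ntree = 400, mtry = 2),
                 threads = 4)

# The merged object behaves like a single cforest: predictions and
# conditional variable importance are computed over all appended trees
pred <- predict(fit, newdata = iris)
vi   <- varimp(fit, conditional = TRUE)
```

Note that `varimp()` and `predict()` iterate over the `@ensemble` slot, which is why appending the per-tree slots above is enough to make the merged object act like one forest.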
Nobody
    Hi, do add a bit of explanation along with the code, as it helps readers understand it. Code-only answers are frowned upon. – Bhargav Rao Nov 15 '16 at 09:50