
I am relatively new to the ocean of machine learning, so please excuse me if some of my questions are really basic.

Current situation: The overall goal is to improve some R code that uses the h2o package, running on a supercomputer cluster. The data is large enough that a single node running h2o takes more than a day, so we decided to use multiple nodes to run the model. I came up with this idea:

(1) Assign each node to build (nTree / num_nodes) trees and save them as a model;

(2) Run the job on the cluster, with each node growing its (nTree / num_nodes) share of the trees in the forest;

(3) Merge the trees back together to re-form the original forest, and average the measurement results across the models.

I later realized this could be risky, but I cannot find an authoritative statement for or against it, since I am not a machine-learning-focused programmer.

Questions:

  1. If handling a random forest this way carries some risk, please point me to a link so I can get a basic idea of why it is not right.
  2. If this way is actually an "ok" way to do it, what should I do to merge the trees? Is there a package or method I can borrow?
  3. If this is actually a solved problem, please point me to a link; I may have searched with the wrong keywords. Thank you!

A concrete example with real numbers:

I have a random forest task with 80k rows and 2k columns, and I want 64 trees in total. What I have done is put 16 trees on each of four nodes, each running on the whole dataset, so each of the four nodes produces an RF model. I am now trying to merge the trees from each model into one big RF model and average the measurements (from each of those four models).
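For what it's worth, the merge step itself is mechanically simple in libraries that expose their fitted trees. The question uses h2o in R, but here is a hypothetical sketch in Python's scikit-learn, where a forest's trees live in the `estimators_` list, simulating the four 16-tree "node" models with different seeds:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A stand-in dataset (the real task is 80k x 2k; this is just illustrative).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Simulate 4 "nodes", each growing 16 trees with a DIFFERENT seed.
parts = [RandomForestClassifier(n_estimators=16, random_state=seed).fit(X, y)
         for seed in range(4)]

# Merge: concatenate the fitted trees into the first forest.
merged = parts[0]
for part in parts[1:]:
    merged.estimators_ += part.estimators_
merged.n_estimators = len(merged.estimators_)

print(merged.n_estimators)  # 64 trees, as if trained in one run
```

The different seeds matter: with identical seeds, every node would grow the same 16 trees and the merge would add nothing.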

windsound

2 Answers


There is no need to merge the models. Unlike with boosting methods, every tree in a Random Forest is grown independently (just don't set the same seed prior to kicking off RF on each node!).

You are basically doing what Random Forest does on its own, which is to grow X independent trees and then average across the votes. Many packages provide an option to specify the number of cores or threads, in order to take advantage of this feature of RF.
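As an illustration of that option (a scikit-learn stand-in, not h2o itself): setting `n_jobs=-1` grows the trees in parallel across all available cores, which works precisely because each tree is independent.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# n_jobs=-1 builds the 64 independent trees on all available cores at once.
rf = RandomForestClassifier(n_estimators=64, n_jobs=-1, random_state=0).fit(X, y)
print(rf.score(X, y))
```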

In your case, since you have the same number of trees per node, you'll get 4 "models" back, but those are really just collections of 16 trees. To use them, I'd just keep the 4 models separate and, when you want a prediction, average the predictions from the 4 models. Assuming you're going to be doing that more than once, you could write a small wrapper function to predict with the 4 models and average the output.
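A minimal sketch of such a wrapper, again using scikit-learn as a hypothetical stand-in for the four h2o models (`ensemble_predict` is an invented name; averaging class probabilities before taking the argmax is one reasonable way to combine the votes):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=15, random_state=0)

# Four 16-tree forests with distinct seeds stand in for the four nodes.
models = [RandomForestClassifier(n_estimators=16, random_state=s).fit(X, y)
          for s in range(4)]

def ensemble_predict(models, X):
    """Average class probabilities across the models, then take the argmax."""
    avg = np.mean([m.predict_proba(X) for m in models], axis=0)
    return np.argmax(avg, axis=1)

preds = ensemble_predict(models, X)
```

Because every tree is independent, averaging the four models' outputs gives the same kind of ensemble you'd get from one 64-tree forest.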

Tchotchke

80,000 rows by 2,000 columns is not overly large and should not take that long to train an RF model.

It sounds like something unexpected is happening.

While you can try to average models if you know what you are doing, I don't think it should be necessary in this case.

TomKraljevic