
I have about 300,000 rows of data and 10 features in my model, and I want to fit a random forest using the randomForest package in R.

To maximise the number of trees I can get in the forest in a fixed window of time without ruining generalisation, what are sensible ranges to set the parameters to?

user2763361
  • This is more a statistical question than a programming question; you should consider migrating it to Cross Validated. You might also want to explore cross-validation to set your parameters! – dickoa Jan 02 '14 at 17:07
  • @dickoa This is a time-complexity problem. I want to know the ranges of parameter values for which the time complexity is feasible; I will then use cross-validation within the Cartesian product of those intervals. – user2763361 Jan 02 '14 at 17:09
  • I don't see what's preventing you from simply doing some tests on a smaller version of your data to figure this out yourself. – joran Jan 02 '14 at 17:47

1 Answer


Usually you can get away with tuning just mtry, as explained here, and the default is often best:

https://stats.stackexchange.com/questions/50210/caret-and-randomforest-number-of-trees
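For instance, a minimal sketch (assuming the data sit in a data frame df with a numeric response column y; those names are placeholders, not from the question) that simply leaves mtry at its regression default of floor(p/3):

    library(randomForest)

    ## Placeholder data frame `df` with response `y` and 10 predictors.
    ## mtry is not set, so the regression default floor(p/3) is used.
    set.seed(1)
    fit <- randomForest(y ~ ., data = df, ntree = 500)
    print(fit)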

But there is a function, tuneRF, in randomForest that will help you find an optimal mtry (for a given ntreeTry), as explained here:

setting values for ntree and mtry for random forest regression model
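A rough sketch of how the call might look, with X (predictors) and y (response) standing in as placeholders for the 300k x 10 data:

    library(randomForest)

    ## Sketch only -- X and y are placeholders, not objects from the question.
    set.seed(1)
    tune_res <- tuneRF(x = X, y = y,
                       mtryStart  = 3,     # starting mtry, roughly p/3 for regression
                       ntreeTry   = 100,   # small forest per candidate to keep tuning fast
                       stepFactor = 2,     # double/halve mtry at each step
                       improve    = 0.01,  # keep stepping while OOB error improves by 1%
                       doBest     = FALSE) # return the mtry vs. OOB-error table
    tune_res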

How long it takes you will have to test yourself; the total fitting time will be roughly the product of folds × tuning candidates × ntree.
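One way to estimate that (a sketch only, reusing the placeholder X and y from above) is to time a small forest on subsamples of increasing size and extrapolate, since runtime grows roughly linearly in ntree:

    library(randomForest)

    ## Time a 50-tree forest on progressively larger subsamples.
    set.seed(1)
    for (n in c(10000, 50000, 100000)) {
      idx <- sample(nrow(X), n)
      elapsed <- system.time(randomForest(X[idx, ], y[idx], ntree = 50))["elapsed"]
      cat(sprintf("%6d rows, 50 trees: %.1f sec\n", n, elapsed))
    }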

The only speculative point I would add is that, with 300,000 rows of data, you might reduce the runtime without losing predictive accuracy by growing each tree on a smaller bootstrap sample of the data.
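If you want to try that, the sampsize argument of randomForest draws a smaller sample for each tree; a speculative sketch with the same placeholder X and y (the value 20,000 is just an illustration):

    library(randomForest)

    ## Grow each tree on 20,000 bootstrapped rows instead of all 300,000,
    ## trading a little per-tree accuracy for many more trees per unit time.
    set.seed(1)
    fit_sub <- randomForest(X, y,
                            ntree    = 1000,
                            sampsize = 20000,
                            replace  = TRUE)
    fit_sub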

Stephen Henderson