
I have about 300,000 rows of data and 10 features in my model, and I want to fit a random forest using the randomForest package in R.

To maximise the number of trees I can get in the forest in a fixed window of time without ruining generalisation, what are sensible ranges to set the parameters to?

user2763361
  • This is more a statistical question than a programming question; you should consider migrating it to Cross Validated. You might also want to explore cross-validation to set your parameters! – dickoa Jan 02 '14 at 17:07
  • @dickoa This is a time-complexity problem. I want to know the ranges of parameter values for which the time complexity is feasible; I will then use cross-validation within the Cartesian product of those intervals. – user2763361 Jan 02 '14 at 17:09
  • I don't see what's preventing you from simply doing some tests on a smaller version of your data to figure this out yourself. – joran Jan 02 '14 at 17:47

1 Answer


Usually you can get away with tuning just mtry, as explained here, and the default is often best:

https://stats.stackexchange.com/questions/50210/caret-and-randomforest-number-of-trees
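For instance, a minimal sketch (assuming the data sit in a data frame df with a numeric response column y; those names are placeholders, not from the question) that simply leaves mtry at its regression default of floor(p/3):

    library(randomForest)

    ## Placeholder data frame `df` with response `y` and 10 predictors.
    ## mtry is not set, so the regression default floor(p/3) is used.
    set.seed(1)
    fit <- randomForest(y ~ ., data = df, ntree = 500)
    print(fit)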

But there is a function, tuneRF, in randomForest that will help you find an optimal mtry (for a given ntreeTry), as explained here:

setting values for ntree and mtry for random forest regression model
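A rough sketch of how the call might look, with X (predictors) and y (response) standing in as placeholders for the 300k x 10 data:

    library(randomForest)

    ## Sketch only -- X and y are placeholders, not objects from the question.
    set.seed(1)
    tune_res <- tuneRF(x = X, y = y,
                       mtryStart  = 3,     # starting mtry, roughly p/3 for regression
                       ntreeTry   = 100,   # small forest per candidate to keep tuning fast
                       stepFactor = 2,     # double/halve mtry at each step
                       improve    = 0.01,  # keep stepping while OOB error improves by 1%
                       doBest     = FALSE) # return the mtry vs. OOB-error table
    tune_res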

How long it takes you will have to test yourself; the total fitting time will be roughly the product of folds × tuning candidates × ntree.
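One way to estimate that (a sketch only, reusing the placeholder X and y from above) is to time a small forest on subsamples of increasing size and extrapolate, since runtime grows roughly linearly in ntree:

    library(randomForest)

    ## Time a 50-tree forest on progressively larger subsamples.
    set.seed(1)
    for (n in c(10000, 50000, 100000)) {
      idx <- sample(nrow(X), n)
      elapsed <- system.time(randomForest(X[idx, ], y[idx], ntree = 50))["elapsed"]
      cat(sprintf("%6d rows, 50 trees: %.1f sec\n", n, elapsed))
    }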

The only speculative point I would add is that, with 300,000 rows of data, you might reduce the runtime without losing predictive accuracy by growing each tree on a smaller bootstrap sample of the data.
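If you want to try that, the sampsize argument of randomForest draws a smaller sample for each tree; a speculative sketch with the same placeholder X and y (the value 20,000 is just an illustration):

    library(randomForest)

    ## Grow each tree on 20,000 bootstrapped rows instead of all 300,000,
    ## trading a little per-tree accuracy for many more trees per unit time.
    set.seed(1)
    fit_sub <- randomForest(X, y,
                            ntree    = 1000,
                            sampsize = 20000,
                            replace  = TRUE)
    fit_sub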

Stephen Henderson