
I have a large dataset in R (1M+ rows by 6 columns) that I want to use to train a random forest (using the randomForest package) for regression. Unfortunately, I get an `Error in matrix(0, n, n) : too many elements specified` error when trying to do the whole thing at once, and `cannot allocate enough memory` kind of errors when running it on subsets of the data, down to 10,000 or so observations.

Seeing that there is no chance I can add more RAM to my machine, and that random forests are well suited to the type of process I am trying to model, I'd really like to make this work.

Any suggestions or workaround ideas are much appreciated.

ktdrv
  • Run with `proximity = FALSE` as [joran](http://stackoverflow.com/users/324364/joran) suggested and tell us if it works. – smci Oct 29 '12 at 07:03
  • One relatively simple way around your problem would be to subset your input matrix. All that data probably won't give you a better model than one trained on a subset of, say, 10K rows. – Tim Biegeleisen Jan 15 '15 at 10:31
  • 1
    Did you have a look at library(h2o) ? That runs OK for very large problems, see http://www.r-bloggers.com/benchmarking-random-forest-implementations/ – Tom Wenseleers Aug 20 '15 at 18:50
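
For readers who want to try the h2o suggestion above, here is a minimal sketch. It is not from the original thread: the data frame `dat`, the response column name `y`, and the memory setting are all assumptions, and argument names may differ between h2o versions.

```r
## Sketch of a distributed random forest in h2o; `dat` and "y" are assumed names.
library(h2o)

h2o.init(max_mem_size = "4g")     # start a local H2O instance with a 4 GB heap

hdat <- as.h2o(dat)               # `dat`: your 1M x 6 data frame, response in column "y"

rf <- h2o.randomForest(
  x = setdiff(names(dat), "y"),   # predictor column names
  y = "y",                        # response column name
  training_frame = hdat,
  ntrees = 200
)

h2o.performance(rf)               # training/OOB performance metrics
```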

1 Answer


You're likely asking randomForest to create the proximity matrix for the data, which, if you think about it, will be insanely big: 1 million x 1 million. A matrix of doubles that size needs roughly 8 TB of RAM, and even a 10,000-row subset implies a 10,000 x 10,000 proximity matrix of about 800 MB, which matches the allocation failures you're seeing. A matrix this size would be required no matter how small you set `sampsize`. Indeed, simply Googling the error message seems to confirm this: the package author states that the only place in the entire source code where `n, n)` is found is in calculating the proximity matrix.

But it's hard to help more, given that you've provided no details about the actual code you're using.
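
As a hedged illustration only (the OP's actual code was never posted): the sketch below assumes the predictors sit in a data frame `X` and the numeric response in a vector `y`, and shows the two levers most relevant to this error, `proximity = FALSE` and a modest `sampsize`.

```r
## Sketch only: `X` (predictors) and `y` (numeric response) are assumed names.
library(randomForest)

set.seed(1)
fit <- randomForest(
  x = X, y = y,
  ntree       = 200,     # number of trees; raise once memory behaves
  sampsize    = 100000,  # rows drawn per tree, keeps each tree's working set small
  nodesize    = 5,       # default minimum node size for regression
  proximity   = FALSE,   # never build the n x n proximity matrix
  keep.forest = TRUE     # needed if you want to predict() on new data later
)

print(fit)               # OOB mean of squared residuals and % variance explained
```

If memory is still tight, randomForest's `combine()` can merge several forests trained on row subsets, and `keep.forest = FALSE` shrinks the returned object further when out-of-bag results are all you need.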

joran
  • I kind of arrived at the same conclusion, but I don't quite understand why it's needed, or whether there is some way of training the RF without it. – ktdrv Apr 06 '12 at 04:10
  • 1
    I'm not sure what you mean. Setting proximity = FALSE will prevent he proximities from being calculated. – joran Apr 06 '12 at 04:15
  • I just did a test and it's actually the forest itself that's huge. In my particular test case, `keep.forest=F` results in a 14 MB result, while `proximity=FALSE` made no difference either way: the result was 232 MB. (A size-comparison sketch follows these comments.) – Wayne Nov 12 '14 at 22:25
  • @Wayne The size of the forest object itself is a separate issue (and not what the OP asked about). The question asked about a specific error that was the result of the inability to allocate enough memory for a single matrix, and the only possible source of that specific error was the proximity matrix. But yes, setting `keep.forest = FALSE` will certainly drastically reduce the size of the resulting object. – joran Nov 12 '14 at 22:35
  • Now I remember when I had a problem similar to the OP's with `randomForest`: it was when using `randomForest` via `caret`. At some point, it wanted to allocate 21 GB -- so assuming the OP was running `randomForest` directly, that particular problem wouldn't apply. – Wayne Nov 13 '14 at 19:17
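
To make Wayne's observation concrete, here is a small, hypothetical way to compare object sizes yourself. The data names are assumptions carried over from the sketch above, and the sizes you see will differ from the figures quoted in the comments.

```r
## Hypothetical size check: `X` and `y` are assumed names; your numbers will differ.
library(randomForest)

fit_full  <- randomForest(x = X, y = y, ntree = 100, keep.forest = TRUE)
fit_light <- randomForest(x = X, y = y, ntree = 100, keep.forest = FALSE)

print(object.size(fit_full),  units = "MB")  # dominated by the stored trees
print(object.size(fit_light), units = "MB")  # OOB statistics only, much smaller
```

Note that with `keep.forest = FALSE` the fitted object cannot be used with `predict()` on new data; only the out-of-bag results are retained.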