
I've recently started using R for data analysis. Now I've got a problem ranking a big query dataset (~1 GB as ASCII, more than my laptop's 4 GB of RAM once loaded in binary form). Using bigmemory::big.matrix for this dataset is a nice solution, but passing such a matrix 'm' to the gbm() or randomForest() algorithms causes the error:

cannot coerce class 'structure("big.matrix", package = "bigmemory")' into a data.frame

class(m) outputs the following:

[1] "big.matrix"
attr(,"package")
[1] "bigmemory"

Is there a way to correctly pass a big.matrix instance into these algorithms?

Igor Shalyminov
    if other solutions fail, you might want to give Revolutions a try. I don't know if it supports `randomForest` or not, but I think they have support for large-memory requirements. See, for example, `http://www.revolutionanalytics.com/products/enterprise-big-data.php`. Note that it is proprietary software. There is a free academic version. – Xu Wang Nov 29 '11 at 18:30
  • 1
    Can you provide the actual `gbm` and `randomForest` calls you're using? Specifically, are you using the formula interface for `randomForest`? – joran Nov 29 '11 at 18:30

2 Answers


I obviously can't test this using data of your scale, but I can reproduce your errors by using the formula interface of each function:

require(bigmemory)
m <- matrix(sample(0:1,5000,replace = TRUE),1000,5)
colnames(m) <- paste("V",1:5,sep = "")

bm <- as.big.matrix(m,type = "integer")

require(gbm)
require(randomForest)

#Throws error you describe
rs <- randomForest(V1~.,data = bm)
#Runs without error (with a warning about the response only having two values)
rs <- randomForest(x = bm[,-1],y = bm[,1])

#Throws error you describe
rs <- gbm(V1~.,data = bm)
#Runs without error
rs <- gbm.fit(x = bm[,-1],y = bm[,1])

Avoiding the formula interface for randomForest is fairly common advice for large data sets, since the formula interface can be quite inefficient. If you read ?gbm, you'll see a similar recommendation steering you towards gbm.fit for large data as well.
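To see where the inefficiency comes from: the formula interface builds a model.frame behind the scenes, which materializes a full in-RAM copy of the columns it uses, while the x/y interface skips that step. A minimal base-R sketch (the variable names here are illustrative, not from the question):

```r
# The formula interface calls model.frame(), which creates an
# in-RAM copy of every column referenced by the formula.
d <- data.frame(V1 = rnorm(10), V2 = rnorm(10), V3 = rnorm(10))

mf <- model.frame(V1 ~ ., data = d)  # a full copy of the data
print(dim(mf))  # same rows/columns as d, held in memory again
```

With a 1 GB data set, that extra copy alone can push you past a 4 GB RAM budget, which is why the x/y calling convention is preferred here.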

joran
  • Can I convert a `data.frame` to `big.matrix` using `as.big.matrix`? Because when I convert I cannot access the elements of the converted 'big.matrix' as in: `> cp2006.big.matrix<-as.big.matrix(cp.2006) Warning message: In as.big.matrix(cp.2006) : Coercing data.frame to matrix via factor level numberings. > class(cp.2006) [1] "data.frame" > class(cp2006.big.matrix) [1] "big.matrix" attr(,"package") [1] "bigmemory" > cp2006.big.matrix An object of class "big.matrix" Slot "address": ` – Mona Jalal May 11 '14 at 20:38

Numeric objects often occupy more memory in RAM than the corresponding file does on disk: each "double" element in a vector or matrix takes 8 bytes. And when you coerce an object to a data.frame, it may need to be copied in RAM. You should avoid functions and data structures outside those supported by the bigmemory/big*** suite of packages. "biglm" is available, but I doubt that you can expect gbm() or randomForest() to recognize and use the facilities of the "big" family.
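The 8-bytes-per-double point is easy to verify with base R's object.size(), which reports an object's in-memory footprint (a minimal sketch):

```r
# Each double occupies 8 bytes, plus a small fixed header per object.
x <- numeric(1e6)      # one million doubles
print(object.size(x))  # roughly 8 MB for 1e6 elements

# A 1 GB ASCII file of numbers can therefore easily exceed 4 GB of RAM
# once parsed, especially if a coercion to data.frame copies the data.
```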

IRTFM