
I am working with a moderately large data set (train_data) that has more than 124 variables and 5,000,000 observations. For the categorical variables, I applied feature hashing using the hashed.model.matrix function in R.

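## feature hashing (hashed.model.matrix comes from the FeatureHashing package)
library(FeatureHashing)
b <- 2 ^ 22    # hash size: number of columns in the hashed matrix
f <- ~ . - 1   # use all variables, no intercept
X_train <- hashed.model.matrix(f, train_data, hash.size = b)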

As a result, I get a large dgCMatrix (a sparse matrix) as output (X_train). How can I use the H2O R wrapper on this matrix and run the different algorithms available in H2O? Does the H2O wrapper accept a sparse matrix (dgCMatrix)? Any link or example of such usage would be helpful. Thanks in anticipation.

I would like to import X_train into the H2O environment and then do the following kind of steps:

library(h2o)

# initialize connection to H2O server, using all available cores
h2o.init(nthreads = -1)
train.hex <- h2o.uploadFile('./X_train', destination_frame = 'train')

# list of features for training (every column except the response)
feature.names <- setdiff(names(train.hex), 'outcome')

# train random forest model, use ntrees = 500
drf <- h2o.randomForest(x = feature.names, y = 'outcome', training_frame = train.hex, ntrees = 500)
Harry

1 Answer


You could save your sparse matrix in SVMLight sparse format, then use:

train.hex <- h2o.uploadFile('./X_train', parse_type = "SVMLight", destination_frame='train')

The SVMLight sparse format is also detected automatically by h2o.importFile(). This is a parallelized reader: the H2O server pulls the data from a server-side location specified by the client, whereas h2o.uploadFile() pushes the file from the client to the server.

train.hex <- h2o.importFile('./X_train', destination_frame='train')
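
The step of saving the dgCMatrix in SVMLight format is not shown above. A minimal sketch of one way to do it with base R and the Matrix package follows. Note the assumptions: write_svmlight is a hypothetical helper (it is not part of h2o or Matrix), it expects a numeric sparse matrix such as a dgCMatrix plus a separate label vector y, and it builds every output line in memory, so for several million rows it would probably need to be applied in row chunks.

```r
library(Matrix)

# Hypothetical helper (not an h2o or Matrix function): write a numeric sparse
# matrix X and a label vector y as SVMLight text, one row per line:
#   <label> <col>:<value> <col>:<value> ...
write_svmlight <- function(X, y, path) {
  stopifnot(nrow(X) == length(y))
  trip <- summary(X)                 # triplets i (row), j (column), x (value), 1-based
  ord  <- order(trip$i, trip$j)      # sort by row, then by column index
  rows  <- trip$i[ord]
  pairs <- paste0(trip$j[ord], ":", trip$x[ord])
  # Collapse the "col:value" pairs of each row into one string
  by_row <- tapply(pairs, rows, paste, collapse = " ")
  feats  <- character(nrow(X))       # rows without non-zero entries stay ""
  feats[as.integer(names(by_row))] <- as.vector(by_row)
  writeLines(trimws(paste(y, feats)), path)
  invisible(path)
}

# Toy example: 3 x 3 sparse matrix with an empty second row
X_toy <- sparseMatrix(i = c(1, 1, 3), j = c(1, 3, 2),
                      x = c(2, 5, 1.5), dims = c(3, 3))
y_toy <- c(1, 0, 1)
toy_path <- tempfile(fileext = ".svmlight")
write_svmlight(X_toy, y_toy, toy_path)
print(readLines(toy_path))
```

For the data in the question, the call would look like write_svmlight(X_train, train_data$outcome, './X_train'), assuming the response is stored in a column named outcome and was left out of the hashing formula. The resulting file is what h2o.uploadFile() or h2o.importFile() would then read.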
Lauren

Yes, the second example should have said h2o.importFile; thanks for catching that. I'll edit it. – Lauren Aug 11 '16 at 17:26