I am training a set of GLM models using h2o where the very sparse training matrix (4 million x 50k) is the same but the response variable (y) is different for each model. The steps I am using are:

  1. training matrix is read as a 3-column pandas table (row_id, col_id, value) [time: <5s]
  2. scipy.sparse.csc_matrix is created using the table [time: <5s]
  3. train_h2o_orig = h2o.H2OFrame(csc_matrix)
  4. train in this loop
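import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

for y in cols:
    # append the response for this model as the last column
    train_h2o = train_h2o_orig.cbind(h2o.H2OFrame(y))
    train_h2o[-1] = train_h2o[-1].asfactor()  # binomial family needs a categorical response
    glm_h2o = H2OGeneralizedLinearEstimator(family="binomial", nfolds=4, nlambdas=20,
                              lambda_search=True, max_active_predictors=100, seed=12345)
    glm_h2o.train(y=train_h2o.names[-1], training_frame=train_h2o)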

Questions:

  1. Is there a version of the GLM training function where the training matrix and response vector can be provided separately (as H2OFrames), so that I do not have to cbind and copy frames around?

  2. The slowest step here is `h2o.H2OFrame(.)` (>30 mins). Is there a sparse matrix format that is more efficient (csc? coo? csr?)?

  3. In the past I have preferred writing an SVMLight file and reading it back. But with that I have to create 20 of those on disk and read them back. Is there a way of creating that file without the response variable?

Setup: 32cores, 512GB mem, RHEL7 (single user) / Python 3.6.9 / h2o 3.30.0.2 / jre 1.8.0_251

ironv

1 Answer

The answers to your questions:

1 - The response vector will need to be part of the H2OFrame.

2 and 3 - h2o.import_file is the efficient way to create H2OFrames. It is best to use an SVMLight file, as that is what is supported for sparse datasets.
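Since each SVMLight line starts with a label, one possible workaround is to write the file once with a placeholder label and attach the real responses after import. Below is a minimal sketch of such a writer, working directly from the (row_id, col_id, value) triplets described in the question. The function name `write_svmlight`, the 0-based input ids, the 1-based output indices and the placeholder label are all assumptions for illustration; how h2o.import_file treats the index base and the placeholder column should be checked on a small file first.

```python
from collections import defaultdict
import io


def write_svmlight(triplets, n_rows, out, label=0):
    """Write (row_id, col_id, value) triplets as SVMLight text.

    Assumptions (not taken from the H2O docs): row_id and col_id are
    0-based, feature indices are written 1-based, and every row gets
    the same placeholder label.
    """
    rows = defaultdict(list)
    for r, c, v in triplets:
        if v != 0:  # SVMLight stores only non-zero entries
            rows[r].append((c, v))
    for r in range(n_rows):
        # Feature indices must be increasing within a line.
        feats = " ".join(f"{c + 1}:{v}" for c, v in sorted(rows[r]))
        out.write(f"{label} {feats}".rstrip() + "\n")


# Tiny usage example: 3 rows, row 1 has no non-zero entries.
buf = io.StringIO()
write_svmlight([(0, 0, 1.0), (0, 3, 2.5), (2, 1, 4.0)], 3, buf)
print(buf.getvalue())
```

For a 4 million row matrix a plain Python loop like this will be slow; it is meant to show the file layout, not to be the fastest way to produce it.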

  • Right, but are there any efficiencies to be gained from the fact that the predictors are not changing? One should not have to read the entire matrix every time, which is what will happen with the SVMLight import. – ironv May 07 '20 at 22:34
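
One way to reuse the unchanged predictors, sketched here under assumptions rather than as a confirmed H2O recommendation: load the predictors once, bind all response columns to that frame once, and pick the response per model through the `x` and `y` arguments of `train`. The helper names `predictor_names` and `train_all` and the `responses_h2o` frame are hypothetical, and the sketch assumes the response columns have names distinct from the predictor columns.

```python
def predictor_names(all_names, response_names):
    """Return the column names that are not responses, in original order."""
    response_set = set(response_names)
    return [n for n in all_names if n not in response_set]


def train_all(train_h2o_orig, responses_h2o):
    """Train one binomial GLM per response column on a single shared frame.

    train_h2o_orig: H2OFrame holding only the predictors.
    responses_h2o:  H2OFrame with one column per response (hypothetical,
                    assumed to be built once outside this function).
    """
    # Imported lazily so predictor_names() works without H2O installed.
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    # Bind every response once instead of once per model.
    full = train_h2o_orig.cbind(responses_h2o)
    x = predictor_names(full.names, responses_h2o.names)
    models = {}
    for y in responses_h2o.names:
        full[y] = full[y].asfactor()  # binomial family needs a categorical response
        glm = H2OGeneralizedLinearEstimator(
            family="binomial", nfolds=4, nlambdas=20, lambda_search=True,
            max_active_predictors=100, seed=12345)
        # x limits the model to the predictors, so the other responses are ignored.
        glm.train(x=x, y=y, training_frame=full)
        models[y] = glm
    return models
```

With this layout the response is still part of the H2OFrame, as the answer requires, but the large predictor matrix is imported only once.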