I am training a set of GLM models using h2o where the very sparse training matrix (4 million x 50k) is the same but the response variable (y) is different for each model. The steps I am using are:
- training matrix is read as a 3col pandas table (row_id, col_id, value) [time: <5s]
- `scipy.sparse.csc_matrix` is created using the table [time: <5s]
- `train_h2o_orig = h2o.H2OFrame(csc_matrix)`
- train in this loop:

      for y in cols:
          train_h2o = train_h2o_orig.cbind(h2o.H2OFrame(y))
          train_h2o[-1] = train_h2o[-1].asfactor()
          glm_h2o = H2OGeneralizedLinearEstimator(family="binomial", nfolds=4, nlambdas=20,
                                                  lambda_search=True, max_active_predictors=100, seed=12345)
          glm_h2o.train(y=train_h2o.names[-1], training_frame=train_h2o)
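For concreteness, the first two steps (triplet table to `csc_matrix`) amount to something like the sketch below. The column names `row_id`, `col_id` and `value` are the ones from the table described above; the tiny inline data, the shape, and the assumption that the ids are 0-based are made up for illustration.

```python
import pandas as pd
from scipy.sparse import csc_matrix

# Toy stand-in for the real (row_id, col_id, value) table; values are illustrative.
triplets = pd.DataFrame({
    "row_id": [0, 0, 2, 3],
    "col_id": [1, 4, 0, 4],
    "value": [1.0, 2.5, 3.0, 0.5],
})

# Illustrative shape; the real matrix is roughly 4 million x 50k.
n_rows, n_cols = 4, 5

# Build the sparse matrix straight from the triplets (ids assumed to be 0-based).
train_csc = csc_matrix(
    (
        triplets["value"].to_numpy(),
        (triplets["row_id"].to_numpy(), triplets["col_id"].to_numpy()),
    ),
    shape=(n_rows, n_cols),
)
```

This part is fast; the time goes into the `h2o.H2OFrame(...)` conversion that follows it.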
Questions:
- Is there a version of the GLM model training function where the training matrix and response vector can be provided separately (as `H2OFrame`s), so that I do not have to cbind and copy frames around?
- The slowest step here is the `h2o.H2OFrame(.)` call (>30 mins). Is there a sparse matrix format which is more efficient (csc? coo? csr?)?
- In the past I have preferred writing an SVMLight file and reading it back. But with that I have to create 20 of those on disk and read them back. Is there a way of creating that file without the response variable?
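To make the last question concrete, a features-only SVMLight file could be written roughly as in the sketch below. `write_svmlight` is a hypothetical helper, not part of h2o or scipy. Since each SVMLight line normally starts with a label, the sketch writes a constant `0` as a placeholder and uses 1-based feature indices; whether h2o's parser accepts exactly this layout is an assumption I have not verified.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix


def write_svmlight(path, X, y=None):
    """Write sparse X to `path` in SVMLight format (hypothetical helper).

    If y is None, a constant 0 is written as a placeholder label.
    Feature indices are written 1-based (assumed convention).
    """
    X = csr_matrix(X)
    X.sort_indices()  # keep feature indices increasing within each line
    labels = np.zeros(X.shape[0], dtype=int) if y is None else np.asarray(y)
    with open(path, "w") as fh:
        for i in range(X.shape[0]):
            start, end = X.indptr[i], X.indptr[i + 1]
            feats = " ".join(
                f"{j + 1}:{float(v)!r}"
                for j, v in zip(X.indices[start:end], X.data[start:end])
            )
            fh.write(f"{labels[i]} {feats}".rstrip() + "\n")


# Tiny illustrative matrix written to a temporary directory.
demo = csr_matrix(np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 3.0]]))
demo_path = os.path.join(tempfile.mkdtemp(), "features_only.svm")
write_svmlight(demo_path, demo)
```

This would mean writing the file once instead of 20 times, but it still leaves open how to attach each real response to the parsed frame without the `cbind` from the loop above.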
Setup: 32cores, 512GB mem, RHEL7 (single user) / Python 3.6.9 / h2o 3.30.0.2 / jre 1.8.0_251