
I am running glmnet, favoring lasso regression, on a 16-core machine. I have about 800K rows with around 2K columns in sparse matrix format, and the model should be trained to predict the probability given in the first column.

This process has become very slow. Is there a way to speed it up, either by parallelizing over the nfolds cross-validation folds, or by selecting a smaller number of rows without affecting accuracy? Is either possible, and if so, which would be better?


1 Answer


The process can be sped up with parallelization. As explained in the comment link above, running glmnet in parallel in R is done by setting the parallel=TRUE option in cv.glmnet(), after registering the number of cores:

library(doParallel)
registerDoParallel(5)  # register 5 worker cores for the foreach backend

m <- cv.glmnet(x, y, family = "binomial", alpha = 0.7, type.measure = "auc",
               grouped = FALSE, standardize = FALSE, parallel = TRUE)
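
Note that with parallel=TRUE, cv.glmnet() distributes the cross-validation folds across workers (nfolds = 10 by default), so on the 16-core machine there is little to gain from registering more workers than there are folds.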

Whether to reduce the number of rows is more of a judgement call, based on the AUC on a held-out test set. If the AUC is above your threshold and subsampling the rows does not change it, then it is certainly a good idea.
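
As a minimal sketch of that check (not part of the original answer), assuming x and y are the full sparse matrix and binary response from the question, and that the 200K-row subsample size is a hypothetical choice; assess.glmnet() is available in recent versions of glmnet:

library(glmnet)

set.seed(42)
n        <- nrow(x)
test_idx <- sample(n, floor(0.2 * n))   # hold out 20% of rows as a test set
train    <- setdiff(seq_len(n), test_idx)
sub_idx  <- sample(train, 200000)       # hypothetical 200K-row training subsample

# fit on the subsample only
fit_sub <- cv.glmnet(x[sub_idx, ], y[sub_idx], family = "binomial",
                     alpha = 0.7, type.measure = "auc", parallel = TRUE)

# test-set AUC of the subsampled fit; compare this against your threshold
assess.glmnet(fit_sub, newx = x[test_idx, ], newy = y[test_idx])$auc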
