Minimum number of rows in data set for accurate predictions

Question

I am running glmnet, favoring lasso regression, on a 16-core machine. I have some 800K rows and around 2K columns in sparse matrix format, and the model should be trained to predict the probability in the first column.

This process has become very slow. Is there a way to speed it up, either by parallelizing over nfolds or by selecting a smaller number of rows without affecting the accuracy? Is it possible? If so, which would be better?

Take a look at : http://stackoverflow.com/questions/21698435/executing-glmnet-in-parallel-in-r — Joris Meys, Sep 03 '14 at 13:16

score 1 · Accepted Answer · edited May 23 '17 at 12:28

Parallelization can speed this up. As explained in the question linked in the comment above (executing glmnet in parallel in R), cv.glmnet() fits its cross-validation folds in parallel when you set parallel=TRUE, provided you have first registered a parallel backend with the number of cores you want, like this:

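library(glmnet)
library(doParallel)
# Register 5 workers; cv.glmnet fits one fold per worker via foreach
registerDoParallel(5)
# alpha = 0.7 is an elastic net weighted toward the lasso penalty
m <- cv.glmnet(x, y, family="binomial", alpha=0.7, type.measure="auc",
               grouped=FALSE, standardize=FALSE, parallel=TRUE)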

Reducing the number of rows is more of a judgement call, based on the AUC on a held-out test set. If the AUC is above your acceptable threshold and training on fewer rows does not lower it noticeably, then subsampling is certainly a good idea.
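
One way to make that judgement call is a rough learning curve: fit on increasing fractions of the training rows and compare the AUC on a fixed test set. The following is only a minimal sketch on simulated data. The sizes, the fractions, the simulated coefficients and the auc() helper are all assumptions for illustration, not part of the original setup, and the data is far smaller than 800K x 2K so that it runs quickly.

```r
library(glmnet)
library(Matrix)

set.seed(1)

# Simulated stand-in for the real sparse data (hypothetical sizes)
n <- 5000
p <- 50
x <- rsparsematrix(n, p, density = 0.1)
beta <- c(rep(1, 5), rep(0, p - 5))   # assumed "true" coefficients
y <- rbinom(n, 1, plogis(drop(as.matrix(x %*% beta))))

# Hold out a fixed test set that is never subsampled
test_idx  <- sample(n, 1000)
train_idx <- setdiff(seq_len(n), test_idx)

# Rank-based (Mann-Whitney) AUC; hypothetical helper, not a glmnet function
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Fit on growing fractions of the training rows, score on the same test set
fractions <- c(0.1, 0.25, 0.5, 1)
test_auc <- sapply(fractions, function(f) {
  idx  <- sample(train_idx, floor(f * length(train_idx)))
  fit  <- cv.glmnet(x[idx, ], y[idx], family = "binomial", alpha = 0.7,
                    type.measure = "auc")
  pred <- predict(fit, newx = x[test_idx, ], s = "lambda.min",
                  type = "response")
  auc(as.numeric(pred), y[test_idx])
})

print(data.frame(fraction = fractions, test_auc = test_auc))
```

If the test AUC stops improving well before fraction 1, training on that smaller fraction is probably enough. Repeating each fraction with a few different seeds would show how much of the difference is just sampling noise.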
