I am trying to use mclapply to parallelize cross-validation for a model-fitting procedure on a very large design matrix X (~10GB) and a response vector y. Say X is n-by-p with n=1000 and p=1,000,000. Since X is so large, it is backed as a big.matrix object, stored on disk, and accessed using methods from the R package bigmemory.
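For context, the matrix is created and re-attached roughly like this (the dimensions, file names, and path are just placeholders, not my actual ones):

library(bigmemory)
## create a file-backed big.matrix on disk (illustrative names only)
X <- filebacked.big.matrix(nrow = 1000, ncol = 1e6, type = 'double',
                           backingfile = 'X.bin',
                           descriptorfile = 'X.desc',
                           backingpath = getwd())
## later sessions / worker processes re-attach via the descriptor file
X <- attach.big.matrix(file.path(getwd(), 'X.desc'))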
The workflow for a 4-fold cross-validation is as follows.
- Set up an index vector cv.ind of length n, storing a sequence of numbers from 1 to 4 that indicates which fold each observation in X belongs to (see the sketch after this list).
- Set up 4 cores. On the i-th core, copy the corresponding training and test sub-matrices for the i-th fold.
- Fit a model for each fold on its core.
- Collect the results and return.
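For concreteness, the fold-assignment vector can be built like this (an illustrative sketch only; the construction of cv.ind is not shown in the code below):

set.seed(1234)                                  # reproducible fold assignment
nfolds <- 4
n <- 1000
## randomly assign each of the n observations to one of the nfolds folds
cv.ind <- sample(rep(seq_len(nfolds), length.out = n))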
The cross-validation function looks like the following.
cv.ncvreg <- function(X, y, ncore, nfolds=5, seed, cv.dir = getwd(),
cv.ind) {
## some more setup ...
## ...
## ...
## pass the descriptor info to each core ##
xdesc <- describe(X)
## use mclapply instead of parLapply
fold.results <- parallel::mclapply(X = 1:nfolds, FUN = cvf, XX=xdesc, y=y,
cv.dir = cv.dir, cv.ind=cv.ind,
cv.args=cv.args,
mc.set.seed = seed, mc.silent = F,
mc.cores = ncore, mc.preschedule = F)
## return results
}
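For reference, I call this function roughly like so (the argument values here are illustrative placeholders):

cvfit <- cv.ncvreg(X = X, y = y, ncore = 4, nfolds = 4, seed = 1234,
                   cv.dir = getwd(), cv.ind = cv.ind)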
The R function cvf is run on each core. It copies the training and test matrices for the i-th fold as two big.matrix objects, fits a model, computes some statistics, and returns the results.
cvf <- function(i, XX, y, cv.dir, cv.ind, cv.args) {
## argument 'XX' is the descriptor for big.matrix
# reference to the big.matrix by descriptor info
XX <- attach.big.matrix(XX)
cat("CV fold #", i, "\t--Copy training-- Start time: ", format(Sys.time()), "\n\n")
## physically copy sub big.matrix for training
idx.train <- which(cv.ind != i) ## get row idx for i-th fold training
deepcopy(XX, rows = idx.train, type = 'double',
backingfile = paste0('x.cv.train_', i, '.bin'),
descriptorfile = paste0('x.cv.train_', i, '.desc'),
backingpath = cv.dir)
cv.args$X <- attach.big.matrix(file.path(cv.dir, paste0('x.cv.train_', i, '.desc')))
cat("CV fold #", i, "\t--Copy training-- End time: ", format(Sys.time()), "\n\n")
cat("CV fold #", i, "\t--Copy test-- Start time: ", format(Sys.time()), "\n\n")
## physically copy remaining part of big.matrix for testing
idx.test <- which(cv.ind == i) ## get row idx for i-th fold testing
deepcopy(XX, rows = idx.test, type = 'double',
backingfile = paste0('x.cv.test_', i, '.bin'),
descriptorfile = paste0('x.cv.test_', i, '.desc'),
backingpath = cv.dir)
X2 <- attach.big.matrix(file.path(cv.dir, paste0('x.cv.test_', i, '.desc')))
cat("CV fold #", i, "\t--Copy test-- End time: ", format(Sys.time()), "\n\n")
# cv.args$X <- XX[cv.ind!=i, , drop=FALSE]
cv.args$y <- y[cv.ind!=i]
cv.args$warn <- FALSE
cat("CV fold #", i, "\t--Fit ncvreg-- Start time: ", format(Sys.time()), "\n\n")
## call 'ncvreg' function, fit penalized regression model
fit.i <- ncvreg(X = cv.args$X, y = cv.args$y, family = cv.args$family,
                penalty = cv.args$penalty, lambda = cv.args$lambda, convex = cv.args$convex)
# fit.i <- do.call("ncvreg", cv.args)
cat("CV fold #", i, "\t--Fit ncvreg-- End time: ", format(Sys.time()), "\n\n")
y2 <- y[cv.ind==i]
yhat <- matrix(predict(fit.i, X2, type="response"), length(y2))
loss <- loss.ncvreg(y2, yhat, fit.i$family)
pe <- if (fit.i$family=="binomial") {(yhat < 0.5) == y2} else NULL
list(loss=loss, pe=pe, nl=length(fit.i$lambda), yhat=yhat)
}
Up to this point, the code works very well when the design matrix X is not too large, say n=1000, p=100,000 (~1GB). However, when p=1,000,000, so that X is ~10GB, the model-fitting step in each core runs practically forever (the following portion):
#...
cat("CV fold #", i, "\t--Fit ncvreg-- Start time: ", format(Sys.time()), "\n\n")
## call 'ncvreg' function, fit penalized regression model
fit.i <- ncvreg(X = cv.args$X, y = cv.args$y, family = cv.args$family,
                penalty = cv.args$penalty, lambda = cv.args$lambda, convex = cv.args$convex)
# fit.i <- do.call("ncvreg", cv.args)
cat("CV fold #", i, "\t--Fit ncvreg-- End time: ", format(Sys.time()), "\n\n")
#...
Notes:
- If I run ncvreg() once on the raw 10GB matrix, it takes ~2.5 minutes.
- If I run the cross-validation sequentially with a for loop instead of mclapply, everything works: the model fitting for each fold via ncvreg() is fine (~2 minutes per fold), although the whole process takes ~25 minutes (roughly the loop sketched after these notes).
- I tried parLapply at first and hit the same issue; I switched to mclapply for the reason here.
- The data-copy step (i.e., the deepcopy portion) in each core works well and takes ~2 minutes to copy and file-back the training and test data sets on disk.
- I monitored the CPU usage; below is one screenshot. In the left panel, each of the 4 rsession processes uses ~25% CPU, while a kernel_task process uses ~100% CPU, rising to 150% as time elapses. The CPU history panel (bottom right) shows that most of the CPU usage is system, not user, since the red area dominates the green area.
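For reference, the sequential version mentioned above looks roughly like this (a simplified sketch that reuses the same cvf, xdesc, cv.ind, and cv.args as in the parallel code):

## run the folds one at a time instead of via mclapply
fold.results <- vector('list', nfolds)
for (i in seq_len(nfolds)) {
  fold.results[[i]] <- cvf(i, XX = xdesc, y = y,
                           cv.dir = cv.dir, cv.ind = cv.ind, cv.args = cv.args)
}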
My questions:
- Why does the model-fitting step take forever when run in parallel? What are the possible reasons?
- Am I on the right track in parallelizing the CV procedure over a big.matrix? Are there alternative approaches?
I appreciate any insights that could help fix this issue. Thanks in advance!