I am trying to use mclapply to parallelize cross-validation for a model-fitting procedure on a very large design matrix X (~10 GB) and a response vector y. Say X is of dimension n-by-p with n = 1,000 and p = 1,000,000. Since X is far too large to hold in memory, it is stored on disk as a file-backed big.matrix object and accessed using the methods in the R package bigmemory.

The workflow for 4-fold cross-validation is as follows.

  1. Set up an index vector cv.ind of length n whose entries (1 to 4) record which CV fold each observation (row) of X belongs to (a minimal sketch of this step follows the list).
  2. Set up 4 cores. On the i-th core, copy the training and test sub-matrices for the i-th fold.
  3. Fit a model for each fold on its core.
  4. Collect and return the results.
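
For concreteness, here is a minimal sketch of that fold-assignment step (the random, roughly balanced assignment shown is just one possible scheme):

## sketch of step 1: assign each of the n observations to one of 4 CV folds
set.seed(1234)                       # any seed, just for reproducibility
n      <- 1000
nfolds <- 4
cv.ind <- sample(rep(seq_len(nfolds), length.out = n))
table(cv.ind)                        # roughly n/4 observations per fold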

The cross-validation function looks like the following.

cv.ncvreg <- function(X, y, ncore, nfolds=5, seed, cv.dir = getwd(), 
                      cv.ind) {
  ## some more setup ...
  ## ...
  ## ...
  ## pass the descriptor info to each core ##
  xdesc <- describe(X)
  ## use mclapply instead of parLapply
  fold.results <- parallel::mclapply(X = 1:nfolds, FUN = cvf, XX = xdesc, y = y,
                                     cv.dir = cv.dir, cv.ind = cv.ind,
                                     cv.args = cv.args,
                                     mc.set.seed = seed, mc.silent = FALSE,
                                     mc.cores = ncore, mc.preschedule = FALSE)

  ## return results
}
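
For context, the descriptor mechanism that cv.ncvreg relies on works roughly as follows; this is only a small illustration with placeholder file names, not part of my actual code.

library(bigmemory)

## create a small file-backed big.matrix (placeholder file names)
X <- filebacked.big.matrix(nrow = 1000, ncol = 100, type = 'double',
                           backingfile = 'x.bin',
                           descriptorfile = 'x.desc',
                           backingpath = getwd())
X[, 1] <- rnorm(1000)

## the descriptor is a lightweight R object; a worker re-attaches to the
## same on-disk data through it instead of receiving a copy of X itself
xdesc    <- describe(X)
X.worker <- attach.big.matrix(xdesc)
identical(dim(X), dim(X.worker))     # TRUE: both point to the same backing file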

The function cvf runs on each core. It copies the training/test matrices for the i-th fold as two big.matrix objects, fits a model, computes some statistics, and returns the results.

cvf <- function(i, XX, y, cv.dir, cv.ind, cv.args) {
  ## argument 'XX' is the descriptor for big.matrix
  # reference to the big.matrix by descriptor info
  XX <- attach.big.matrix(XX)

  cat("CV fold #", i, "\t--Copy training-- Start time: ", format(Sys.time()), "\n\n")
  ## physically copy sub big.matrix for training
  idx.train <- which(cv.ind != i) ## get row idx for i-th fold training
  deepcopy(XX, rows = idx.train, type = 'double',
           backingfile = paste0('x.cv.train_', i, '.bin'),
           descriptorfile = paste0('x.cv.train_', i, '.desc'),
           backingpath = cv.dir)
  cv.args$X <- attach.big.matrix(file.path(cv.dir, paste0('x.cv.train_', i, '.desc')))
  cat("CV fold #", i, "\t--Copy training-- End time: ", format(Sys.time()), "\n\n")

  cat("CV fold #", i, "\t--Copy test-- Start time: ", format(Sys.time()), "\n\n")
  ## physically copy remaining part of big.matrix for testing
  idx.test <- which(cv.ind == i) ## get row idx for i-th fold testing
  deepcopy(XX, rows = idx.test, type = 'double',
           backingfile = paste0('x.cv.test_', i, '.bin'),
           descriptorfile = paste0('x.cv.test_', i, '.desc'),
           backingpath = cv.dir)
  X2 <- attach.big.matrix(file.path(cv.dir, paste0('x.cv.test_', i, '.desc')))
  cat("CV fold #", i, "\t--Copy test-- End time: ", format(Sys.time()), "\n\n")

  # cv.args$X <- XX[cv.ind!=i, , drop=FALSE]
  cv.args$y <- y[cv.ind!=i]
  cv.args$warn <- FALSE

  cat("CV fold #", i, "\t--Fit ncvreg-- Start time: ", format(Sys.time()), "\n\n")
  ## call 'ncvreg' function, fit penalized regression model
  fit.i <- ncvreg(X = cv.args$X, y = cv.args$y, family = cv.args$family,
                  penalty = cv.args$penalty, lambda = cv.args$lambda,
                  convex = cv.args$convex)

  # fit.i <- do.call("ncvreg", cv.args)
  cat("CV fold #", i, "\t--Fit ncvreg-- End time: ", format(Sys.time()), "\n\n")

  y2 <- y[cv.ind==i]
  yhat <- matrix(predict(fit.i, X2, type="response"), length(y2))

  loss <- loss.ncvreg(y2, yhat, fit.i$family)
  pe <- if (fit.i$family=="binomial") {(yhat < 0.5) == y2} else NULL
  list(loss=loss, pe=pe, nl=length(fit.i$lambda), yhat=yhat)
}
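
For debugging, the stages can be timed in isolation by calling cvf directly for a single fold, outside of mclapply (this assumes X, y, cv.ind, cv.dir, and cv.args have been set up as in cv.ncvreg above):

## exercise one fold without forking, to time the pieces in isolation
xdesc <- describe(X)
t1 <- system.time(
  res1 <- cvf(i = 1, XX = xdesc, y = y, cv.dir = cv.dir,
              cv.ind = cv.ind, cv.args = cv.args)
)
print(t1)
str(res1)   # list with elements loss, pe, nl, yhat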

Up to this point, the code works very well when the design matrix X is not too large, say n = 1,000 and p = 100,000 with size ~1 GB. However, when p = 1,000,000 and X grows to ~10 GB, the model-fitting step in each core runs essentially forever (the following portion):

#...
cat("CV fold #", i, "\t--Fit ncvreg-- Start time: ", format(Sys.time()), "\n\n")
## call 'ncvreg' function, fit penalized regression model
fit.i <- ncvreg(X = cv.args$X, y = cv.args$y, family = cv.args$family,
                penalty = cv.args$penalty, lambda = cv.args$lambda,
                convex = cv.args$convex)
# fit.i <- do.call("ncvreg", cv.args)
cat("CV fold #", i, "\t--Fit ncvreg-- End time: ", format(Sys.time()), "\n\n")
#...

Notes:

  1. If I run 'ncvreg()' once on the full raw matrix (10 GB), it takes ~2.5 minutes.
  2. If I run the cross-validation sequentially with a for loop instead of mclapply (the loop is sketched below, after the screenshot), everything works: the model fitting for each fold via 'ncvreg()' is fine (~2 minutes), though the whole process takes ~25 minutes.
  3. I tried 'parLapply' at first and hit the same issue; I switched to 'mclapply' for the reason described here.
  4. The data-copy step (i.e., the deepcopy portion) in each core works well and takes ~2 minutes to copy the training and test sets and file-back them on disk.
  5. I monitored the CPU usage; one screenshot is below. In the left panel, each of the 4 rsession processes uses ~25% CPU, while a process called kernel_task uses ~100% CPU, and as time elapses kernel_task can reach 150%. The CPU history panel (bottom right) shows that most of the CPU usage is system time rather than user time, since the red area dominates the green area.

[Screenshot: Activity Monitor with four rsession processes at ~25% CPU each, kernel_task near 100% CPU, and a CPU history panel dominated by system (red) rather than user (green) time.]
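
For completeness, the sequential run mentioned in note 2 is essentially the following loop over the same cvf (again assuming the same objects as above):

## sequential version from note 2: same cvf, no forking
xdesc <- describe(X)
fold.results <- vector("list", nfolds)
for (i in seq_len(nfolds)) {
  fold.results[[i]] <- cvf(i, XX = xdesc, y = y, cv.dir = cv.dir,
                           cv.ind = cv.ind, cv.args = cv.args)
}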


My questions:

  1. Why does the model-fitting step take forever when run in parallel? What are the possible reasons?
  2. Am I on the right track for parallelizing the CV procedure over a big.matrix? Are there any alternative approaches?

I appreciate any insights that help fix this issue. Thanks in advance!

  • Ideally you would be using `sub.big.matrix` as I suggested before on previously permuted data, but whatever you need to do. By running `deepcopy` you create a copy of that object in memory. I suspect it slows down in parallel because you are copying all 4 at the same time and R isn't able to handle the memory efficiently. Try benchmarking the different steps to find out which step it is hanging on and report back. – cdeterman Dec 01 '15 at 14:20
  • Thanks for the comment, @cdeterman. Actually, ``deepcopy`` does work! ``deepcopy`` makes a physical copy of the data and allows file-backing on disk, so I don't need to copy the object in memory. Based on my experiments, the ``deepcopy`` step works well and takes ~2 mins. It's the model-fitting step in each fold that runs forever! – SixSigma Dec 01 '15 at 15:58
  • If you look at the ``deepcopy`` portion of the code, what I did is copy the data and have it file-backed on disk by providing the directory and the backing-file names. – SixSigma Dec 01 '15 at 16:00
  • In terms of ``sub.big.matrix``, I've compared it with ``deepcopy``. I found that ``deepcopy`` is actually faster than ``sub.big.matrix`` for creating a sub-matrix out of a ``big.matrix`` object when the sub-matrix is itself very large and cannot fit into memory, not to mention that ``sub.big.matrix`` requires the sub-matrix to be contiguous. – SixSigma Dec 01 '15 at 16:07
  • Do you have another implementation of `ncvreg`? Regardless of the CV method, the function from the `ncvreg` package does not accept `big.matrix` objects. Your code as it stands is not reproducible. – cdeterman Dec 01 '15 at 19:40
  • Thanks for pointing that out, @cdeterman. Yes, I have extended the ``ncvreg`` function so that it can accept ``big.matrix`` objects. – SixSigma Dec 01 '15 at 19:43
  • You should provide that as well. Given that you say that is the part that takes longer, you should provide it. – cdeterman Dec 01 '15 at 19:45
  • Or at least just show the part you altered, given that the function is rather verbose. – cdeterman Dec 01 '15 at 19:51
  • Yes, as you mentioned, the function ``ncvreg`` is a little messy since it calls some C-level functions to do the tedious computation. Basically, what I did was write counterpart functions at the C++ level which use the ``bigmemory`` API to handle the computation on ``big.matrix`` objects. – SixSigma Dec 02 '15 at 04:21
  • So wherever a C function is called by ``ncvreg``, you can imagine there is a C++ counterpart that can handle a ``big.matrix`` object. Despite this, all the R-level code is pretty much the same. – SixSigma Dec 02 '15 at 04:25
  • @cdeterman, I am not sure whether that's helpful for you, but if you need to go through the code, I will push it to GitHub soon, though I need to clean it up a little bit. Thanks for your interest in my question. – SixSigma Dec 02 '15 at 04:27
