I tried reporting a bug I was running into with mclapply, where large return values are not allowed.
Apparently the bug has been fixed in development versions of R, but I'm more interested in the comment the responder made:
there was a 2GB limit on the size of serialized objects which e.g. mclapply can return from the forked processes and this example is attempting 16GB. That has been lifted (for 64-bit builds) in R-devel, but such usage is very unusual and rather inefficient (the example needs ca 150GB because of all the copies involved in (un)serialization)
If using mclapply for parallel computation with large data is inefficient, then what is a better way to do it? My need for this kind of thing is only increasing, and I'm definitely running into bottlenecks everywhere. The tutorials I've seen have been fairly basic introductions to how to use the functions, but not to how to use them effectively or manage the trade-offs. The documentation has a small blurb on this trade-off:
mc.preschedule: if set to ‘TRUE’ then the computation is first divided to (at most) as many jobs are there are cores and then the jobs are started, each job possibly covering more than one value. If set to ‘FALSE’ then one job is forked for each value of ‘X’. The former is better for short computations or large number of values in ‘X’, the latter is better for jobs that have high variance of completion time and not too many values of ‘X’ compared to ‘mc.cores’
and
By default (‘mc.preschedule = TRUE’) the input ‘X’ is split into as many parts as there are cores (currently the values are spread across the cores sequentially, i.e. first value to core 1, second to core 2, ... (core + 1)-th value to core 1 etc.) and then one process is forked to each core and the results are collected.
Without prescheduling, a separate job is forked for each value of ‘X’. To ensure that no more than ‘mc.cores’ jobs are running at once, once that number has been forked the master process waits for a child to complete before the next fork
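To make the trade-off concrete, here is a small sketch of my own (not from the docs; the task lengths are made up) comparing the two scheduling modes:

library(parallel)

# 16 tasks, one much slower than the rest
task_times = c(8, rep(1, 15))

# Prescheduled: the tasks are dealt out to the cores up front, so whichever
# core draws the slow task also has to work through its share of short ones
system.time(mclapply(task_times, Sys.sleep, mc.cores = 4, mc.preschedule = TRUE))

# One fork per task: idle cores keep picking up the remaining short tasks
# while the slow one runs, so the slow task alone sets the total time
system.time(mclapply(task_times, Sys.sleep, mc.cores = 4, mc.preschedule = FALSE))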
Benchmarking these things reliably takes a lot of time since some problems only manifest themselves at scale, and then it is hard to figure out what is going on. So having better insight into the behavior of the functions would be helpful.
edit:
I don't have a specific example, because I use mclapply a lot and want a better sense of how to think about its performance implications. And while writing to disk would get around the error, I don't think it helps with the (de)serialization that has to occur, which would then also have to go through disk I/O.
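For reference, the disk-based variant I have in mind looks roughly like this (compute_chunk is just a placeholder for the real work); the big objects still get serialized, only to files instead of through the fork's pipe:

library(parallel)

# Each worker writes its (potentially large) result to a temporary RDS file
# and returns only the path, so the big object is not pushed back through
# the fork's serialization pipe -- but it is still serialized, to disk
res_paths = mclapply(seq_len(100), function(i) {
  big_result = compute_chunk(i)     # placeholder for the real work
  path = file.path(tempdir(), sprintf("chunk_%03d.rds", i))
  saveRDS(big_result, path)
  path
}, mc.cores = 8)

# the master then reads the pieces back, paying the deserialization cost again
results = lapply(res_paths, readRDS)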
One workflow would be as follows:
Take a large sparse matrix M, and write it to disk in chunks (say M1-M100) because M itself does not fit in memory.

Now say that for each user i in I there are Ci columns in M that I want to add up and aggregate at the user level. With smaller data, this would be relatively trivial:
# toy data: a 5x5 matrix and a lookup of (user, column) pairs
m = matrix(runif(25), ncol=5)
df = data.frame(I=sample(1:6, 20, replace=TRUE), C=sample(1:5, 20, replace=TRUE))
somefun = function(m) rowSums(m)
# for each user, sum up the columns of m associated with that user
# (drop=FALSE keeps a one-column selection a matrix so rowSums still works)
res = sapply(sort(unique(df$I)), function(i) somefun(m[, df[df$I == i,]$C, drop=FALSE]))
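(The naive parallel version just swaps sapply for mclapply; it is the per-user results coming back from the forks that get serialized, which is where the overhead the responder mentions starts to matter once m is large:)

library(parallel)
res = simplify2array(mclapply(sort(unique(df$I)),
                              function(i) somefun(m[, df[df$I == i,]$C, drop=FALSE]),
                              mc.cores = 4))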
But with larger data, my approach was to split the data.frame of user/column pairs into separate data.frames according to which matrix chunk M1-M100 each column lives in, run a parallel loop over those data.frames (reading in the associated matrix chunk, looping over its users, extracting the columns, and applying my function), and then loop over the output list again to re-aggregate the per-user results across chunks (sketched in code below).
This is not ideal if I have a function that can't be reaggregated like that (as of now, this is not a concern), but I'm apparently shuffling too much data around with this approach.
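In code, that approach is roughly the sketch below; the file names, the chunk_id and local_col columns, and the chunk layout are placeholders for my actual setup:

library(parallel)
library(Matrix)

# df has one row per (user, column) pair; chunk_id says which on-disk piece
# M1-M100 a column lives in, and local_col is its index within that piece
chunks = split(df, df$chunk_id)

partial = mclapply(names(chunks), function(id) {
  m_chunk = readRDS(sprintf("M%s.rds", id))   # read one chunk of M
  d = chunks[[id]]
  # per-user partial sums within this chunk (columns named by user id)
  sapply(split(d$local_col, d$I),
         function(cols) rowSums(m_chunk[, cols, drop = FALSE]))
}, mc.cores = 8)

# second loop: re-aggregate the per-chunk partial sums for each user
users = sort(unique(df$I))
res = sapply(users, function(u) {
  Reduce(`+`, lapply(partial, function(p) {
    if (as.character(u) %in% colnames(p)) p[, as.character(u)] else 0
  }))
})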