I tried reporting a bug I was running into with mclapply, where large return values are not allowed.
Apparently the bug has been fixed in development versions of R, but I'm more interested in the comment the responder made:
there was a 2GB limit on the size of serialized objects which e.g. mclapply can return from the forked processes and this example is attempting 16GB. That has been lifted (for 64-bit builds) in R-devel, but such usage is very unusual and rather inefficient (the example needs ca 150GB because of all the copies involved in (un)serialization)
If using mclapply for parallel computation with large data is inefficient, then what is a better way to do it? My need for this kind of thing is only increasing, and I'm definitely running into bottlenecks everywhere. The tutorials I've seen have been fairly basic introductions to how to use the functions, but not to how to use them effectively or manage the trade-offs. The documentation has a small blurb on this trade-off:
mc.preschedule: if set to ‘TRUE’ then the computation is first divided to (at most) as many jobs are there are cores and then the jobs are started, each job possibly covering more than one value. If set to ‘FALSE’ then one job is forked for each value of ‘X’. The former is better for short computations or large number of values in ‘X’, the latter is better for jobs that have high variance of completion time and not too many values of ‘X’ compared to ‘mc.cores’
and
By default (‘mc.preschedule = TRUE’) the input ‘X’ is split into as many parts as there are cores (currently the values are spread across the cores sequentially, i.e. first value to core 1, second to core 2, ... (core + 1)-th value to core 1 etc.) and then one process is forked to each core and the results are collected.
Without prescheduling, a separate job is forked for each value of ‘X’. To ensure that no more than ‘mc.cores’ jobs are running at once, once that number has been forked the master process waits for a child to complete before the next fork
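To make the trade-off concrete, here is a small sketch of my own (not from the docs; the task lengths are made up) comparing the two scheduling modes:

library(parallel)

# 16 tasks, one much slower than the rest
task_times = c(8, rep(1, 15))

# Prescheduled: the tasks are dealt out to the cores up front, so whichever
# core draws the slow task also has to work through its share of short ones
system.time(mclapply(task_times, Sys.sleep, mc.cores = 4, mc.preschedule = TRUE))

# One fork per task: idle cores keep picking up the remaining short tasks
# while the slow one runs, so the slow task alone sets the total time
system.time(mclapply(task_times, Sys.sleep, mc.cores = 4, mc.preschedule = FALSE))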
Benchmarking these things reliably takes a lot of time since some problems only manifest themselves at scale, and then it is hard to figure out what is going on. So having better insight into the behavior of the functions would be helpful.
edit:
I don't have a specific example, because I use mclapply a lot and want a better sense of how to think about its performance implications. And while writing to disk would get around the error, I don't think it helps with the (de)serialization that has to occur, which would then also have to go through disk I/O.
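For reference, the disk-based variant I have in mind looks roughly like this (compute_chunk is just a placeholder for the real work); the big objects still get serialized, only to files instead of through the fork's pipe:

library(parallel)

# Each worker writes its (potentially large) result to a temporary RDS file
# and returns only the path, so the big object is not pushed back through
# the fork's serialization pipe -- but it is still serialized, to disk
res_paths = mclapply(seq_len(100), function(i) {
  big_result = compute_chunk(i)     # placeholder for the real work
  path = file.path(tempdir(), sprintf("chunk_%03d.rds", i))
  saveRDS(big_result, path)
  path
}, mc.cores = 8)

# the master then reads the pieces back, paying the deserialization cost again
results = lapply(res_paths, readRDS)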
One workflow would be as follows:
Take a large sparse matrix M, and write it to disk in chunks (say M1-M100) because M itself does not fit in memory.

Now say that for each user i in I there are Ci columns in M that I want to add up and aggregate at the user level. With smaller data, this would be relatively trivial:
# toy data: a 5x5 matrix and a lookup of (user, column) pairs
m = matrix(runif(25), ncol=5)
df = data.frame(I=sample(1:6, 20, replace=TRUE), C=sample(1:5, 20, replace=TRUE))
somefun = function(m) rowSums(m)
# for each user, sum up the columns of m associated with that user
# (drop=FALSE keeps a one-column selection a matrix so rowSums still works)
res = sapply(sort(unique(df$I)), function(i) somefun(m[, df[df$I == i,]$C, drop=FALSE]))
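(The naive parallel version just swaps sapply for mclapply; it is the per-user results coming back from the forks that get serialized, which is where the overhead the responder mentions starts to matter once m is large:)

library(parallel)
res = simplify2array(mclapply(sort(unique(df$I)),
                              function(i) somefun(m[, df[df$I == i,]$C, drop=FALSE]),
                              mc.cores = 4))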
But with larger data, my approach was to split the data.frame of user/column pairs into separate data.frames according to which matrix chunk M1-M100 each column lives in, run a parallel loop over those data.frames (reading in the associated matrix chunk, looping over its users, extracting the columns, and applying my function), and then loop over the output list again to re-aggregate the per-user results across chunks (sketched in code below).
This is not ideal if I have a function that can't be reaggregated like that (as of now, this is not a concern), but I'm apparently shuffling too much data around with this approach.
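In code, that approach is roughly the sketch below; the file names, the chunk_id and local_col columns, and the chunk layout are placeholders for my actual setup:

library(parallel)
library(Matrix)

# df has one row per (user, column) pair; chunk_id says which on-disk piece
# M1-M100 a column lives in, and local_col is its index within that piece
chunks = split(df, df$chunk_id)

partial = mclapply(names(chunks), function(id) {
  m_chunk = readRDS(sprintf("M%s.rds", id))   # read one chunk of M
  d = chunks[[id]]
  # per-user partial sums within this chunk (columns named by user id)
  sapply(split(d$local_col, d$I),
         function(cols) rowSums(m_chunk[, cols, drop = FALSE]))
}, mc.cores = 8)

# second loop: re-aggregate the per-chunk partial sums for each user
users = sort(unique(df$I))
res = sapply(users, function(u) {
  Reduce(`+`, lapply(partial, function(p) {
    if (as.character(u) %in% colnames(p)) p[, as.character(u)] else 0
  }))
})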