This is a somewhat generic question, for which I apologize, but I can't produce a code example that reproduces the behavior. My question is this: I'm scoring a largish data set (~11 million rows with 274 dimensions) by subdividing it into a list of data frames and then running a scoring function on 16 cores of a 24-core Linux server using `mclapply`. Each data frame in the list is allocated to a forked worker and scored, returning a list of data frames of predictions. While `mclapply` is running, the various R instances spend more time in uninterruptible sleep than they spend running. Has anyone else experienced this with `mclapply`? I'm a Linux neophyte; from an OS perspective, does this make any sense? Thanks.
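To make the setup concrete, here is a minimal sketch of the pattern described above. The `score` function, the data, and the chunk counts are all hypothetical stand-ins (and it uses 2 cores rather than 16 so it runs anywhere); the point is just the split-then-`mclapply` structure:

```r
library(parallel)

# Hypothetical stand-in for the real scoring function
score <- function(df) data.frame(pred = rowSums(df))

# Toy data in place of the ~11M x 274 scoring set
dat <- as.data.frame(matrix(runif(1000 * 4), ncol = 4))

# One chunk per core (2 here in place of 16), scored in parallel
parts <- split(dat, rep(1:2, length.out = nrow(dat)))
preds <- mclapply(parts, score, mc.cores = 2)
```

Each worker is a fork of the parent R process, so each holds (copy-on-write) references to the parent's memory plus its own chunk and result.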
- That is not "sleep." You might try `Rprof` -- or there's a similar tool for multicore process tracking -- to see which operations are taking a lot of time. – Carl Witthoft Feb 05 '14 at 18:44
- You need to be careful when using `mclapply` to operate on large data sets. You should check to see if you're low on memory and swapping badly. Does it work any better when using fewer cores? – Steve Weston Feb 05 '14 at 19:36
- I'm a little low on swap, but I think the issue is just memory in general. It's very heavily utilized in this operation, and that seems to be the bottleneck. My list is currently the same size as my core count; if I make the list longer or use fewer cores, will it help? What role does prescheduling play? – TomR Feb 05 '14 at 20:23
1 Answer
You need to be careful when using `mclapply` to operate on large data sets. It's easy to create too many workers for the amount of memory on your computer and the amount of memory used by your computation. It's hard to predict the memory requirements due to the complexity of R's memory management, so it's best to monitor memory usage carefully with a tool such as `top` or `htop`.

You may be able to decrease the memory usage by splitting your work into more but smaller tasks, since that may reduce the memory needed by the computation. I don't think the choice of prescheduling affects the memory usage much, since `mclapply` never forks more than `mc.cores` workers at a time, regardless of the value of `mc.preschedule`.

Steve Weston
-
Thanks, that's what I ended up doing. I just added a layer of lists: instead of splitting my massive scoring set into a list of 16 frames, I split it into a list of lists of 16 frames, and it seems to be running fine. – TomR Feb 06 '14 at 20:09
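The "list of lists" workaround described in this comment might be sketched as follows: process the outer list of batches sequentially, and each inner batch of frames in parallel, so only one batch's worth of data and predictions is live at a time. All names and sizes here are hypothetical (and it uses 2 cores rather than 16):

```r
library(parallel)

# Hypothetical scoring function for illustration
score_chunk <- function(df) data.frame(pred = df$x + 1)

big    <- data.frame(x = runif(8000))
chunks <- split(big, rep(1:32, each = 250))   # 32 small frames

# Group the 32 frames into 2 batches of 16 (one frame per core)
batches <- split(chunks, rep(1:2, each = 16))

# Outer lapply runs batches one after another; inner mclapply
# scores the frames of the current batch in parallel.
preds  <- lapply(batches, function(b)
  mclapply(b, score_chunk, mc.cores = 2))

result <- do.call(rbind, unlist(preds, recursive = FALSE))
```

Compared with one `mclapply` over all 32 frames at once, this bounds how many results accumulate before each batch is collected back into the parent.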