
I'm using a foreach loop to try to speed up some data processing I'm doing. I'd upload the full code, but it's about 2k lines long, so that doesn't seem worthwhile. Basically, I have a bunch of matrices (15 columns wide and 300 to 1500 rows long) that I need to pass through Mplus using mclust. A for loop wraps around the foreach loop, which contains the mclust model fitting. Something like this:

library(doParallel)
registerDoParallel(4)

for (i in 1:10) {
 if (i==1) {data <- get(load("file.rda"))} # I've broken the data into 10 smaller chunks
 if (i==2) ...
 out <- foreach (sim=1:length(data), .packages=c('mclust','MplusAutomation')) %dopar% {
 # Proceed to fit various models in Mplus, saving the important output to a matrix as
 results[1:130]
 # this is so the thing that gets reported is the list of results I need and not a single value
 }
 if (i==1) {save(out, file="out.file.rda")}
 if (i==2) ...
}
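(As an aside, the per-chunk `if()` branches can be collapsed by building the file names from `i` with `paste0()`. A minimal sketch, where `"chunk"` and `"out"` are placeholder stems standing in for my real file names:

```r
# One file name per chunk, built from the loop index instead of one
# if() branch per value of i
infiles  <- paste0("chunk", 1:10, ".rda")
outfiles <- paste0("out",   1:10, ".rda")
infiles[1]  # "chunk1.rda"

# Inside the loop, the pair for chunk i is then just:
# data <- get(load(infiles[i]))  # load() returns the object's *name*; get() fetches it
# ...foreach model fitting as above...
# save(out, file = outfiles[i])
```

)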

Anyway, I know the code works on smaller data batches (for instance, if I tell it to run only on the first ten cases in each of the datasets, it runs clean through without issue). However, when I ramp this up to the full dataset, I get errors like this:

Error in { : task 175 failed - "cannot open the connection"

It seems to happen at different points in the script, not always at the same time or place. I've tried varying how many cores it uses (4-6) and how much data it loads at any one time (from all 6.6 GB at once down to 1/10th of that), and I've increased the working memory (memory.limit(size=56000)), but none of these changes has allowed the code to run without error. In fact, it has never managed to complete even one pass of the i loop yet.
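One thing I've been experimenting with to get past the vague "task N failed" report: wrapping each task body in `tryCatch()` so a failure is returned as data instead of aborting the whole foreach. A minimal sketch, where the `if (sim == 3) stop(...)` line is a deliberate stand-in for whatever is failing in my real fitting code:

```r
library(foreach)
library(doParallel)

registerDoParallel(2)

# Each task returns either its result or a record of its own failure,
# so one bad task no longer kills the entire foreach call
out <- foreach(sim = 1:4) %dopar% {
  tryCatch({
    if (sim == 3) stop("cannot open the connection")  # stand-in for the real fit
    sqrt(sim)
  }, error = function(e) list(sim = sim, error = conditionMessage(e)))
}
stopImplicitCluster()

# Pull out the tasks that failed, with their indices and messages
failed <- Filter(function(x) is.list(x) && !is.null(x$error), out)
```

With this pattern, `failed` records exactly which `sim` hit the error and what the message was, instead of the whole loop dying at an unpredictable point.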

Any suggestions?

PSB
  • R is case sensitive, it's `registerDoParallel`, not what is in the script. – Rui Barradas Nov 02 '21 at 15:51
  • Thanks! I’ll double check that, but that’s most likely a copy error on my part I think. – PSB Nov 02 '21 at 16:28
  • I checked and I do have that correct in the original code. I’ll update my example above – PSB Nov 02 '21 at 16:46
  • Oh, one thing I forgot to mention, the data is being stored on a dropbox folder. I'm not sure if that could be part of the issue. – PSB Nov 02 '21 at 17:15
  • Can't you put `if (i==1) {data=load(file.rda)}` outside the `for` and loop from 2:10? If you are looping to access data in a dropbox then you'll probably see connection related errors. – Rui Barradas Nov 02 '21 at 17:19
  • What would looping that way accomplish? Test if there’s an error in that first chunk of data? I can definitely do it I just want to know why I should think to do that? – PSB Nov 02 '21 at 18:46
  • When `i==1`, the only thing that that branch of the `if` does is to load the file `file.rda`, or am I wrong? – Rui Barradas Nov 02 '21 at 18:55
  • Oooooooo that is actually very helpful. The vague crash reports have not been helpful and have been the main source of my pain this past week. Thanks @HenrikB! – PSB Nov 03 '21 at 00:15
  • I retracted (=deleted) my previous comment, because it was most likely not correct; that error message is probably *not* from your parallel worker crashing/dying. Instead, the error is probably related to some unknown piece of code running on a worker not being able to open a connection. Hard to tell without seeing all of your code. – HenrikB Nov 03 '21 at 07:02
  • What's the best way to catch that kind of error? I've tried running it on different subsamples of the data (100 of the first datafile, 20 of each datafile) and the code executed without error. – PSB Nov 03 '21 at 16:18
  • I would use divide-and-conquer to narrow in on which data file is causing the problem, i.e. split the data set in half, run the two halves, and repeat on the half that produces the error. Eventually, you will have found the problematic file. Like others, I now also suspect it's due to a problematic file. It's unlikely due to parallelization. But hard to tell without you sharing all your code. – HenrikB Nov 04 '21 at 20:04
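The divide-and-conquer search in the last comment can be sketched as a small bisection helper. This is a hypothetical illustration, not code from the question: `fit_one` stands in for fitting one data file and is assumed to raise an error on the problematic one.

```r
# Bisect a vector of file/dataset indices down to one that makes
# fit_one() fail; fit_one is a stand-in for the real model-fitting step
find_bad <- function(idx, fit_one) {
  while (length(idx) > 1) {
    half <- idx[seq_len(length(idx) %/% 2)]       # first half of the remaining indices
    ok <- !inherits(try(lapply(half, fit_one), silent = TRUE), "try-error")
    idx <- if (ok) setdiff(idx, half) else half   # keep whichever half fails
  }
  idx
}

# Example: pretend file 7 of 10 is the corrupt one
fake_fit <- function(i) if (i == 7) stop("cannot open the connection") else i
find_bad(1:10, fake_fit)  # returns 7
```

If more than one file is bad, this finds one of them per run; rerun on the remaining indices to find the rest.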

0 Answers