
I recently encountered a weird situation with a loop that I parallelize using mclapply.

The parallel calls sometimes return NULLs with mclapply, but everything works when I use lapply.

Things also work with mclapply, but only if I don't use data.table for subsetting in the called function.

I don't have a reasonable minimal reproducible example yet that I could post here, but I can provide the code on request.

The simplified general structure looks like this:

library(parallel)
library(data.table)

foo <- function(d) { # d is a data.table
    unlist(mclapply(seq_len(nrow(d)), function(i) bar(d[-i]), mc.cores = 4))
}


bar <- function(d) {    
    ...
    ## this version fails:
    pdists <- lapply(unique(d$comp),
                     function(cc) dist(d[d$comp==cc,.(X,Y)]))
    ## this also fails:
    pdists <- lapply(unique(d$comp),
                     function(cc) dist(d[cc, .(X,Y), on="comp"]))
    ## this way it works:
    pdists <- lapply(unique(d$comp),
                     function(cc) dist(d[d$comp==cc,c("X","Y")]))
    ...
}

When looking at what mclapply returns and checking which elements are NULL, I get:

  write error, closing pipe to the master
  [1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE TRUE FALSE
 [13] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE TRUE FALSE
 [25] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE TRUE FALSE
...
[337] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE TRUE FALSE
[349] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE TRUE FALSE

This looks as if one of the four worker processes dies (I use mc.cores=4).
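The spacing of the TRUE entries matches how mclapply splits the work when mc.preschedule = TRUE (the default). As far as I can tell from the source of mclapply in the parallel package, element i goes to job ((i - 1) %% mc.cores) + 1, so a single failing worker would produce exactly this every-fourth-element pattern. The base-R sketch below only reproduces that index arithmetic (it does not fork anything); preschedule_jobs is a hypothetical helper name.

```r
# Base-R sketch of the round-robin split used by mclapply() when
# mc.preschedule = TRUE. preschedule_jobs() is a hypothetical helper that
# only reproduces the index arithmetic; it does not fork anything.
preschedule_jobs <- function(n, cores) {
  cores <- min(cores, n)  # never more jobs than elements
  # job k receives elements k, k + cores, k + 2 * cores, ...
  lapply(seq_len(cores), function(k) seq(k, n, by = cores))
}

jobs <- preschedule_jobs(24, cores = 4)
jobs[[3]]  # 3 7 11 15 19 23

# If the worker running job 3 dies, every element it owned comes back NULL,
# which gives the same every-fourth-element pattern as the output above.
lost <- seq_len(24) %in% jobs[[3]]
which(lost)  # 3 7 11 15 19 23
```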

Are there issues with thread safety in data.table?

(I have reproduced the problem on two different computers)

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_DK.utf8         LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=de_CH.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=de_CH.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.12.0

loaded via a namespace (and not attached):
[1] compiler_3.5.2 tools_3.5.2

Update: Based on the comment by @jangorecki, I added setDTthreads(1), but the error still occurs. I again tried different versions:

## works:
pdists <- lapply(split(d[,.(comp,X,Y)], by="comp", keep.by=FALSE), FUN=dist) 

## these fail:
pdists <- lapply(unique(d$comp), function(cc) dist(d[cc, .(X,Y), on="comp"]))
pdists <- lapply(unique(d$comp), function(cc) dist(d[comp==cc,.(X,Y)])) 

Update 2: Interestingly, timing plays a role. When I introduce random delays in the called function bar and pass mc.preschedule = FALSE to mclapply, the number of failing calls varies.

It is always the third call that fails (with mc.cores >= 3), together with a number of consecutive calls. The corresponding elements of the list returned by mclapply are NULL.

I also see "Error in sendMaster(try(eval(expr, env), silent = TRUE)) : write error, closing pipe to the master" for these calls. What I find disturbing is that these calls fail silently, without stopping execution.
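One way to at least make such failures loud is to check the result before using it. Per ?mclapply, values whose worker died or failed to deliver come back as NULL, and values from a job that hit an error come back as "try-error" objects (with a warning). Note that the unlist() in foo hides the problem further, because unlist() silently drops NULL elements. The sketch below assumes that a NULL return value is never a legitimate result of FUN; failed_elements and mclapply_strict are hypothetical helper names, not part of the parallel package.

```r
library(parallel)

# Hypothetical helper: TRUE for elements of an mclapply() result that came
# back NULL (worker died or could not send its result) or as a "try-error"
# (an error was caught inside the worker).
failed_elements <- function(res) {
  vapply(res, function(r) is.null(r) || inherits(r, "try-error"), logical(1))
}

# Hypothetical wrapper: same as mclapply(), but stops with an error instead
# of silently returning incomplete results. Note that it also treats a
# legitimate NULL return value of FUN as a failure.
mclapply_strict <- function(X, FUN, ..., mc.cores = 2L) {
  res <- mclapply(X, FUN, ..., mc.cores = mc.cores)
  bad <- failed_elements(res)
  if (any(bad)) {
    stop(sprintf("%d of %d elements failed (first indices: %s)",
                 sum(bad), length(res),
                 paste(head(which(bad)), collapse = ", ")))
  }
  res
}
```

In foo this would be used as unlist(mclapply_strict(seq_len(nrow(d)), function(i) bar(d[-i]), mc.cores = 4)), so a dead worker stops execution instead of shortening the result.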

user52366
  • why not `d[comp==cc, ...]`? why not `lapply(split(d[,c("comp","X","Y")], by="comp"), dist)`? what is `tmp`? – jangorecki Mar 02 '19 at 14:14
  • tmp should be d here -> corrected. Also good point lapply/split. The question remains: Why does one of the workers return NULL (because it dies I presume). – user52366 Mar 02 '19 at 14:27
  • could you check if issue persist if you set `setDTthreads(1)`? double check setting that in fresh session before calling your script – jangorecki Mar 02 '19 at 14:32
  • @jangorecki: I updated the question with tests based on your suggestion. setting DTthreads to 1 does not solve the issue. – user52366 Mar 04 '19 at 07:47
  • I am facing a similar situation where I applied mclapply over a data.table split into 30000 blocks and with 20 cores. So is there any solution to this? – Bogaso Aug 28 '19 at 12:26
  • I am similarly afflicted and hoping you solved your issue and have a solution/workaround you came up with. Here is another person similarly afflicted: https://stackoverflow.com/questions/52745779/mclapply-encounters-errors-depending-on-core-id What we three have in common is that the spacing between the erroneous results is equal to the number of cores allocated to mclapply. – malcook Sep 15 '19 at 03:23

1 Answer


Not having access to your data.table, I suspect your problem is that dist is failing for some of the groups (perhaps they are too small), and that all other groups assigned to the same core are "tainted" by the one group returning an error. This is the documented behavior of mclapply, as more fully described in https://stackoverflow.com/a/57979216/415228.
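The tainting can be seen with a toy function that fails for exactly one input; f below is hypothetical and the failing index 3 merely stands in for a problematic group. Per ?mclapply, an error caught in a worker is returned as a "try-error" object for every value in that job, whereas a worker that dies or cannot deliver its result yields NULL for those values, so checking which of the two you get may help tell the cases apart. A sketch, assuming a Unix-alike where forking is available:

```r
library(parallel)

# Toy function that fails for exactly one input (index 3).
f <- function(i) if (i == 3) stop("group ", i, " failed") else i^2

# Default prescheduling: with 2 cores, job 1 owns elements 1, 3, 5, 7.
# The single error in element 3 makes the whole job a "try-error",
# so elements 1, 5 and 7 are reported as failed too.
res_pre <- suppressWarnings(mclapply(1:8, f, mc.cores = 2))
which(vapply(res_pre, inherits, logical(1), what = "try-error"))
# expected: 1 3 5 7

# Without prescheduling each element is forked as its own job,
# so only element 3 is affected.
res_nopre <- suppressWarnings(
  mclapply(1:8, f, mc.cores = 2, mc.preschedule = FALSE))
which(vapply(res_nopre, inherits, logical(1), what = "try-error"))
# expected: 3
```

mc.preschedule = FALSE costs one fork per element, but it confines each failure to the element that caused it.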

malcook