
So I have tried a few different ways of doing this, but each returns a different error, which makes me question whether I'm even doing it correctly.

So without any parallel components, we have the following:

all_necks <- lapply(b_list, b_fun)

This works perfectly; b_list is a list of data frames and b_fun is a function that runs a ton of joins and other operations on each element.

Because each run takes about 5 minutes and there are 550 elements in b_list, I need this to be faster to be practical.
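For anyone trying to reproduce this, here is a minimal stand-in for the setup (the names `b_list` and `b_fun` come from the question; the toy contents are invented, and the real `b_fun` is far heavier):

```r
# Hypothetical stand-in for the question's objects (base R only).
set.seed(1)
lookup <- data.frame(id = 1:3, label = c("a", "b", "c"))

# A list of data frames, one per element.
b_list <- lapply(1:5, function(i) {
  data.frame(id = sample(1:3, 10, replace = TRUE), value = rnorm(10))
})

# Stands in for "a ton of joins and functions": one join, one summarisation.
b_fun <- function(df) {
  joined <- merge(df, lookup, by = "id")
  aggregate(value ~ label, data = joined, FUN = sum)
}

all_necks <- lapply(b_list, b_fun)  # the sequential baseline
```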

I tried future.apply but got the following error:

library(future.apply)
options(future.globals.maxSize= 178258920000)
plan(multiprocess, workers = 5) ## Parallelize using five cores
all_necks <- future_lapply(b_list, b_fun)

ERROR:
Error in serialize(data, node$con) : error writing to connection
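As an aside for later readers: on Windows, `plan(multisession)` is the explicit, supported way to get background R workers (`multiprocess` was an alias that resolves to it there). A self-contained sketch with toy inputs (substitute the real `b_list`/`b_fun`):

```r
library(future.apply)

# Toy inputs so the sketch runs on its own; use the real objects in practice.
b_list <- lapply(1:4, function(i) data.frame(x = rnorm(5)))
b_fun  <- function(df) sum(df$x)

plan(multisession, workers = 2)   # background PSOCK sessions; works on Windows
all_necks <- future_lapply(b_list, b_fun)
plan(sequential)                  # shut the workers down when done
```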

Then I tried foreach and got the following:

library(doParallel) 
cl <- makeCluster(detectCores())
registerDoParallel(cl)
all_necks <- foreach(i = 1:b_list %dopar% {b_fun})


ERROR:
There were 16 warnings (use warnings() to see them)
1: In environment() : closing unused connection 19 (<-DESKTOP-XXX)
2: In environment() : closing unused connection 18 (<-DESKTOP-XXX)
...
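Note that the `foreach()` call above has its parentheses in the wrong place and never actually calls `b_fun` on the elements, so it cannot work even once the connection warnings are solved. A corrected, self-contained sketch (toy `b_list`/`b_fun` invented for illustration):

```r
library(doParallel)

b_list <- lapply(1:4, function(i) data.frame(x = rnorm(5)))  # toy stand-in
b_fun  <- function(df) sum(df$x)                             # toy stand-in

cl <- makeCluster(2)   # the question wants 5 workers; 2 keeps the demo light
registerDoParallel(cl)

# Iterate over the list itself and apply b_fun to each element.
all_necks <- foreach(b = b_list) %dopar% b_fun(b)

stopCluster(cl)
```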

I must be doing this incorrectly but I really just want this long lapply to run faster via parallel processing.

I would prefer to do this on 5 cores.

EDIT: Session Info Added

R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] future.apply_1.5.0  future_1.17.0       formattable_0.2.0.1 lubridate_1.7.9     data.table_1.12.8   chron_2.3-55       
 [7] Nmisc_0.3.5         anytime_0.3.7       forcats_0.5.0       stringr_1.4.0       dplyr_1.0.0         purrr_0.3.4        
[13] readr_1.3.1         tidyr_1.1.0         tibble_3.0.1        ggplot2_3.3.2       tidyverse_1.3.0     jsonlite_1.6.1     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6      lattice_0.20-41   listenv_0.8.0     assertthat_0.2.1  digest_0.6.25     R6_2.4.1         
 [7] cellranger_1.1.0  backports_1.1.7   reprex_0.3.0      evaluate_0.14     httr_1.4.1        pillar_1.4.4     
[13] rlang_0.4.6       readxl_1.3.1      rstudioapi_0.11   furrr_0.1.0       blob_1.2.1        rmarkdown_2.3    
[19] htmlwidgets_1.5.1 munsell_0.5.0     tinytex_0.24      broom_0.5.6       compiler_4.0.1    modelr_0.1.8     
[25] xfun_0.14         pkgconfig_2.0.3   globals_0.12.5    htmltools_0.5.0   tidyselect_1.1.0  codetools_0.2-16 
[31] fansi_0.4.1       crayon_1.3.4      dbplyr_1.4.4      withr_2.2.0       rappdirs_0.3.1    grid_4.0.1       
[37] nlme_3.1-148      gtable_0.3.0      lifecycle_0.2.0   DBI_1.1.0         magrittr_1.5      scales_1.1.1     
[43] cli_2.0.2         stringi_1.4.6     fs_1.4.1          xml2_1.3.2        ellipsis_0.3.1    generics_0.0.2   
[49] vctrs_0.3.1       tools_4.0.1       glue_1.4.1        hms_0.5.3         parallel_4.0.1    colorspace_1.4-1 
[55] rvest_0.3.5       knitr_1.28        haven_2.3.1     


John Thomas
  • Have you tried using the `furrr` package? Instead of `lapply`, you would use a variant of the `map` function. – Prayag Gordy Jun 29 '20 at 18:00
  • I am not familiar, but how would that help me? I would still have the issue of speed no? – John Thomas Jun 29 '20 at 18:02
  • The `furrr` package builds on `purrr`'s mapping functions and makes them parallel (also using `plan(multiprocess)`). I find `furrr` a lot more intuitive, so it'd be easier to avoid these errors. – Prayag Gordy Jun 29 '20 at 18:04
  • Hm okay, I haven't tried that but what would I need to change for it to be implemented? – John Thomas Jun 29 '20 at 18:05
  • Hard for me to say without looking at your input, function, and intended output. Here's another answer I recently wrote using `future_pmap`: [https://stackoverflow.com/a/62381045/5885627](https://stackoverflow.com/a/62381045/5885627). – Prayag Gordy Jun 29 '20 at 18:07
  • hm okay; the input is several joins and summarizations of created variables. Each run of the lapply takes a dataframe from the list (b_list) and does all that and then returns a dataframe which has now been manipulated. So in the end I should have a list of manipulated dataframes. – John Thomas Jun 29 '20 at 18:10
  • I think you're looking for `furrr::future_map_dfr`. – Prayag Gordy Jun 29 '20 at 18:14
  • You are on windows ? – Rémi Coulaud Jun 29 '20 at 18:18
  • @RémiCoulaud yessir, on Windows, otherwise mclapply would've been good. – John Thomas Jun 29 '20 at 18:18
  • @PrayagGordy Okay that ran but then i got this error Error in serialize(data, node$con) : error writing to connection – John Thomas Jun 29 '20 at 18:32
  • Hmm, I'm not sure what the problem is. Did you try Googling your error message? It's hard for me to diagnose this problem. – Prayag Gordy Jun 29 '20 at 18:46
  • yeah so I googled and it said it could be a memory issue, hence why my connections are failing; how could we utilize the fix listed here but with `furrr`? https://stackoverflow.com/questions/28503208/doparallel-error-in-r-error-in-serializedata-nodecon-error-writing-to-con – John Thomas Jun 29 '20 at 18:51
  • Author of the future framework here: there should be no reason for 'furrr' working better or worse than 'future.apply' here - they both are just very thin map-reduce wrappers on top of the Future API provided by the 'future' package. They share the same weaknesses and strengths. They basically only differ in syntax in the same way `base::lapply()` and `purrr::map()` differ. – HenrikB Jun 29 '20 at 21:41
  • @JohnThomas, please share your `sessionInfo()` - that's critical, minimal information needed in order to give constructive feedback and help. – HenrikB Jun 29 '20 at 21:42
  • @JohnThomas, your example code on 'future.apply' uses a regular `lapply()` call - there's nothing that uses parallelization in that code. – HenrikB Jun 29 '20 at 21:45
  • Forgot to say in my comment regarding furrr and future.apply: the same reasoning applies to using `foreach()` with the doFuture adapter - it's a thin map-reduce API on top of the Future API just like 'furrr' and 'future.apply'. – HenrikB Jun 29 '20 at 21:56
  • @HenrikB Added session info to OP, something is telling me it could be a memory issue ? Truly unsure; loop works perfectly as a normal lapply, but not when I want them all working in parallel. – John Thomas Jun 29 '20 at 22:00
  • Yes, it's likely a memory issue. That's a lot of packages you've got loaded according to your session info. I just wanna make sure: are all those packages needed for this code (e.g. are they needed by `b_fun()`) or are some of them left-overs from trial'n'error attempts? Related to this, when you say "`b_list` is a dataframe" - is that a `data.frame` or do you mean `data.table` (which I see in the session info)? – HenrikB Jun 29 '20 at 22:51
  • Install the develop version of the 'future'. It's designed to provide a bit more information on critical worker errors like this. You can install it with `remotes::install_github("HenrikBengtsson/future@develop")`. Restart in a fresh R session and rerun. Hopefully, you'll get some more info when this error occurs. – HenrikB Jun 29 '20 at 22:54

1 Answer


Why wouldn't pblapply be perfect here? It works the same as lapply except that you also pass a cluster object. It's definitely easier to parallelize with pblapply on Linux, since you can just specify the cluster as an integer, but you can do it on Windows, too.

This is code I use when I have multiple OS types, so you only need the Windows part, but it might be helpful to see the whole if statement. For the Windows branch, I call two cluster functions to set up the environment each node will have: passing along any extra functions or variables, and loading any libraries on each node.

library(parallel)   # makeCluster, clusterEvalQ, clusterExport
library(pbapply)    # pblapply

numcores <- 5
if (.Platform$OS.type == "unix") {
    cl <- makeCluster(numcores, type = "FORK")
} else {
    cl <- makeCluster(numcores, type = "PSOCK")
    # If b_fun uses any libraries, load them on each node.
    clusterEvalQ(cl, {library(LibraryName)})
    # Export any functions or variables b_fun needs.
    clusterExport(cl, list("var1", "var2"), envir = environment())
}

out <- pblapply(b_list, b_fun, cl = cl)
stopCluster(cl)  # free the workers when finished
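A toy end-to-end run of the same pattern (made-up `b_list`, `b_fun`, and `offset`, invented here just to show the export step):

```r
library(parallel)
library(pbapply)

b_list <- lapply(1:4, function(i) data.frame(x = rnorm(5)))  # stand-in data
offset <- 10                                                 # a global b_fun uses
b_fun  <- function(df) sum(df$x) + offset                    # stand-in work

cl <- makeCluster(2, type = "PSOCK")
clusterExport(cl, "offset", envir = environment())  # globals b_fun needs must be shipped

out <- pblapply(b_list, b_fun, cl = cl)  # same call shape as lapply, plus cl

stopCluster(cl)
```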
buggaby