I'm working in R and using the function pblapply() for parallel processing. I love this function because it shows a progress bar (very useful for estimating the duration of very long runs).
Let's say I have a huge dataset that I split into 500 smaller subdatasets. I distribute them across different threads for parallel processing. But if one subdataset generates an error, the whole pblapply() call fails, and I don't know which of the 500 subdatasets caused the error. I have to check them one by one. When I write such a loop with the base R for() function, I can add print(i), which helps me locate the error.
Q) Can I do something similar with pblapply(), that is, display a value telling me which subdataset is currently being processed (even if several are displayed at the same time, since several subdatasets are handled simultaneously by the different threads)? It would save me time.
# The example below generates an error; here we can guess where because it's very simple.
# With pblapply(), I can't tell which element generates the error,
# whereas with the loop, testing one by one, I can find it, but that could take very long with more complex operations.
library(parallel)
library(pbapply)
dataset <- list(1,1,1,'1',1,1,1,1,1,1)
myfunction <- function(x) {
  print(x)
  5 / dataset[[x]]
}
cl <- makeCluster(2)
clusterExport(cl = cl, varlist = c('dataset', 'myfunction'), envir = environment())
result <- pblapply(
  cl = cl,
  X = seq_along(dataset),
  FUN = function(i) { myfunction(i) }
)
stopCluster(cl)  # stopCluster() must be given the cluster object
# Error in checkForRemoteErrors(val) :
# one node produced errors: non-numeric argument to binary operator
for (i in seq_along(dataset)) { myfunction(i) }
# [1] 1
# [1] 2
# [1] 3
# [1] 4
# Error in 5/dataset[[x]] : non-numeric argument to binary operator
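
One possible way to find the failing element without checking the subdatasets one by one is to catch the error on the worker and return the index instead of letting the whole call abort. The sketch below is only an illustration under that assumption, not a built-in pbapply feature: `safe_divide` and the `task_error` class are hypothetical names introduced here, and the approach assumes it is acceptable for the run to continue past a failing element.

```r
library(parallel)
library(pbapply)

dataset <- list(1, 1, 1, '1', 1, 1, 1, 1, 1, 1)

# Hypothetical wrapper (not part of pbapply): on error, return a small record
# holding the index and the message instead of stopping the whole run.
safe_divide <- function(i) {
  tryCatch(
    5 / dataset[[i]],
    error = function(e) {
      structure(
        list(index = i, message = conditionMessage(e)),
        class = "task_error"  # hypothetical marker class
      )
    }
  )
}

cl <- makeCluster(2)
clusterExport(cl, varlist = c("dataset", "safe_divide"), envir = environment())
result <- pblapply(X = seq_along(dataset), FUN = safe_divide, cl = cl)
stopCluster(cl)

# Positions whose result is an error record rather than a number
failed <- which(vapply(result, inherits, logical(1), what = "task_error"))
print(failed)
# [1] 4
```

The progress bar is kept, and `result[[4]]$message` holds the original error text. If the goal is really to see `print()` output from the workers, creating the cluster with `makeCluster(2, outfile = "")` may forward it to the launching terminal, although some front ends (RStudio, for instance) may not display it.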