
I am trying to run a function with foreach and %dopar% so that each iteration's result is passed back in as the input to the next iteration. Small example below:

require(doParallel)

test_function <- function(data)
{
  result <- rbind(data, data)
  return(result)
}

test_data <- mtcars

cl <- makeCluster(4)
registerDoParallel(cl)
results <- foreach(i = 1:10) %dopar%
{
  aa <- test_function(test_data)
  aa$iteration <- i
  test_data <- aa
  return(aa)
}
stopCluster(cl)

What I am hoping to see in results is a list of ten data frames, each one sequentially doubling in number of rows.

It appears that re-assigning test_data within the foreach loop does not do this, whereas it does work if I run the same commands in a standard for loop - like so:

results <- list()
for(i in 1:10)
{
  aa <- test_function(test_data)
  aa$iteration <- i
  test_data <- aa
  results[[i]] <- aa
}

Would appreciate any insight into what I'm overlooking here.

93i7hdjb
  • I do not know what your real problem is, but what you are trying to do here is sequential by nature. That is, the first run has to finish its job before the second run can start, and so on. It cannot be done in parallel. – 989 Feb 03 '18 at 00:54
  • Yes, I see this now. The iterations are actually being run in parallel across multiple processors, which doesn't make sense because iteration 1 needs to complete before iteration 2 can begin. Thanks! – 93i7hdjb Feb 03 '18 at 01:00
  • Btw, to disable the parallel ability of `foreach`, it's enough to use `%do%` in place of `%dopar%`. – 989 Feb 03 '18 at 01:04
  • I'm testing with parallel because I need it to solve my actual problem, which I need to rethink now that I realize what's happening here. – 93i7hdjb Feb 03 '18 at 01:06
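
For what it's worth, the `%do%` suggestion in the comment above amounts to something like the sketch below (it reuses test_function and test_data from the question). As far as I know, %do% evaluates the loop body sequentially in the calling environment, so re-assigning test_data carries over from one iteration to the next - but treat this as an untested sketch rather than a verified drop-in:

require(foreach)

test_data <- mtcars

results <- foreach(i = 1:10) %do%
{
  aa <- test_function(test_data)
  aa$iteration <- i
  # With %do% this assignment happens in the calling environment,
  # so the next iteration sees the doubled data
  test_data <- aa
  # Leave aa as the last value of the block; calling return() here can
  # fail because the block is evaluated as an expression, not a function body
  aa
}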

1 Answer


If I understand your question correctly, the issue is that you cannot update the global variable test_data from within the parallelised for-loop.

To understand why you are prevented from doing so, consider what is actually happening inside the parallelised for-loop: multiple workers running on different threads perform operations in parallel, each with its own separate, locally-scoped variables. If they all had access to a global variable (or shared memory) without any kind of protection controlling that access, whatever is stored in the variable could be corrupted - and there are several different ways that corruption might happen.
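
A minimal sketch of this isolation (the variable x and the 2-worker cluster are just for illustration): each worker gets its own copy of x, so assigning to it inside the %dopar% body never changes the x in the master session.

require(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

x <- 1
res <- foreach(j = 1:2) %dopar%
{
  # This modifies the worker's private copy of x only
  x <- x + 100
  x
}
stopCluster(cl)

x            # still 1 in the master session
unlist(res)  # 101 101 - each worker changed only its own copy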

Preventing this is the raison d'être of concurrency control structures like semaphores. These allow users to do what you are trying to, but require some care to use correctly.

However, they are not available in R. Hence, it makes sense that R protects the global variable test_data from being modified in a non-thread-safe manner. It is actually trying to protect your data.

The solution is either to rewrite your code so that it no longer tries to update a global variable (if you still want to do any kind of parallel processing), or to switch to a traditional, sequential for loop (as the commenters have already suggested).
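
For the first option, here is a minimal sketch of one way to restructure the example so that the iterations no longer depend on each other. double_n_times is a helper I am introducing for illustration: it rebuilds iteration i's result directly from the original test_data instead of reading the previous iteration's output.

require(doParallel)

# Double the rows of `data` n times
double_n_times <- function(data, n)
{
  for (k in seq_len(n)) data <- rbind(data, data)
  return(data)
}

test_data <- mtcars

cl <- makeCluster(4)
registerDoParallel(cl)
results <- foreach(i = 1:10) %dopar%
{
  # Each iteration starts from the original test_data, so no global
  # variable needs to be updated from a worker
  aa <- double_n_times(test_data, i)
  aa$iteration <- i
  aa
}
stopCluster(cl)

The i-th element of results should then have nrow(test_data) * 2^i rows, matching what the sequential loop produces, but because no iteration reads another iteration's result the work can genuinely run in parallel.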

A. G.