
I think I've completely misunderstood how foreach parallel operations work.

In the following example, what is foreach actually doing? Is it running 7 independent threads of foo(DF[i,]) for different values of i, which leapfrog each other to get the next available row? Is it splitting the computation of foo(DF[i,]) for a single value of i between 7 threads? Or is it replicating the same operation foo(DF[i,]) for the same value of i 7 separate times?

Additionally: is it possible to accomplish something like the first scenario, where several independent threads collectively iterate over the rows (or chunks of rows) of a data frame in a sort of parallel-serialized approach? Or is the only option to subset ahead of time and assign each subset to a separate thread?

registerDoParallel(7)  ## 4 physical cores, 8 logical cores

foreach(
    i = seq_len(nrow(DF)),
    .packages = c("data.table", "tidyverse")
) %dopar%
    foo(DF[i, ])
D3SL

1 Answer


It's like having 7 copies of DF, with the iterations of i split across the different cores (worker processes). So core 1 does i=1 with the first copy of DF, core 2 does i=2 with the second copy, and so on. When a core finishes its first iteration, it computes i=8 with its own copy of DF.

You can see what's going on with, e.g.:

library(foreach)
library(doParallel)

# Send each worker's console output to a temporary file
tmp <- tempfile()
cl <- makeCluster(7, outfile = tmp)
registerDoParallel(cl)  ## 4 physical cores, 8 logical cores

DF <- iris[-5]

foreach(i = seq_len(nrow(DF)), .combine = "c") %dopar% {
  cat("Processing i =", i, "with PID", Sys.getpid(), "\n")
  sum(DF[i, ])
}

# Show which worker (PID) handled which iteration
readLines(tmp)

stopCluster(cl)
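
On the second part of the question (iterating over chunks of rows rather than single rows), one possible sketch is to hand each task a block of row indices instead of a single i. This is an illustration rather than part of the original answer: it assumes 2 workers to keep the example light (the question uses 7), uses parallel::splitIndices() to build the blocks, and uses rowSums() as a stand-in for foo().

```r
library(foreach)
library(doParallel)

# Assumption: 2 workers for a light example; the question registers 7.
cl <- makeCluster(2)
registerDoParallel(cl)

DF <- iris[-5]

# Split the row numbers 1..nrow(DF) into one block per worker.
chunks <- parallel::splitIndices(nrow(DF), length(cl))

res <- foreach(idx = chunks, .combine = "c") %dopar% {
  # Each task receives a vector of row indices and handles the whole block.
  # rowSums() stands in for the question's foo().
  rowSums(DF[idx, , drop = FALSE])
}

stopCluster(cl)

# Compare with the unchunked computation.
all.equal(unname(res), unname(rowSums(DF)))
```

Fewer, larger tasks reduce the per-iteration communication overhead, at the cost of coarser load balancing when some rows take much longer than others.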
F. Privé
  • FYI, if you use `library(doFuture); registerDoFuture(); plan(multisession, workers = 7)`, then `cat()`, `message()`, etc. will "just" work and there's no need to use the `outfile` hack. – HenrikB May 12 '20 at 23:32
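
The suggestion in the comment above might look roughly like this. It is a sketch that assumes the doFuture package is installed and uses 2 workers instead of 7 to keep it light; the loop body is the same as in the answer.

```r
library(foreach)
library(doFuture)

registerDoFuture()
# Assumption: 2 background R sessions; the comment uses workers = 7.
plan(multisession, workers = 2)

DF <- iris[-5]

res <- foreach(i = seq_len(nrow(DF)), .combine = "c") %dopar% {
  # Output from cat() is relayed back to the main session,
  # so no outfile is needed.
  cat("Processing i =", i, "with PID", Sys.getpid(), "\n")
  sum(DF[i, ])
}

# Switch back to sequential processing, which shuts down the workers.
plan(sequential)
```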