What is foreach %dopar% actually doing when applied to a dataframe as in df[i,]

Question

I think I've completely misunderstood how foreach parallel operations work.

In the following example, is foreach running 7 independent threads of foo(DF[i,]) for different values of i, which leapfrog each other to get the next available row? Is it splitting the computation of foo(DF[i,]) for a single value of i between 7 threads? Or is it replicating the same operation foo(DF[i,]) for the same value of i 7 separate times?

Additionally, is it possible to accomplish something like the first scenario, where several independent threads collectively iterate over the rows (or chunks of rows) of a dataframe in a sort of parallel-serialized approach? Or is the only option to subset ahead of time and assign each subset to a separate thread?

registerDoParallel(7) ## 4 physical cores, 8 logical cores

foreach(
    i = seq(nrow(DF)),
    .packages = c("data.table", "tidyverse")
) %dopar%
    foo(DF[i, ])

score 3 · Answer 1 · answered May 12 '20 at 11:08

It's like having 7 copies of DF, with the iterations of i split across the different cores (each one a separate R process, not a thread). So core 1 does i=1 with the first copy of DF, core 2 does i=2 with the second copy, and so on. When a core finishes its first iteration, it computes a later one, e.g. i=8, with its own copy of DF.
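On the follow-up about chunks of rows: nothing above forces one row per task. Here is a minimal sketch, assuming a hypothetical `foo()` that accepts a block of rows and an arbitrary choice of 2 workers and 2 chunks (neither comes from the question). It splits `DF` up front and iterates over the pieces, so each task receives one block rather than indexing into the full data frame.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)  # assumption: 2 workers, to keep the sketch light
registerDoParallel(cl)

DF <- iris[-5]

# Hypothetical stand-in for the question's foo(): works on a block of rows
foo <- function(rows) rowSums(rows)

# Assign each row to one of n_chunks contiguous blocks, then split DF
n_chunks <- 2
chunk_id <- cut(seq_len(nrow(DF)), breaks = n_chunks, labels = FALSE)
chunks <- split(DF, chunk_id)

# Iterate over the list of blocks; each task gets one block as `chunk`
res <- foreach(chunk = chunks, .combine = "c") %dopar% foo(chunk)

stopCluster(cl)
```

Fewer, larger tasks usually mean less scheduling and communication overhead per row, at the cost of coarser load balancing; the right chunk count depends on how expensive `foo()` is.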

You can see how the iterations are distributed across the workers with, for example:

library(foreach)
library(doParallel)

tmp <- tempfile()
cl <- makeCluster(7, outfile = tmp)
registerDoParallel(cl) ##4 physical cores, 8 logical cores 

DF <- iris[-5]

foreach(i = seq(nrow(DF)), .combine = "c") %dopar% {
  cat("Processing i =", i, "with PID", Sys.getpid(), "\n")
  sum(DF[i, ])
}
readLines(tmp)
stopCluster(cl)  # shut the workers down when done
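As an alternative to reading the log file, the worker PID can be returned with each result and tabulated afterwards. This is only a sketch under assumptions: the 2-worker cluster and the name `assignments` are illustrative choices, not part of the original example.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)  # assumption: 2 workers
registerDoParallel(cl)

DF <- iris[-5]

# Return the worker PID alongside each result instead of printing it
assignments <- foreach(i = seq_len(nrow(DF)), .combine = "rbind") %dopar% {
  data.frame(i = i, pid = Sys.getpid(), total = sum(DF[i, ]))
}

stopCluster(cl)

# Number of iterations handled by each worker process
table(assignments$pid)
```

Each distinct `pid` is one worker process, so the table shows how many values of i each worker handled.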

FYI, if you use `library(doFuture); registerDoFuture(); plan(multisession, workers = 7)`, then `cat()`, `message()`, etc. will "just" work and there's no need to use the `outfile` hack. — HenrikB, May 12 '20 at 23:32
