
I'm trying to read multiple large CSV files with nested parallelism using future.

I have a single machine with 32 cores, and I want to set up nested parallelism (5 by 6): 5 outer processes with 6 cores each. I'm trying to utilize the implicit parallelism from data.table::fread(.., nThread = 6).

The R package future provides nested parallelism, and I've tried

library(future)
plan(list(tweak(multisession, workers = 5), tweak(multisession, workers = 6)))

but the above actually uses only 1 core for each subprocess:

library(doFuture)
plan(list(tweak(multisession, workers = 5), 
          tweak(multisession, workers = 6)))
registerDoFuture()
foreach(i = 1:5) %dopar% {
  availableCores()
}

[[1]]
mc.cores 
       1 

[[2]]
mc.cores 
       1 

[[3]]
mc.cores 
       1 

[[4]]
mc.cores 
       1 

[[5]]
mc.cores 
       1 

Is there a way to achieve this?

Matthew Son
    Why not give [vroom](https://vroom.r-lib.org/) a try? There is no need to get your hands dirty messing around with all the details of `future`. – Su Na Mar 31 '23 at 02:45

1 Answer

(Futureverse maintainer here)

... but the above actually uses only 1 core for each subprocess:

I see the misunderstanding here. You want to use nbrOfWorkers() (from future) here instead of availableCores() (from parallelly; re-exported as-is by future). This will give you what you expected:

> foreach(i = 1:5) %dopar% {
  nbrOfWorkers()
}
[[1]]
[1] 6
...
[[5]]
[1] 6

The reason availableCores() returns one (1) is that the future framework tries to prevent nested parallelization by mistake. It does this by setting options and environment variables that control the number of parallel workers and CPU cores, including options(mc.cores = 1L), which availableCores() correctly picks up. This prevents, for instance, a package that uses y <- mclapply(X, FUN), cl <- makeCluster(availableCores()), or plan(multisession) from running in parallel when it is already running inside a parallel worker. In contrast, nbrOfWorkers() reflects the number of workers specified by plan(). In your case, plan(multisession, workers = 6) is set in each parallel worker, from the second level of plan(list(tweak(multisession, workers = 5), tweak(multisession, workers = 6))).
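You can see this mechanism in action in a fresh R session (a minimal illustration; setting mc.cores yourself mimics what the framework does inside each worker):

```r
library(parallelly)

## Mimic what the future framework does inside each parallel worker:
## it sets mc.cores = 1 to discourage accidental nested parallelism
options(mc.cores = 1L)

availableCores()
## mc.cores 
##        1 
```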

To convince yourself that you are indeed running in parallel with your setup, you can adapt one of the examples in https://future.futureverse.org/articles/future-3-topologies.html.
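As a quick sanity check, here is a hedged sketch (it assumes doFuture >= 1.0.0 so that %dofuture% is available): record the process IDs at the inner level and verify they are all distinct, scaled down to a 2-by-3 topology to keep it cheap:

```r
library(doFuture)

## A scaled-down 2-by-3 nested topology
plan(list(tweak(multisession, workers = 2),
          tweak(multisession, workers = 3)))

pids <- foreach(i = 1:2) %dofuture% {
  foreach(j = 1:3) %dofuture% {
    Sys.getpid()
  }
}

## If the nesting works, the 2 * 3 = 6 inner PIDs are all distinct,
## and none of them is the main R process
length(unique(unlist(pids)))

plan(sequential)  ## shut down the workers
```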

Now, parallel threads are not the same as parallel processes (aka parallel workers). You can think of threads as a much lower-level parallelization mechanism. Importantly, the future framework does not constrain the number of threads used in parallel workers, including the number of parallel threads that data.table uses. Because of this, you need to explicitly call:

data <- data.table::fread(.., nThread = 6)

or, if you want to adapt to the current settings,

data <- data.table::fread(.., nThread = nbrOfWorkers())

to avoid over-parallelization. Alternatively, you can reconfigure data.table as:

## Set the number of parallel threads used by 'data.table'
## (the default is to use all physical CPU cores)
data.table::setDTthreads(nbrOfWorkers())
data <- data.table::fread(..)

BTW, in doFuture (>= 1.0.0), you no longer need registerDoFuture() if you replace %dopar% with %dofuture%. So, the gist of reading lots of CSV files in parallel is:

library(doFuture)
plan(list(tweak(multisession, workers = 5), 
          tweak(multisession, workers = 6)))

files <- dir(pattern = "\\.csv$")
res <- foreach(file = files) %dofuture% {
  data.table::setDTthreads(nbrOfWorkers())
  data.table::fread(file)
}

With all that said, note that your bottleneck will probably be the file system rather than the CPU. When you parallelize file reading, you might overwhelm the file system and end up slowing the reading down rather than speeding it up. Sometimes it is faster to read two or three files in parallel, but with more it becomes counterproductive. So, you need to benchmark with different numbers of parallel workers.
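A minimal way to run such a benchmark (a sketch, not a recommendation: the 1/2/4/8 worker counts are illustrative, fread() is pinned to one thread per worker to isolate the process-level parallelism, and the CSV files are assumed to sit in the current working directory):

```r
library(doFuture)

files <- dir(pattern = "\\.csv$")

## Time the same read with a few different worker counts
for (w in c(1, 2, 4, 8)) {
  plan(multisession, workers = w)
  dt <- system.time({
    res <- foreach(file = files) %dofuture% {
      data.table::fread(file, nThread = 1L)
    }
  })
  message(w, " workers: ", round(dt[["elapsed"]], 2), " s")
}
plan(sequential)  ## shut down the workers
```

Whichever worker count wins on your machine depends on the file system, so rerun this on the actual storage you will use.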

Moreover, these days there are R packages that are highly specialized for reading data files into R efficiently. Some of them support reading multiple files efficiently. The vroom package is one such example.
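For instance, vroom() accepts a vector of file paths and row-binds them into a single data frame, multi-threading internally. A small self-contained sketch (the two sample CSVs are created here purely for illustration; it assumes all files share the same columns):

```r
library(vroom)

## For illustration, create two small CSV files in a temporary directory;
## in practice, point 'files' at your own CSVs
td <- tempfile(); dir.create(td)
write.csv(data.frame(x = 1:3), file.path(td, "a.csv"), row.names = FALSE)
write.csv(data.frame(x = 4:6), file.path(td, "b.csv"), row.names = FALSE)

files <- dir(td, pattern = "\\.csv$", full.names = TRUE)

## vroom() reads and binds all files in one call -- no explicit
## future/foreach setup needed
data <- vroom(files)
nrow(data)  ## 6
```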

HenrikB
  • Perhaps a silly question, but in this case is the required number of cores 6*5 = 30 or 6*5 + 5 = 35? – jflournoy Apr 13 '23 at 20:48
    There are no such things as "silly questions". When using `plan(multisession, workers = 5)`, there are indeed 1 + 5 = 6 R processes running. Ideally, the main R process will be mostly idle, because it just sits there waiting for the 5 workers to send back the results. Analogously, with `plan(list(tweak(multisession, workers=5), tweak(multisession, workers=6)))`, there are 1 + 5*(1 + 6) = 1 + 5*7 = 1 + 35 = 36 R processes running, but 1 + 5*1 = 6 of them are idle because they just wait for their "child" workers to return results. The remaining 5*6 processes will do all the work ("use the cores") – HenrikB Apr 13 '23 at 21:43
  • So let's say I was working on a SLURM cluster and I had to reserve cores, it sounds like I could get away with reserving 30 without breaking anything, but might prefer to reserve 36? – jflournoy Apr 13 '23 at 22:25
    Yes, it's sufficient to reserve 30 CPU cores ("slots"). That's what I do on our Slurm & SGE clusters. FWIW, it's likely your cluster is configured so that your jobs can only access the number of CPU cores requested. This is done via Linux cgroups and means that no matter how hard you try, you will never be able to use more than you got access to. In other words, those 36 R processes will only get a sandbox with 30 cores to play with. It's not worth asking for 36 slots, because then it's likely that you only end up using 30/36 = 83% of the CPU resources allotted to the job. – HenrikB Apr 14 '23 at 01:55