I am writing a function that processes several very large data.tables and I want to parallelize this function on a Windows machine.

I could do this with the snow package, using clusterExport to create a copy of each data.table on every node in the cluster. However, this does not work because it uses too much memory.

I want to fix this by exporting a different subset of the data.tables to each node; however, I can't see how to do this with the snow package.

Here is a toy example of code that works but is memory inefficient:

library(snow)
dd <- data.frame(a = rep(1:5, each = 2), b = 11:20)
cl <- makeCluster(2, type = "SOCK")
clusterExport(cl = cl, "dd")
clusterApply(cl, x = c(2,7),  function(thresh) colMeans(dd[dd$a < thresh,]))
stopCluster(cl)

Here is an example of code that does not work but illustrates how I would like to distribute subsets of dd to the nodes:

library(snow)
dd <- data.frame(a = rep(1:5, each = 2), b = 11:20)
cl <- makeCluster(2, type = "SOCK")

dd_exports <- lapply(c(2, 7), function(thresh) dd[dd$a < thresh, ])
# Now we export the ith element of dd_exports to the ith node:
clusterExport(cl = cl, dd_exports)
clusterApply(cl, x = c(2,7),  function(x) colMeans(dd))
stopCluster(cl)
orizon
  • Please provide some code you tried with some data. – F. Privé May 11 '18 at 20:22
  • @f-privé I have added some code in an attempt to better explain the question. – orizon May 11 '18 at 20:58
  • Is your data composed of numeric data only? Is the computation that you need to do still much more demanding than the subsetting itself? – F. Privé May 11 '18 at 21:55
  • @f-privé My data is not numeric only; it is numeric, POSIXct and categorical. The computation is orders of magnitude more demanding than the subsetting. – orizon May 11 '18 at 22:48
  • So, I think the best would be to store each subset of the data.table in an RDS file. Then each core reads one file and does the computation on that part. – F. Privé May 12 '18 at 21:02
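
Following up on the last comment, here is a minimal sketch of the RDS approach, assuming all workers run on the same machine so they can read files written by the master; the file names and the thresholds vector are illustrative, not part of the original question:

library(snow)
library(data.table)

dd <- data.table(a = rep(1:5, each = 2), b = 11:20)
thresholds <- c(2, 7)

# Save one subset per node to disk so the full dd is never copied to a worker.
files <- vapply(seq_along(thresholds), function(i) {
  f <- file.path(tempdir(), sprintf("dd_subset_%d.rds", i))
  saveRDS(dd[a < thresholds[i]], f)
  f
}, character(1))

cl <- makeCluster(2, type = "SOCK")
clusterEvalQ(cl, library(data.table))  # so workers can handle the data.table objects
# Each worker reads only its own file and computes on that part.
clusterApply(cl, x = files, function(f) colMeans(readRDS(f)))
stopCluster(cl)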

1 Answer

Since cl is a list, simply subset it during the clusterExport call:

library(data.table)
library(parallel)

dt <- data.table(a = rep(1:5, each = 2), b = 11:20)
cl <- makeCluster(2)
idx <- c(2, 7)

# Subsetting the cluster object gives a one-node cluster, so each node
# receives only its own subset of dt as dt2.
for (i in seq_along(cl)) {
  dt2 <- dt[a < idx[i]]
  clusterExport(cl[i], "dt2")
}
rm(dt2)  # the master's copy is no longer needed

# Each node computes on its own dt2:
clusterEvalQ(cl, colMeans(dt2))
#> [[1]]
#>    a    b 
#>  1.0 11.5 
#> 
#> [[2]]
#>    a    b 
#>  3.0 15.5
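
Once each node holds its own dt2, additional per-node arguments can still be supplied through clusterApply(); the multiplier below is a made-up example just to show that the function body resolves dt2 on each worker:

# Hypothetical extra per-node argument; each worker uses its own local dt2.
clusterApply(cl, x = c(10, 100), function(k) colMeans(dt2) * k)
stopCluster(cl)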
jblood94