Can the parallel or snow packages in R interface with a spark cluster?

Question

I am working with a computationally intensive package in R. The package has no alternative implementation that interfaces with a Spark cluster; however, it does have an optional argument that accepts a cluster created with the parallel package. Can I connect to a Spark cluster using something like sparklyr, and then use that Spark cluster in a makeCluster call to pass into my function?

I have successfully gotten the cluster working with parallel, but I do not know how, or whether it is even possible, to leverage the Spark cluster.

library(bnlearn)
library(parallel)

my_cluster <- makeCluster(3)
...
pc_structure <- pc.stable(train[,-1], cluster = my_cluster)

Can I connect to a Spark cluster as follows:

sc <- spark_connect(master = "yarn-client", config = config, version = '1.6.2')

and then leverage the connection (the sc object) in the makeCluster() function?

Alternatively can I connect to the underlying EMR clusters that Spark is utilizing off of the RStudio Server? — niccalis, Mar 22 '19 at 21:27

score 0 · Answer 1 · answered May 20 '19 at 14:10

If I understand you correctly, and if it would solve your problem, I'd wrap the code that uses the parallel package in a SparkR function such as spark.lapply (or something similar in sparklyr; I don't have experience with that).

I assume your Spark cluster is Linux-based, so the mclapply function from the parallel package can be used (instead of makeCluster and the subsequent clusterExport, as on Windows).

For example, a locally executed task that sums the numbers in each element of a list would be (on Linux):

library(parallel)
input = list(c(1,2,3), c(1,2,3,4), c(1,2,3,4,5))
res = mclapply(X=input, FUN=sum, mc.cores=3)

and doing the same task 10000 times using a Spark cluster:

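library(SparkR)
sparkR.session()  # start or attach to a Spark session

input = list(c(1,2,3), c(1,2,3,4), c(1,2,3,4,5))
# the file must be readable at this path on every worker node
save(input, file="/path/testData.RData")

res = spark.lapply(1:10000, function(x){
    library(parallel)
    load("/path/testData.RData")
    mclapply(X=input, FUN=sum, mc.cores=3)
})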

The question is whether your code can be tweaked that way.
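Since the function in the question takes a cluster object rather than calling mclapply itself, one possible tweak is to create a local socket cluster inside the function that each Spark task runs. The following is only a sketch under that assumption: run_on_node and n_workers are hypothetical names, the toy sum stands in for the real work, and it is run here with a plain lapply so it can be checked without Spark.

```r
library(parallel)

# Hypothetical helper: the work a single Spark task would do on its node.
run_on_node <- function(x, n_workers = 2) {
  input <- list(c(1, 2, 3), c(1, 2, 3, 4), c(1, 2, 3, 4, 5))
  cl <- parallel::makeCluster(n_workers)  # local socket cluster on this node
  on.exit(parallel::stopCluster(cl))      # release the workers on exit
  # A function that accepts a cluster argument (such as pc.stable in the
  # question) would be called here with cluster = cl instead of parLapply.
  parallel::parLapply(cl, input, sum)
}

# Local check without Spark. With a SparkR session running, the call
# would instead be spark.lapply(1:10000, run_on_node).
res_local <- lapply(1:2, run_on_node)
```

Note that this only parallelises within each node. Spark distributes the outer tasks, so it helps only if the job can be split into independent pieces.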
