
I have four 32-core Linux servers (CentOS 7) that I would like to use for parallelized computation in R.

So far I have only been using the doMC package and registerDoMC(cores = 32) to utilize the multicore capabilities of a single server. I would like to expand this to all four servers (i.e. 128 = 32 x 4 cores, if possible).
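For reference, my current single-server setup looks roughly like this (the loop body is just a placeholder computation):

library("doMC")
registerDoMC(cores = 32)

## placeholder computation across 32 local cores
res <- foreach(i = 1:100, .combine = c) %dopar% {
  sqrt(i)
}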

I have done some searching online, and it seems like there are a bunch of choices: PSOCK, MPI, SNOW, SparkR, etc. Nonetheless, I could not get it to work with any of the suggestions I found online.

I am aware there are some prerequisites. Here is what I have done so far (a quick check from R is sketched below):

1) All servers are "connected", i.e. they can SSH to each other with passwordless login.
2) An NFS share is mounted so all servers can access it (read, write and execute).
3) All servers run the same R binaries (an Anaconda build in a shared location that all servers can execute).
4) Installed openmpi, Rmpi, snow, doSNOW, Spark and SparkR (although I don't know how to use them).
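A minimal sketch of such a check, assuming "node1" through "node4" are placeholder hostnames for the four servers and passwordless SSH is set up as in 1): can R start a worker on each server over SSH?

library("parallel")

## Try to launch one R worker per server over SSH and report each worker's hostname
cl <- makePSOCKcluster(c("node1", "node2", "node3", "node4"))
clusterEvalQ(cl, Sys.info()[["nodename"]])
stopCluster(cl)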

Can anyone give some advice on what I can do next?

Thanks a lot

lui.lui
  • You can have a look at [this](http://spark.rstudio.com/) from the R side of Apache Spark, but you will need to set up your servers as a cluster for distributed computing. The best advice I can give you is to seek help from someone with experience in *high-performance computing* (HPC); this is not something trivial to set up properly. – Kevin Arseneau Mar 13 '18 at 07:26
  • I think my problem is exactly "how to setup a local cluster for distributed computing" – lui.lui Mar 13 '18 at 07:54
  • See my link, there is a tutorial there – Kevin Arseneau Mar 13 '18 at 08:02

1 Answer


Have a look at the future package (I'm the author). It provides an ecosystem that wraps up various parallel backends in a unified API. In your particular case, with four 32-core machines to which you've already got SSH "batch" access, you can specify your 4 * 32 = 128 workers as:

library("future")

## Set up 4 * 32 workers on four machines
machines <- c("node1", "node2", "node3", "node4")
workers <- rep(machines, each = 32L)
plan(cluster, workers = workers)

If your machines don't have hostnames, you can specify their IP addresses instead.
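For example, with placeholder IP addresses:

## Placeholder IP addresses instead of hostnames
machines <- c("192.168.1.101", "192.168.1.102", "192.168.1.103", "192.168.1.104")
workers <- rep(machines, each = 32L)
plan(cluster, workers = workers)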

Next, if you would like to use foreach, just continue with:

library("doFuture")
registerDoFuture()

y <- foreach(i = 1:100) %dopar% {
  ...
  value
}
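For example, with a trivial placeholder computation in the loop body:

## placeholder computation: each iteration returns its index squared
y <- foreach(i = 1:100, .combine = c) %dopar% {
  i^2
}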

If you prefer lapply, you can use future.apply as:

library("future.apply")

y <- future_lapply(1:100, FUN = function(i) {
  ...
  value
})

Technical details:
The above sets up a PSOCK cluster as defined by the 'parallel' package. These are basically the same as SNOW clusters, and the same author, I believe, considers SNOW clusters superseded by what 'parallel' provides. In other words, AFAIK there is no point in using snow/doSNOW anymore; parallel/doParallel replaces those these days.
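To illustrate, here is a minimal sketch (same placeholder hostnames as above) of creating and using the equivalent PSOCK cluster directly with 'parallel' alone:

library("parallel")

## 4 * 32 = 128 workers, launched over SSH
machines <- c("node1", "node2", "node3", "node4")
cl <- makePSOCKcluster(rep(machines, each = 32L))
y <- parLapply(cl, 1:100, function(i) i^2)  ## placeholder computation
stopCluster(cl)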

I'd put MPI clusters under the heading of "advanced usage", i.e. unless you already have one set up and running, or unless you really think you need MPI, I would hold back on it. MPI also encourages a different algorithm design in order to take full advantage of it. PSOCK clusters take you a long way, and only if you think you've exhausted those should you look into MPI.

Spark is a whole different creature. It's designed around distributed computing on distributed data (in RAM). Your analysis might require that, but, again, I recommend that you start with the PSOCK clusters above - they take you a long way.

A final PS: if you have an HPC scheduler (it doesn't sound like you do), just use, say, plan(future.batchtools::batchtools_sge) instead. Nothing else in your code needs to be changed.
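For example, a minimal sketch assuming an SGE scheduler and that the future.batchtools package is installed:

library("future.batchtools")

## Each future is submitted as a job to the SGE scheduler
plan(batchtools_sge)

## The foreach / future_lapply code above runs unchanged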

HenrikB
  • Sounds like those are the exact libraries I need. Will give it a try and let you know. I also have two follow-up questions: 1) Why don't you put future/doFuture/future.apply all in the same package instead of having 3? 2) When I run foreach on 4M rows of data in memory, how would it run in the backend: will 1M rows be copied over to one R session on each machine, processed there with parallel cores, and sent back? Or would it create 32 R sessions, copy 1/32M rows into each R session, process them without parallelism and send them back? Or do I need to manually start an R session on each machine..... – lui.lui Apr 11 '18 at 15:55