
I am running several R scripts in batch mode at once on a Linux cluster to estimate the same model on different data sets (the same thing happens when I run them on a Mac). The scripts are identical except for the data set they use. I get the following message when I do that:

Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, : 
cannot open the connection
Calls: makePSOCKcluster -> newPSOCKnode -> socketConnection
In addition: Warning message:
In socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
port 11426 cannot be opened

Here is a reproducible example. Create three files, tmp1.R, tmp2.R, and tmp.sh, with the following contents.

Content of the files tmp1.R and tmp2.R:

library(dclone)  # attaches parallel, which provides makePSOCKcluster and parLapply
l <- list(1:100, 1:100, 1:100, 1:100)
cl <- makePSOCKcluster(4)
parLapply(cl, X = l, fun = function(x) { Sys.sleep(2); sum(x) })
stopCluster(cl)

Content of the tmp.sh file:

#!/bin/sh
R CMD BATCH tmp1.R &
R CMD BATCH tmp2.R &

The first script runs fine; the second fails with the error above. Does anyone know how to solve this so that all the scripts can still run at once, automatically and without any manual intervention?
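
For reference, ?makeCluster documents a port option (taken by default from the R_PARALLEL_PORT environment variable, otherwise chosen randomly), so one possible workaround, untested here, is to pin a distinct port in each script. A minimal sketch of tmp1.R with hypothetical port numbers:

library(dclone)
l <- list(1:100, 1:100, 1:100, 1:100)
cl <- makePSOCKcluster(4, port = 11001)  # tmp2.R would use a different port, e.g. 11002
parLapply(cl, X = l, fun = function(x) { Sys.sleep(2); sum(x) })
stopCluster(cl)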

PS: I have read all the other similar questions; none of them has a reproducible example or an answer to the question above.

Diogo
    Why are you starting two clusters at once? – Hong Ooi Oct 27 '16 at 00:30
  • Not exactly an answer... but why do you use SOCK clusters if you are running the code on localhost? On Linux/Mac you can use fork clusters (or just mclapply; see the sketch after these comments). – Ott Toomet Oct 27 '16 at 00:36
  • For @HongOoi, this is just a toy example. The real reason is more complicated and has to do with the actual application. What would be the alternative way to do the same thing (i.e., run many scripts automatically, with parallelization inside each one)? – Diogo Oct 27 '16 at 01:19
  • For @OttToomet, I am running on a cluster. It fails both on the cluster and on localhost. Fork clusters do not work either. The reason I am using SOCK is here: https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/makeCluster.html – Diogo Oct 27 '16 at 01:19
  • If you are on a multi-node cluster, I very much recommend using MPI: either Rmpi (if there are not too many workers) or, better, Rhpc clusters. – Ott Toomet Oct 27 '16 at 01:21
  • Can you tell us a bit more about the cluster? In particular, does it have a scheduler? Do you need job scripts? And what does 'when you run it on Mac' mean? – Ott Toomet Oct 27 '16 at 01:22
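
For reference, a minimal sketch of the fork-based route mentioned in the comments above (Linux/macOS only): mclapply forks the current session and talks to its workers over pipes rather than a listening socket, so it should not contend for a port.

library(parallel)
l <- list(1:100, 1:100, 1:100, 1:100)
# forked workers: no socketConnection() call, hence no port to collide on
res <- mclapply(l, function(x) { Sys.sleep(2); sum(x) }, mc.cores = 4)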

1 Answer


You don't need to start multiple clusters to run the same code on multiple datasets. Just send the correct data to each node.

library(parallel)

# make 4 distinct datasets
df1 <- mtcars[1:8, ]
df2 <- mtcars[9:16, ]
df3 <- mtcars[17:24, ]
df4 <- mtcars[25:32, ]

# make the cluster
cl <- makeCluster(4)

clusterApply(cl, list(df1, df2, df3, df4), function(df) {
    # do stuff with df
    # each node will use a different subset of data
    lm(mpg ~ disp + wt, df)
})
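
clusterApply returns a list, so the call above gives you the four fitted models, one per dataset, in the same order as the input list.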

If you want the data to be persistent on each node, so you can use it for subsequent analyses:

clusterApply(cl, list(df1, df2, df3, df4), function(df) {
    # store df in the worker's global environment so it survives the call
    assign("df", df, globalenv())
    NULL
})

This creates a df data frame on each node, which will be unique to that node.
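
For example (a follow-on sketch; the model formula is just illustrative), subsequent calls can then refer to df directly on each node:

# each node fits a model on its own persistent copy of df
fits <- clusterEvalQ(cl, lm(mpg ~ disp + wt, df))
stopCluster(cl)  # shut down the workers when finished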

Hong Ooi