
I have the following question.

Why does everything run fine when I submit the job on a standard node (maximum 56 cores), but I get an error when I submit the same job/code to the large_memory node (maximum 128 cores)?

Parallelization code in R:

library(parallel)  # provides detectCores() and makeCluster()

no_cores <- detectCores() - 1  # use all cores on the node but one
cl <- makeCluster(no_cores, outfile = paste0('./info_parallel.log'))

Error

Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
  cannot open the connection

Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster -> 
  socketConnection

In addition: Warning message:

In socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
  localhost:11232 cannot be opened
Execution halted

Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted

Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted

As I said, the R code runs fine on the standard nodes, so I assume the problem lies with the large_memory node. What could it be?

  • Many of the answers a Google search turns up say something like: a firewall may stand between the compute node and the login node, or the login node may not allow connections to port 11232 on the compute node. I tried ssh-ing to the compute node from the login node and running the R code directly on the compute node, and also opening port 11232, but I still got the same errors. – Helen Liu May 10 '17 at 18:56

1 Answer


Finally, I solved it.

The error was caused by R's built-in limit on the number of connections a single session can have open, which defaults to 128. Each worker started by makeCluster() occupies one socket connection, so this limit effectively caps the number of cores the code can use per node.

In the code, the error is raised at the cl <- makeCluster(...) line:

no_cores <- detectCores() - 1

cl <- makeCluster(no_cores, outfile=paste0('./info_parallel.log'))

Here, detectCores() returns the total number of cores available on the node.

On the cluster's standard nodes, the number of cores per node is less than 128, which is why the code runs fine there. On the large_memory partition, however, each node has 128 cores, so detectCores() - 1 asks for 127 worker connections, which exceeds the default limit. That is why the error reads:

cannot open the connection

Setting the number of workers to 120 for jobs on the large_memory node (maximum 128 cores) stayed under the limit, and the code ran without errors:

cl <- makeCluster(120, outfile=paste0('./info_parallel.log'))
#                 ^^^
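A more portable variant is to cap the worker count at the connection ceiling instead of hardcoding a number; a minimal sketch, assuming the 125-worker limit described in the comment below:

library(parallel)

# Cap workers at R's connection ceiling: 128 connections minus the 3
# always-open standard connections (stdin/stdout/stderr) = 125 workers.
no_cores <- min(detectCores() - 1, 125L)
cl <- makeCluster(no_cores, outfile = './info_parallel.log')
# ... parallel work (parLapply(), clusterApply(), ...) ...
stopCluster(cl)  # close the worker sockets when finished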

Thanks!

  • This is correct; there's a hardcoded number of connections that one R session can have open at any time, NCONNECTIONS = 128, of which three are always occupied, so you can have at most 125 parallel PSOCK workers (given that you don't have any open files or other connections). If you're willing to rebuild R from source, you can increase this limit manually; I've successfully tried 900. For more details on this issue ("you're not alone"), see https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28 – HenrikB May 12 '17 at 03:32
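To see how close a session already is to that ceiling before starting a cluster, a minimal sketch (assuming the stock NCONNECTIONS = 128 build discussed above):

open_now <- nrow(showConnections(all = TRUE))  # includes stdin/stdout/stderr
headroom <- 128 - open_now                     # one socket per PSOCK worker
cat(open_now, "connections open;", headroom, "more could still be opened\n")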