I'm trying to set up a cluster across machines on a PBS-managed cluster. I'm perfectly able to compute within one node by saying julia -p 12 (after having reserved one node with 12 CPUs).
I understand that to use several machines, I have to add them to the master process with addprocs. I was able to do that on a different cluster (SGE), but on this one something is going wrong.
You can see everything I'm doing, including submit scripts etc., on this branch of a GitHub repo.
To get a list of machines, I parse the PBS_NODEFILE, which, for a submit script with the option
#PBS -l nodes=2:ppn=12 # give me 2 nodes with 12 processors each
looks something like this:
red0004
red0004
...
red0004
red0347
...
red0347
I parse this file with bind_pe_procs() in sge.jl in the repo and pass a vector of machine names to addprocs. When I submit this, I get an SSH error; I put the resulting output up as a gist. I don't know what it means.
Does this have to do with a system setting, i.e. do I have to talk to the sysadmin about SSH between machines? What are the right questions to ask?
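For reference, the nodefile parsing amounts to roughly the following. This is only a minimal sketch, not the actual bind_pe_procs() from the repo: the names read_nodefile and drop_host are made up for illustration, and it assumes a recent Julia (1.x) where eachline accepts a file path.

```julia
# Hypothetical helper (NOT the repo's bind_pe_procs()): read a PBS nodefile,
# which lists one hostname per line, repeated once per reserved core.
# Returns the hostnames in file order, skipping blank lines.
function read_nodefile(path::AbstractString)
    return String[String(strip(line)) for line in eachline(path)
                  if !isempty(strip(line))]
end

# Hypothetical helper: remove every entry belonging to one host,
# e.g. the host the master process is running on.
drop_host(hosts::Vector{String}, master::AbstractString) =
    filter(h -> h != master, hosts)

# Intended use inside the PBS job (assumes PBS_NODEFILE and HOST are set):
#   hosts    = read_nodefile(ENV["PBS_NODEFILE"])
#   machines = drop_host(hosts, ENV["HOST"])
#   addprocs(machines)   # this is the call that produces the SSH error
```

Note that dropping the master host this way removes all of its repeated entries, not just one.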
I am also unsure about what exactly I have to give to addprocs(). I don't want to add the master process (I don't want worker 1 SSHing into itself?), so I exclude ENV["HOST"] = node001 from my list. But what about all the processors with the same name, node002? Do I list all of those,
machines = [ "red0347" for i=1:12]
or list the name just once,
machines = ["red0347"]
in
addprocs(machines)
Thanks!