I'm trying to set up a cluster across machines on a PBS-managed cluster. I can compute within one node just fine by running julia -p 12 (after reserving one node with 12 CPUs).

I understand that to use several machines, I have to add them to the master process with addprocs. I was able to do that on a different cluster (SGE), but on this one something is going wrong.

You can see everything I'm doing, including submit scripts etc., on this branch of a GitHub repo.

To get a list of machines, I parse the PBS_NODEFILE, which for a submit script with the option

#PBS -l nodes=2:ppn=12  # give me 2 nodes with 12 processors each

looks something like this:

red0004
red0004
...
red0004
red0347
...
red0347

I parse this file with bind_pe_procs() in sge.jl in the repo and give a vector of machine names to addprocs. When I submit the job I get an error; I put up a gist with the resulting SSH error. I don't know what it means.

  • Does this have to do with a system setting, i.e. do I have to talk to the sys admin about SSH between machines? What are the right questions to ask?

  • I am unsure what exactly I have to give to addprocs(). I don't want to add the master process (I don't want worker 1 SSHing into itself?), so I exclude ENV["HOST"] = node001 from my list. But what about all the processors with the same name, node002? Do I list all of those

    machines = [ "red0347" for i=1:12]
    

    or just once

    machines = ["red0347"]
    

    in addprocs(machines)?
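
For the second question, a minimal sketch of how the node file could be turned into the vector for addprocs. As far as the Julia documentation describes it, addprocs starts one worker per entry in the vector, so a host listed 12 times gets 12 workers (a `("red0347", 12)` tuple is the documented shorthand for the same thing). The helper name `worker_hosts` is hypothetical and is not the repo's bind_pe_procs(); dropping a single slot for the master's host is an assumption, and all of that host's entries could be filtered out instead.

```julia
# Hypothetical helper (not the repo's bind_pe_procs()): turn the lines of a
# PBS node file into the host list to hand to addprocs().
function worker_hosts(nodefile_lines, master_host)
    # One host name per non-empty line; PBS repeats a host once per reserved core.
    hosts = [String(strip(l)) for l in nodefile_lines if !isempty(strip(l))]
    # Assumption: drop a single slot on the master's host, because the master
    # process already occupies one core there.
    i = findfirst(==(master_host), hosts)
    i === nothing || deleteat!(hosts, i)
    return hosts
end

# Stand-in for readlines(ENV["PBS_NODEFILE"]) with nodes=2:ppn=12.
lines = vcat(fill("red0004", 12), fill("red0347", 12))
machines = worker_hosts(lines, "red0004")

# On the cluster itself (needs passwordless SSH between the nodes):
#   using Distributed      # Julia 1.x; not needed on the 0.3-era releases
#   addprocs(machines)
```

Keeping the duplicates is what distinguishes the CPUs: the workers on one host are separate Julia processes with their own worker ids, even though the host name is identical.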

thanks!

Florian Oswald
  • Unfortunately, I don't have a way to test PBS, but it looks to me like the problem is not PBS itself, but rather just ssh. Can you do a quick test if possible and explore what happens if you set up a [login via public key](http://stackoverflow.com/questions/7260/how-do-i-setup-public-key-authentication)? – waTeim Aug 05 '14 at 00:39
  • The sys admin got back to me saying they only support rsh from node to node. I tried that and it actually works. Are you familiar with how Julia starts remote workers? Is it a hell of a hack to implement an rsh worker? I asked whether they might consider allowing ssh but I have little hope. I also suspect they overwrote the id_rsa.pub on each node with my publickey, so I've got a dysfunctional ssh key pair on each node. Can I as non root generate new keys on each host? I doubt it. Thanks for your help! – Florian Oswald Aug 05 '14 at 08:13
  • Re Q1; not sure I'll check. Re Q2, if you are talking about the pub key in your home directory to allow you to log in without a password then sure, you can regenerate that. If you're talking about the host key to establish host identity, no you'll need help to change it, but that might not matter as you might be able to disable the host check on the client side. – waTeim Aug 05 '14 at 09:26
  • great - thanks. just to make sure I understand: If I have a machine with 12 CPUs (as above), but they all have the same name ("red0347", say), do I give "red0347" to addprocs() only once, or 12 times? I'm confused as to how one can distinguish different processes (i.e. CPUs) when they have the same name. – Florian Oswald Aug 05 '14 at 12:53
  • There could be multiple things needing to be set, I'd like to postpone answering that until after you're able to ssh to those hosts. – waTeim Aug 05 '14 at 14:07
  • hey! good news: i got the ssh going. can do `ssh node2` from `node1`. i've now got this weird problem that i cannot start julia on the other node because apparently an environment var is not set. I set `module load gcc/4.8.1` in the submit script, but it cannot find the libraries. btw: [the code is here](https://github.com/floswald/mpitest/tree/master/julia/iridis) – Florian Oswald Aug 05 '14 at 15:33
  • Hmm, not sure about that value of LD_LIBRARY_PATH. How about LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/eisuc151/local/lib/julia as a guess (or wherever the library libjulia.so is located) – waTeim Aug 05 '14 at 15:39
  • oh you think? it was complaining about not finding libstd++6.0.so (and actually when i unload the gcc module on the header i get the same). but worth a try. let me see – Florian Oswald Aug 05 '14 at 15:49
  • Oh, heh, libstd++.so.6 should be in /usr/lib or perhaps /usr/local/lib, but PBS is strange and wondrous to me, you're more knowledgeable about it. – waTeim Aug 05 '14 at 15:53

0 Answers