
First off, I have keypairs; this is not a passphrase question, though SSH is involved.

I also have MPICH, Hydra, SLURM and lamd ... this is a cluster computing question.

Node0 will boot, but node1 hangs. I have had this problem for days now. My NFS mirror works just fine, and I can run Game of Life on 8 cores on node2 ... that is really cool too, just ask me about it...

BUT, when I want to run on all three nodes together, I hit a password request from each node as node0 uses ssh to send the processes. Again, not a passphrase problem: HYDRA (slurm and lamd as well) wants my user password from node1, basically my login credential. I can change that to an MPICHuser account; however, the dilemma will remain.

Unless I create MPICHusers on all three nodes without passwords at all ... can that be done? It seems like the epitome of a security risk.

So the question is: can I automate the password credential whenever the @ prompt pops up, in a way that won't hang lamboot?
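For what it's worth, the standard way to automate that credential is key-based SSH rather than passwordless accounts: node0's public key goes into `~/.ssh/authorized_keys` on every node, and no password is ever asked. A minimal sketch, assuming the same user (`me` here, with the node names from the question) exists on all three nodes:

```
# on node0: generate a key with an empty passphrase (skip if one already exists)
ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""

# push the public key to every node, including node0 itself,
# since the launcher may ssh back into the local host too
for host in node0 node1 node2; do
  ssh-copy-id "me@$host"
done

# verify: BatchMode forbids prompting, so this fails loudly if a key is missing
ssh -o BatchMode=yes me@node1 hostname
```

If that last command prints the hostname without any prompt, Hydra's launcher should never see a password request either.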

It is late; looking at what I have makes me wonder if Slurm is the new culprit.

Here is more or less what I am looking at:

```
me@wherever:/mirror/GameOfLife$ mpiexec.hydra -f /mirror/machinefile -n 10 ./life 10 10 30
[mpiexec@wherever] HYDU_process_mfile_token (utils/args/args.c:296): token node0 not supported at this time
[mpiexec@wherever] HYDU_parse_hostfile (utils/args/args.c:343): unable to process token
[mpiexec@wherever] mfile_fn (ui/mpich/utils.c:336): error parsing hostfile
[mpiexec@wherever] match_arg (utils/args/args.c:152): match handler returned error
[mpiexec@wherever] HYDU_parse_array (utils/args/args.c:174): argument matching returned error
[mpiexec@wherever] parse_args (ui/mpich/utils.c:1596): error parsing input array
[mpiexec@wherever] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1648): unable to parse user arguments
[mpiexec@wherever] main (ui/mpich/mpiexec.c:153): error parsing parameters
me@wherever:/mirror/GameOfLife$
```
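For reference, the `token node0 not supported at this time` line means Hydra choked while parsing the machinefile, before any ssh happened. A Hydra machinefile is one host per line, optionally followed by a colon and a process count; a hypothetical version for these three nodes might look like:

```
# /mirror/machinefile -- hostname[:process_count], one host per line
# (the counts below are placeholders; match them to the real core counts)
node0:4
node1:4
node2:8
```

Stray characters on a line, such as an extra column or Windows line endings, can trigger exactly the `unable to process token` chain above.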

  • Under the hood, a spawn tree might be used. That means that if you have 3 hosts node[0-2], node0 will ssh node1, and then node0 might ssh node2, or node1 might ssh node2. Bottom line: any host should be able to ssh any host – Gilles Gouaillardet Jul 30 '17 at 13:45
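That spawn-tree requirement is easy to test. A hypothetical check, run from any node, walks every host-to-host pair; `BatchMode=yes` makes ssh fail instead of prompting, so a missing key shows up immediately:

```
# every pair must print OK; a FAILED line pinpoints the broken key
for src in node0 node1 node2; do
  for dst in node0 node1 node2; do
    if ssh -o BatchMode=yes "$src" ssh -o BatchMode=yes "$dst" true; then
      echo "$src -> $dst OK"
    else
      echo "$src -> $dst FAILED"
    fi
  done
done
```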

1 Answer


That is not the problem. I am looking toward Slurm compatibility. Several things happen at nearly the same time, in a specific order. The handler has to take terminal control in an instant so the master node can begin sending. Before I added Slurm, the Hydra machinefile was working, but node0 could not "grab" the keyboard. Where should Slurm look for an equivalent file? I am wondering if I should remove Hydra.
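As for where Slurm looks: Slurm has no machinefile. The node list lives in `slurm.conf`, which must be the same file on every node. A hypothetical excerpt for this cluster (the CPU counts are placeholders, and the config file's location varies by distribution):

```
# slurm.conf -- identical copy on node0, node1 and node2
NodeName=node[0-2] CPUs=4 State=UNKNOWN
PartitionName=cluster Nodes=node[0-2] Default=YES MaxTime=INFINITE State=UP
```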

  • Please, avoid using answers for comments on your questions or for discussion. Use editing of the original post, comments or chat instead. – Gasper Jul 31 '17 at 14:54
  • Which MPI are you using? Can you also post the content of your `/mirror/machinefile`? – Gilles Gouaillardet Aug 01 '17 at 01:48
  • Last night I saw that Slurm created a user account on each machine. The machine file is nothing spectacular, just the node names and CPU count. Hydra finds it. There are just too many things happening at once. I use a KVM switch to share the terminal, mouse and keyboard. Slurm isn't used the way I expected. I can try srun tonight. But I already know the munge boot only echoed to one machine, not all three. A few weeks ago I had lamd working most of the time, before I added the KVM. The mpich is older; I did that intentionally. It is 3.0.1, and I downloaded the hydra that was beside it. – Jonathan Engwall Aug 01 '17 at 22:53
  • My understanding is that hydra uses ssh regardless of whether you are running under SLURM or not (fwiw, mpirun uses `srun` under SLURM). The error message you initially posted suggests a syntax error in your machinefile. Note that if you are running under SLURM, you do not even need a machinefile at all; see the sketch after this thread ... – Gilles Gouaillardet Aug 02 '17 at 02:15
  • That explains "not supported." Thank you. I don't have time until the weekend. I will tell you how it goes. – Jonathan Engwall Aug 03 '17 at 02:02
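Following up on that last comment: under Slurm the allocation itself supplies the host list, so the `-f machinefile` argument disappears entirely. A hypothetical invocation, assuming MPICH was built with Slurm's PMI support:

```
# ask Slurm for 3 nodes / 10 tasks and let srun handle the launching
salloc --nodes=3 --ntasks=10 srun ./life 10 10 30
```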