
I have been trying to use multiple nodes in my PBS script to run several independent jobs. Each individual job is supposed to use 8 cores and each node in the cluster has 32 cores. So, I would like to have each node run 4 jobs. My PBS script is as follows.

#!/usr/bin/env bash
#PBS -l nodes=2:ppn=32
#PBS -l mem=128gb
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -V
#PBS -l gres=ccm

sort -u $PBS_NODEFILE > nodelist.dat
#cat ${PBS_NODEFILE} > nodelist.dat

export JOBS_PER_NODE=4  

PARALLEL="parallel -j $JOBS_PER_NODE --sshloginfile nodelist.dat --wd $PBS_O_WORKDIR"
$PARALLEL -a input_files.dat sh test.sh {}

input_files.dat contains the names of the job files. I have successfully used this script to run parallel jobs on one node (in which case I remove --sshloginfile nodelist.dat and sort -u $PBS_NODEFILE > nodelist.dat from the script). However, whenever I try to run this script on more than one node, I get the following error:
ssh: connect to host 922 port 22: Invalid argument
ssh: connect to host 901 port 22: Invalid argument
ssh: connect to host 922 port 22: Invalid argument
ssh: connect to host 901 port 22: Invalid argument
Here, 922 and 901 are the numbers corresponding to the assigned nodes and are included in the nodelist.dat ($PBS_NODEFILE) file.
I searched for this problem but couldn't find much, as everyone else seems to be doing fine with the --sshloginfile argument, so I am not sure whether this is a system-specific problem.

Edit:

As @Ole Tange mentioned in his answer and comments, I need to modify the "node number" as produced by $PBS_NODEFILE, which I am doing in the following way inside the PBS script.

# provides a unique number (say, 900) associated with the node.
sort -u $PBS_NODEFILE > nodelist.dat

# changes the contents of the nodelist.dat from "900" to "username@w-900.cluster.uni.edu"
sed -i -r "s/([0-9]+)/username@w-\1.cluster.uni.edu/g" nodelist.dat

I verified that nodelist.dat then contains only one line, namely username@w-900.cluster.uni.edu.
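For reference, the same rewrite can be wrapped in a small function that anchors the pattern to whole lines, so entries that are not bare numbers are left alone. This is only a sketch: the function name `make_loginfile` is hypothetical, and the user name, the `w-` prefix and the domain are the placeholder values from this question, not names that are known to resolve on the cluster.

```shell
# Hypothetical helper: print one user@host line per unique numeric node ID.
# Arguments: node file (or - for stdin), user name, host prefix, domain.
# All four values are assumptions supplied by the caller.
make_loginfile() {
    # sort -u removes the repeated entries PBS writes (one per core);
    # the anchored pattern only rewrites lines that are entirely digits.
    sort -u "$1" | sed -E "s/^([0-9]+)\$/${2}@${3}\1.${4}/"
}

# Example with a made-up node list on stdin:
printf '900\n900\n901\n' | make_loginfile - username w- cluster.uni.edu
# username@w-900.cluster.uni.edu
# username@w-901.cluster.uni.edu
```

In the job script this would be called as `make_loginfile "$PBS_NODEFILE" username w- cluster.uni.edu > nodelist.dat`. It only helps if the generated host names actually resolve from the node where the job starts.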

Edit-2:

It seems like the cluster's architecture is responsible for the error I am getting. I ran the same script on a different cluster (say, cluster_2), and it finished without any errors. In my sysadmin's words, the reason why it works on cluster_2 is: "cluster_2 is a single machine. Once your job starts, you are actually on the head node of your PBS job like you would expect."

tobiuchiha
  • I think you need to post the actual content of `nodelist.dat` to get a useful answer. Your question does not live up to MCVE, so you will need to provide as much actual evidence as possible. – Ole Tange Jan 19 '19 at 13:41

1 Answer


The variable $PARALLEL is used by GNU Parallel for options, so using it for your own purposes is likely to cause confusion. It does not seem to be the root cause here, but do yourself a favor and use another variable name (or use it as described in the man page).
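One way to avoid the name clash is a small shell function instead of a variable. The following is a minimal sketch, not a tested job script: the function name `run_jobs` is made up for illustration, and the options are the ones from the question.

```shell
# Value taken from the question; PBS_O_WORKDIR is set by PBS inside a job.
JOBS_PER_NODE=4

# Hypothetical wrapper replacing the PARALLEL variable, so that the
# $PARALLEL name stays free for GNU Parallel's own use.
run_jobs() {
    parallel -j "$JOBS_PER_NODE" --sshloginfile nodelist.dat --wd "$PBS_O_WORKDIR" "$@"
}

# Usage inside the job script:
# run_jobs -a input_files.dat sh test.sh {}
```

A function also keeps the arguments quoted correctly, which an unquoted string variable does not if a path contains spaces.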

The problem here seems to be ssh, which does not treat a bare number as a hostname:

$ ssh 8
ssh: connect to host 8 port 22: Invalid argument

Add the domain name, and ssh will see it as a hostname:

$ ssh 8.pi.dk
<<connects>>

If I were you I would talk to your cluster admin and ask if the worker nodes could be renamed to w-XXX, where XXX is their current name.
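Until the names are sorted out, the numeric-hostname problem described above can at least be caught before any ssh connection is attempted. A small sketch, assuming the login file holds one entry per line; the function name `has_numeric_hosts` is hypothetical:

```shell
# Hypothetical check: succeeds if the file has at least one entry that
# consists only of digits, which ssh will not accept as a hostname.
has_numeric_hosts() {
    grep -Eq '^[0-9]+$' "$1"
}

# Usage inside the job script:
# if has_numeric_hosts nodelist.dat; then
#     echo "nodelist.dat contains bare node numbers" >&2
#     exit 1
# fi
```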

Ole Tange
  • Thank you for the useful suggestions. Just to be clear, if I log in to my account using `username@cluster.uni.edu`, then the node number (say, 900) should be specified as `900@cluster.uni.edu`. I did try this, but I got another error saying `Host key verification failed`. But I think you are right, I should email my cluster admin. – tobiuchiha Jan 18 '19 at 23:23
  • @tobiuchiha That is an unconventional way of doing that, and may be what is confusing `ssh`. Normally you would log in with something like `username@w-900.cluster.uni.edu`. This way the sysadmin can see which user is running which program. – Ole Tange Jan 19 '19 at 02:19
  • I tried `username@w-900.cluster.uni.edu`, but I got the error `ssh: Could not resolve hostname w-900.cluster.uni.edu: Name or service not known`. I am not very familiar with this. Are there any other ways in which the domain name can be added? – tobiuchiha Jan 19 '19 at 05:34
  • Did you ask your sysadmin to rename the hosts first? Otherwise it obviously will not work. – Ole Tange Jan 20 '19 at 12:38
  • I did contact my sysadmin. I don't think he is very familiar with GNU Parallel, but the gist of his reply was that the cluster's architecture does not support what I am trying to do. He did try to explain the reason behind it, but I couldn't follow much of it. Our university has three clusters, so I will give it a shot on the remaining two. But the cluster I was trying has the largest number of compute nodes, so I really wanted to make it work there. – tobiuchiha Jan 20 '19 at 20:04