I have been trying to use multiple nodes in a PBS job to run several independent jobs. Each individual job is supposed to use 8 cores and each node in the cluster has 32 cores, so I would like each node to run 4 jobs at a time. My PBS script is as follows.
#!/usr/bin/env bash
#PBS -l nodes=2:ppn=32
#PBS -l mem=128gb
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -V
#PBS -l gres=ccm
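# $PBS_NODEFILE lists each node once per core; keep a single entry per node for --sshloginfile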
sort -u $PBS_NODEFILE > nodelist.dat
#cat ${PBS_NODEFILE} > nodelist.dat
export JOBS_PER_NODE=4
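# Run up to $JOBS_PER_NODE jobs concurrently on each node in nodelist.dat, working in the submission directory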
PARALLEL="parallel -j $JOBS_PER_NODE --sshloginfile nodelist.dat --wd $PBS_O_WORKDIR"
$PARALLEL -a input_files.dat sh test.sh {}
input_files.dat contains the names of the job files. I have successfully used this script to run parallel jobs on a single node, in which case I remove --sshloginfile nodelist.dat and the sort -u $PBS_NODEFILE > nodelist.dat line from the script (a minimal single-node variant is sketched further below for reference). However, whenever I try to run this script on more than one node, I get the following error.
ssh: connect to host 922 port 22: Invalid argument
ssh: connect to host 901 port 22: Invalid argument
ssh: connect to host 922 port 22: Invalid argument
ssh: connect to host 901 port 22: Invalid argument
Here, 922 and 901 are the numbers corresponding to the assigned nodes and are included in the nodelist.dat ($PBS_NODEFILE) file.
I tried to search for this problem but couldn't find much, as everyone else seems to be doing fine with the --sshloginfile argument, so I am not sure whether this is a system-specific problem.
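For reference, here is a minimal sketch of the single-node variant mentioned above (the same script with the sort -u line and the --sshloginfile option removed, and assuming a single node is requested):
#!/usr/bin/env bash
#PBS -l nodes=1:ppn=32
#PBS -l mem=128gb
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -V
#PBS -l gres=ccm
export JOBS_PER_NODE=4
# Without --sshloginfile, parallel runs all jobs locally on the allocated node
PARALLEL="parallel -j $JOBS_PER_NODE --wd $PBS_O_WORKDIR"
$PARALLEL -a input_files.dat sh test.sh {}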
Edit:
As @Ole Tange mentioned in his answer and comments, I need to modify the "node number" as produced by $PBS_NODEFILE, which I am doing in the following way inside the PBS script.
# provides a unique number (say, 900) associated with the node.
sort -u $PBS_NODEFILE > nodelist.dat
# changes the contents of the nodelist.dat from "900" to "username@w-900.cluster.uni.edu"
sed -i -r "s/([0-9]+)/username@w-\1.cluster.uni.edu/g" nodelist.dat
I verified that nodelist.dat contains only one line, namely username@w-900.cluster.uni.edu.
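As an extra sanity check (not part of the original script, just a quick test), the entries in nodelist.dat can be tried over ssh before parallel uses them:
# Quick connectivity check for every entry in nodelist.dat
while read login; do ssh -o BatchMode=yes "$login" hostname; done < nodelist.dat
# The same check via GNU parallel itself
parallel --nonall --sshloginfile nodelist.dat hostname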
Edit-2:
It seems the cluster's architecture is responsible for the error I am getting. I ran the same script on a different cluster (call it cluster_2), and it finished without any errors. In my sysadmin's words, the reason it works on cluster_2 is: "cluster_2 is a single machine. Once your job starts, you are actually on the head node of your PBS job like you would expect."
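For comparison, a quick ad-hoc check (not part of the script above) shows where the job script itself executes on each cluster:
# Compare the node running this script with the first node PBS allocated;
# on a cluster where the job starts on its own head node, these should match.
echo "Script is running on: $(hostname)"
echo "First allocated node: $(head -n 1 $PBS_NODEFILE)"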