This is driving me crazy. The PBS script below works fine except for the cd command: if the line cd $PBS_O_WORKDIR is uncommented, the job runs forever on the cluster.

#PBS -lnodes=1:ppn=8
#PBS -lwalltime=48:00:00
#PBS -S /bin/bash
echo $PBS_O_WORKDIR
#cd $PBS_O_WORKDIR
cat $PBS_NODEFILE
export THIS_HOST=$(hostname)
echo Hello World from host $THIS_HOST

Note: I submit the job with qsub test.bash

Returned output (if cd $PBS_O_WORKDIR is commented):

/scratch/users/angela/mpi_test
au01.cluster
au01.cluster
au01.cluster
au01.cluster
au01.cluster
au01.cluster
au01.cluster
au01.cluster
Hello World from host au01

Edited code with mpiexec line added:

#PBS -lnodes=1:ppn=8
#PBS -lwalltime=48:00:00
#PBS -S /bin/bash
echo $PBS_O_WORKDIR
#cd $PBS_O_WORKDIR
cat $PBS_NODEFILE
export THIS_HOST=$(hostname)
echo Hello World from host $THIS_HOST
NPROC=2
mpiexec -n $NPROC -hostfile $PBS_NODEFILE -mca plm_tm_verbose 1 hostname

In this case, an error message is returned:

[au01:47000] mca: base: component_find: unable to open /soft/openmpi/1.6.4/intel-13.1.1/lib/openmpi/mca_plm_tm: libtorque.so.2: cannot open shared object file: No such file or directory (ignored)
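
The message says the dynamic loader could not find libtorque.so.2 while opening the TM launcher plugin, and the trailing "(ignored)" suggests Open MPI skips that component rather than aborting. A minimal sketch for listing what the plugin fails to resolve on the compute node, assuming the plugin file carries a `.so` suffix (the error message prints the path without one) and that `ldd` is available there; `missing_libs` is a hypothetical helper name:

```shell
#!/bin/bash
# Read `ldd` output on stdin and print each library the loader cannot resolve.
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Path taken from the error message; the ".so" suffix is an assumption.
plugin=/soft/openmpi/1.6.4/intel-13.1.1/lib/openmpi/mca_plm_tm.so

if [ -e "$plugin" ]; then
    ldd "$plugin" | missing_libs
else
    echo "plugin not found on this host: $plugin" >&2
fi
```

If libtorque.so.2 shows up in the output, the directory holding the Torque libraries is probably missing from the loader's search path (for example `LD_LIBRARY_PATH`) inside the job environment.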
Roland
  • what happens when you properly double-quote your variables? `cd "$PBS_O_WORKDIR"` and `cat "$PBS_NODEFILE"` – Fravadona Jan 28 '22 at 09:21
  • I don't think you need the `cd $PBS_O_WORKDIR` if `$PBS_NODEFILE` contains the absolute path. – A-Tech Jan 28 '22 at 09:29
  • Nope. Not working. By the way, `cat $PBS_NODEFILE` with no double-quote works fine. – Roland Jan 28 '22 at 09:38
  • Please add output of `echo $PBS_O_WORKDIR` to your question (no comment here). – Cyrus Jan 28 '22 at 09:44
  • Your cluster is broken; it's a kind of network issue – Fravadona Jan 28 '22 at 09:44
  • Output added to the question, as suggested by @Cyrus. Plus, question edited with the code including the `mpiexec` command. @Fravadona: it could be (please see the edited question). @A-Tech: Yes, but even if not needed, that shouldn't freeze the code on the cluster, I think. – Roland Jan 28 '22 at 09:59
  • I have just checked with a colleague, and the cluster is currently working. – Roland Jan 28 '22 at 10:16
  • The most probable reason that your script _stalls_ when `cd`ing to `/scratch/users/angela/mpi_test` is that the corresponding network mount is not working properly from `au01.cluster` – Fravadona Jan 28 '22 at 10:31
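
The stalled-mount hypothesis from the last comment can be checked from inside the job before changing directory. A minimal sketch, assuming GNU coreutils `timeout` is installed on the compute node; `probe_dir` is a hypothetical helper name and the 10-second default is arbitrary:

```shell
#!/bin/bash
# Probe a directory with a time limit so that a stalled mount
# produces a message instead of a job that never finishes.
# Usage: probe_dir DIR [SECONDS]
probe_dir() {
    local dir=$1 limit=${2:-10}
    if timeout "$limit" ls -d -- "$dir" >/dev/null 2>&1; then
        echo "reachable: $dir"
    else
        echo "unreachable or timed out after ${limit}s: $dir" >&2
        return 1
    fi
}

# PBS sets PBS_O_WORKDIR inside a job; fall back to the current directory elsewhere.
probe_dir "${PBS_O_WORKDIR:-$PWD}" && cd "${PBS_O_WORKDIR:-$PWD}"
```

One caveat: on a hard-mounted network filesystem the probing process may block in a state that ignores signals, in which case the probe itself can hang as well. A timeout message therefore points towards the mount, but a hang does not rule it out.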

1 Answer

This seems to be related to PBS hanging (see https://github.com/rmodrak/seisflows/issues/18). No idea how to fix it.

Roland