
I am using the srun command to submit a computational job on a Linux cluster, but the output data is duplicated. Here is the shell script for job submission.

#!/bin/bash
#SBATCH --partition=short
#SBATCH --job-name="vasp"
#SBATCH --nodes=2
#SBATCH --time=24:00:00
#SBATCH --constraint=ib
#SBATCH --exclusive
#SBATCH --err=std.err
#SBATCH --output=std.out
#----------------------------------------------------------#
export OMP_NUM_THREADS=1
#----------------------------------------------------------#
echo "The job "${SLURM_JOB_ID}" is running on "${SLURM_JOB_NODELIST}
#----------------------------------------------------------#
source /shared/centos7/intel/oneapi/2021.1_u9-base/setvars.sh
srun --ntasks=40 --hint=nomultithread --ntasks-per-node=20 --ntasks-per-socket=2 --ntasks-per-core=1 --mem-bind=v,local /work/bin/v_c

Here is the duplicated output data.

:: oneAPI environment initialized ::
MPI startup(): PMI server not found. Please set I_MPI_PMI_LIBRARY variable if it is not a singleton case.
MPI startup(): PMI server not found. Please set I_MPI_PMI_LIBRARY variable if it is not a singleton case.
MPI startup(): PMI server not found. Please set I_MPI_PMI_LIBRARY variable if it is not a singleton case.
...
       N       E                     dE             d eps       ncg     rms          rms(c)
       N       E                     dE             d eps       ncg     rms          rms(c)
DAV:   1     0.980384438844E+03    0.98038E+03   -0.43531E+04  6372   0.144E+03
DAV:   1     0.980384438844E+03    0.98038E+03   -0.43531E+04  6372   0.144E+03
DAV:   1     0.980384438844E+03    0.98038E+03   -0.43531E+04  6372   0.144E+03
...
DAV:  55    -0.911176657386E+02   -0.16384E-05   -0.23427E-04  6760   0.627E-02    0.587E-03
DAV:  54    -0.911176641002E+02   -0.12570E-05   -0.43068E-04  6600   0.795E-02    0.559E-03
DAV:  55    -0.911176657386E+02   -0.16384E-05   -0.23427E-04  6760   0.627E-02    0.587E-03
DAV:  56    -0.911176678701E+02   -0.21315E-05   -0.36418E-04  6648   0.730E-02    0.762E-03
DAV:  54    -0.911176641002E+02   -0.12570E-05   -0.43068E-04  6600   0.795E-02    0.559E-03
DAV:  54    -0.911176641002E+02   -0.12570E-05   -0.43068E-04  6600   0.795E-02    0.559E-03
DAV:  55    -0.911176657386E+02   -0.16384E-05   -0.23427E-04  6760   0.627E-02    0.587E-03

There should be only one copy of the output, like the following.

       N       E                     dE             d eps       ncg     rms          rms(c)
DAV:   1     0.980384438844E+03    0.98038E+03   -0.43531E+04  6372   0.144E+03
...
DAV:  54    -0.911176641002E+02   -0.12570E-05   -0.43068E-04  6600   0.795E-02    0.559E-03
DAV:  55    -0.911176657386E+02   -0.16384E-05   -0.23427E-04  6760   0.627E-02    0.587E-03
DAV:  56    -0.911176678701E+02   -0.21315E-05   -0.36418E-04  6648   0.730E-02    0.762E-03

Would anyone please help me modify my shell script to sort out this problem?

Many thanks.

  • The message is almost self-explanatory: you did not set the `I_MPI_PMI_LIBRARY` environment variable, so you are running 40 single-task MPI jobs in parallel instead of a single MPI job made of 40 tasks. – Gilles Gouaillardet Nov 05 '21 at 11:42
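For reference, a minimal sketch of what the comment suggests: point Intel MPI at Slurm's PMI library before the srun call. The library path below is an assumption and varies between clusters (it can often be located with `find /usr/lib64 -name "libpmi*"`); on sites where Slurm is built with PMI2 support, `srun --mpi=pmi2` can achieve the same effect.

# Assumed location of Slurm's PMI library -- adjust to your cluster's installation
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun --ntasks=40 --hint=nomultithread --mem-bind=v,local /work/bin/v_c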

1 Answer


Most likely you're not using the MPI version of VASP, so instead it starts two instances of the serial version on the two nodes you have allocated.
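A quick way to check is to look at what libraries the binary is linked against (using the binary path from the question); if nothing MPI-related shows up, it is most likely a serial build:

ldd /work/bin/v_c | grep -i mpi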

As an aside, `--ntasks-per-node=20 --ntasks-per-socket=2` looks nonsensical unless you really have nodes with 10 sockets.
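To spell out the arithmetic behind that remark: 20 tasks per node at 2 tasks per socket only works out if each node has 20 / 2 = 10 sockets, which essentially no real machine has.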

– janneb
  • Thank you for the reply. The cluster has ten CPUs on each node and I want to use two nodes for the calculation. Do you think the `--ntasks-per-node=20 --ntasks-per-socket=2` setting is fine or not? – Kieran Nov 05 '21 at 10:01
  • @Kieran: I really doubt you have a system with 10 sockets per node, and only two CPU cores per socket. I'd just remove those two options. In fact, you might want to remove the `--nodes=` as well, and set `--ntasks=2` until you have figured out your original problem. Once you get MPI working properly, you can start increasing `ntasks`. – janneb Nov 05 '21 at 11:27
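A minimal test script along the lines of that suggestion might look like the sketch below. The setvars.sh and binary paths are taken from the question; the PMI library path is an assumption that must be adapted to the cluster.

#!/bin/bash
#SBATCH --partition=short
#SBATCH --job-name="vasp"
#SBATCH --ntasks=2
#SBATCH --time=24:00:00
#SBATCH --error=std.err
#SBATCH --output=std.out

export OMP_NUM_THREADS=1
source /shared/centos7/intel/oneapi/2021.1_u9-base/setvars.sh
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so   # assumed path -- check your cluster
# Two tasks only, per the advice above; scale --ntasks up once MPI works.
srun /work/bin/v_c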