I am looking for help with an issue I cannot figure out. I am working on a Slurm cluster and have a Python script 'SOLVER.py' that calls mpirun internally (it launches a numerical spectral-element simulation). Each node on the cluster has 40 processors. As an example, I would like to request 5 nodes and run 'SOLVER.py' once on each node (5 instances in parallel) with 40 processors each. The following job script for a single node
#!/bin/bash
#
#SBATCH --job-name=solv
#SBATCH --comment="example"
#SBATCH --exclusive
#SBATCH --workdir=. ### directory where the job should be executed from
#SBATCH --mem=150000 ### RAM in MB required for the job (this is mandatory!!!)
#SBATCH --nodes=1 ### Node count required for the job
#SBATCH --output=slurm.%j.out ### output file for console output
#SBATCH --partition=long ### partition where the job should run (short|medium|long|fatnodes)
# ...
#export OMP_NUM_THREADS=160
python SOLVER.py
works fine. Now, what is the correct way to run the script 5 times on 5 nodes, one instance per node? I have tried many different things (varying ntasks, different srun combinations, and a plugin called jug), but I always run into different problems.
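For reference, one of the srun variants I tried looked roughly like this (reconstructed from memory, so the exact option combination is my best guess, not a known-good script):

#!/bin/bash
#SBATCH --job-name=solv
#SBATCH --exclusive
#SBATCH --mem=150000
#SBATCH --nodes=5 ### request 5 whole nodes
#SBATCH --ntasks=5 ### one task (= one SOLVER.py instance) per node
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=40 ### give each instance a full 40-core node
#SBATCH --output=slurm.%j.out
#SBATCH --partition=long

### launch one copy of the script on each of the 5 nodes; each copy
### should then have 40 CPUs available for its internal mpirun
srun --ntasks=5 --ntasks-per-node=1 python SOLVER.py

My understanding is that srun should place one task per node here, but I am not sure how this interacts with the mpirun calls inside SOLVER.py.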
Could someone help me? :)
Best regards,
Max