I am running a physics solver that was written for hybrid OpenMP/MPI parallelization. The job manager on our cluster is SLURM. Everything works as expected when I run in pure MPI mode. However, as soon as I try to use hybrid parallelization, strange things happen:
1) First I tried the following SLURM block:
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16
(hint: 16 is the number of physical cores per processor on the cluster)
However, the simulation then runs on 4 nodes and on each of them I see only 4 cores in use (in htop). Moreover, the solver tells me it was started on 16 cores, which I do not really understand; I think it should be 8*16 = 128.
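To spell the numbers out: 8 tasks * 16 CPUs per task should be 128 cores in total, spread over 4 nodes (2 tasks per node), i.e. 32 busy cores per node. As a sanity check I could add something like the following to the script (only standard SLURM environment variables) to see what SLURM actually grants:

# print what SLURM actually hands to the job
echo "Nodes in job:      $SLURM_JOB_NUM_NODES"
echo "Total tasks:       $SLURM_NTASKS"
echo "CPUs per task:     $SLURM_CPUS_PER_TASK"
echo "CPUs on this node: $SLURM_CPUS_ON_NODE"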
2) As the above was not successful, I added the following block to my SLURM script:
if [ -n "$SLURM_CPUS_PER_TASK" ]; then
  omp_threads=$SLURM_CPUS_PER_TASK
else
  omp_threads=1
fi
export OMP_NUM_THREADS=$omp_threads
Now the solver tells me that it was started on 128 cores. But when I watch htop on the respective nodes, it becomes obvious that these OpenMP threads all run on the same cores, so the solver is extremely slow. The developer of the code told me he never used the block I added, so there might be something wrong with it, but I do not understand why the OpenMP threads share the same cores; in htop the threads do seem to be there. Another strange thing is that htop shows me 4 active cores per node... I would have expected either 2 (for the 2 MPI tasks per node) or, if everything went as planned, 32 (2 MPI tasks running 16 OpenMP threads each).
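In case it helps to narrow this down, I could add something like the following to the job script to see which CPUs each task is actually allowed to use (assuming srun is the right launcher here, which I am not sure about):

# make libgomp print its settings (thread count, proc bind, places) when the OpenMP runtime starts
export OMP_DISPLAY_ENV=true
# print the CPU mask each task is confined to, one line per task
srun --cpu-bind=verbose bash -c 'grep Cpus_allowed_list /proc/self/status'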
We already had an issue once because the developer uses an Intel Fortran compiler and I use a GNU Fortran compiler (mpif90 and mpifort, respectively).
Does anyone have an idea how I can make my OpenMP threads use all the available cores instead of only a few?
Some system / code info:
Linux distro: OpenSUSE Leap 15.0
Compiler: gfortran (via the mpif90 wrapper)
Code: Fortran 90
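If it helps, I can also post the output of the following, to document which compiler actually sits behind the wrapper and how the cores of a node are laid out:

# compiler the MPI wrapper actually calls
mpif90 --version
# socket / core / thread layout of a node
lscpu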