
I am running a physics solver that was written to use hybrid OpenMP/MPI parallelization. The job manager on our cluster is SLURM. Everything goes as expected when I run in pure MPI mode. However, once I try to use hybrid parallelization, strange things happen:

1) First I tried the following SLURM block:

#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16

(hint: 16 is the number of physical cores on the processors on the cluster)

However, what happens is that the simulation runs on 4 nodes and I see only 4 busy cores on each of them (in htop). Moreover, the solver tells me it is started on 16 cores, which I do not really understand; it should be 8*16 = 128, I think.

2) As the above was not successful, I added the following loop to my SLURM script:

if [ -n "$SLURM_CPUS_PER_TASK" ]; then
  omp_threads=$SLURM_CPUS_PER_TASK
else
  omp_threads=1
fi
export OMP_NUM_THREADS=$omp_threads

What happens now is that the solver tells me it is started on 128 cores. But when using htop on the respective nodes, it becomes obvious that these OpenMP threads all use the same cores, so the solver is extremely slow. The developer of the code told me he never used the loop I added, so there might be something wrong with it, but I do not understand why the OpenMP threads use the same cores. In htop, however, the threads do seem to be there. Another strange thing is that htop shows me 4 active cores per node... I would have expected either 2 (for the 2 MPI tasks per node) or, if everything went as planned, 32 (2 MPI tasks running 16 OMP threads each).

We already had an issue once because the developer uses an Intel Fortran compiler and I use a GNU Fortran compiler (mpif90 and mpifort, respectively).

Does anyone have an idea how I can make my OpenMP threads use all of the available cores instead of only a few?

Some system / code info:

Linux distro: OpenSUSE Leap 15.0

Compiler: mpif90

Code: FORTRAN90

tre95

1 Answer


So, a few things. By using:

#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16

you tell SLURM that you want 8 tasks (i.e. MPI workers) with two of them per node, so it is normal that the code starts on 4 nodes.
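
If you want to double-check the layout you actually get, you can print a few SLURM environment variables and the host list from inside the batch script, before starting the solver. This is just a verification sketch, nothing the solver needs:

echo "nodes:          $SLURM_JOB_NUM_NODES"
echo "tasks:          $SLURM_NTASKS"
echo "tasks per node: $SLURM_NTASKS_PER_NODE"
echo "cpus per task:  $SLURM_CPUS_PER_TASK"
# one hostname per task; counting duplicates shows how many tasks land on each node
srun hostname | sort | uniq -c

With your first block this should report 4 nodes with 2 tasks each.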

Then, with --cpus-per-task=16, you give each MPI worker 16 CPUs for its OpenMP threads. You say:

Moreover the solver tells me it is started on 16 cores

Probably the solver only looks at the OpenMP threads, so it is normal for it to report 16. I don't know the details of your code, but usually, if you solve a problem on a grid, you split the grid into subdomains (one per MPI task) and solve each subdomain with OpenMP. So in your case you have 8 solvers running in parallel, each of them using 16 cores.
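
If you want to see that picture explicitly, you can let each task report its rank and thread count (this assumes OMP_NUM_THREADS has been exported as in your second attempt; SLURM_PROCID is set by srun for each task):

srun bash -c 'echo "MPI rank $SLURM_PROCID on $(hostname): $OMP_NUM_THREADS OpenMP threads"'

With 8 tasks and OMP_NUM_THREADS=16 this prints 8 lines, one per solver subdomain.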

The export OMP_NUM_THREADS=$omp_threads command and the if block you added are correct (by the way, this is an if block, not a loop).
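
For what it is worth, the same logic can be written in one line with a shell default value, which behaves exactly like your if block:

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}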

If you have 16 cores per node on the cluster, your configuration should rather be:

#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16

That gives one MPI task per node and one OpenMP thread per core, instead of the two threads per core you have now, which will probably just slow the code down.
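
Putting it together, a job script along these lines should give you 8 MPI tasks with 16 OpenMP threads each. The executable name ./solver is just a placeholder for your binary, and depending on your MPI installation you may have to launch with mpirun instead of srun:

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16

# one OpenMP thread per CPU that SLURM gives to each task
if [ -n "$SLURM_CPUS_PER_TASK" ]; then
  omp_threads=$SLURM_CPUS_PER_TASK
else
  omp_threads=1
fi
export OMP_NUM_THREADS=$omp_threads

# start one solver process per task; ./solver is a placeholder
srun ./solver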

Finally, how do you get the htop output? Do you log in to the compute nodes? That is usually not a good idea on clusters.
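
If you cannot (or should not) log in to the compute nodes, SLURM's accounting tools give at least a rough picture of the CPU usage of a running job. A sketch, with 123456 as a placeholder job ID (this requires job accounting to be enabled on your cluster):

sstat --jobs=123456 --allsteps --format=JobID,NTasks,AveCPU,MaxRSS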

I know this is not a full answer, but without the actual code it is a bit hard to tell more, and this was too long to post as a comment.

Chelmy88