
I am struggling to find the proper way to execute a hybrid OpenMP/MPI job with MPICH (hydra).

I can launch the processes easily enough and they do spawn threads, but the threads stay bound to the same core as their master thread, whichever type of -bind-to I tried.

If I explicitly set GOMP_CPU_AFFINITY to 0-15, all the threads get spread out, but only if I have 1 process per node. I don't want that; I want one process per socket.

Setting OMP_PROC_BIND=false does not have a noticeable effect.

An example of the many different combinations I tried:

export OMP_NUM_THREADS=8
export OMP_PROC_BIND="false"
mpiexec.hydra -n 2 -ppn 2 -envall -bind-to numa  ./a.out

What I get is all processes sitting on one of the cores 0-7 at 100%, and several threads on cores 8-15 with only one of them close to 100% (they are waiting on the first process).
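
For reference, one way to check where the threads actually end up while the job runs (a minimal sketch, assuming the binary is called a.out and that pgrep and taskset are available):

# print the CPU affinity list of every thread of every running a.out process
for pid in $(pgrep a.out); do
    for task in /proc/"$pid"/task/*; do
        taskset -cp "$(basename "$task")"
    done
done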

2 Answers


I do have a somewhat different solution for binding OpenMP threads to sockets / NUMA nodes when running a mixed MPI / OpenMP code, whenever the MPI library and the OpenMP runtime do not collaborate well by default. The idea is to use numactl and its binding properties. This even has the extra advantage of binding not only the threads to the socket, but also the memory, forcing good memory locality and maximising the bandwidth.

To that end, I first disable any MPI and/or OpenMP binding (with the corresponding mpiexec option for the former, and by setting OMP_PROC_BIND to false for the latter). Then I use the following omp_bind.sh shell script:

#!/bin/bash

# bind the wrapped command's CPUs and memory to the NUMA node given by its rank,
# assuming 2 sockets / NUMA nodes per machine
numactl --cpunodebind=$(( $PMI_ID % 2 )) --membind=$(( $PMI_ID % 2 )) "$@"

And I run my code this way:

OMP_PROC_BIND="false" OMP_NUM_THREADS=8 mpiexec -ppn 2 -bind-to none ./omp_bind.sh ./a.out args

Depending on the number of sockets on the machine, the 2 in the script would need to be adjusted. Likewise, the name of the PMI_ID variable depends on the version of mpiexec used; I have sometimes seen MPI_RANK, PMI_RANK, etc.
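
For illustration, a sketch of a more generic wrapper (hypothetical, not tested with every launcher) that picks up whichever rank variable happens to be set and reads the number of NUMA nodes from numactl --hardware instead of hard-coding the 2:

#!/bin/bash

# use whichever rank variable the launcher exports, defaulting to 0
RANK=${PMI_RANK:-${PMI_ID:-${MPI_RANK:-0}}}
# number of NUMA nodes, e.g. "available: 2 nodes (0-1)" -> 2
NNODES=$(numactl --hardware | awk '/^available:/ {print $2}')
NODE=$(( RANK % NNODES ))
exec numactl --cpunodebind="$NODE" --membind="$NODE" "$@"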

But anyway, I always found a way of getting it to work, and the memory binding comes in very handy sometimes, especially to avoid the potential pitfall of the I/O buffers eating up all the memory on the first NUMA node, which would lead to the process running on the first socket allocating its memory on the second NUMA node.
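
To check whether that is happening, one can look at the per-NUMA-node memory breakdown; a minimal sketch, assuming the numastat tool (shipped alongside numactl) is installed and the binary is called a.out:

numastat -m          # system-wide memory usage broken down per NUMA node
numastat -p a.out    # per-node memory of the running a.out processes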

Gilles
  • Interesting. Are you using `numactl --membind` because MPICH and siblings cannot do memory binding on their own? Open MPI does memory binding and even implements some NUMA optimisation in the shared-memory transport. Just asking out of curiosity - I'm an Open MPI (and to a very small extent an Intel MPI) user and MPICH and co is pretty foreign to me. – Hristo Iliev Nov 14 '15 at 13:11
  • @HristoIliev TBH, I started to do this a long time ago, because I sometimes had to run performance tests on machines where MPI wasn't properly configured. It was much simpler to just use `numactl` this way rather than trying to fix the MPI configuration. And at that time OpenMP didn't have any native binding. Nowadays, I'm mostly an Intel MPI user but I'm not too sure how memory binding is managed there. What's for sure is that this same old `numactl` trick still works whenever needed, and moreover, by willingly setting it wrong, it allows one to evaluate the potential NUMA effect on a code (see the sketch after these comments) – Gilles Nov 15 '15 at 01:44
  • Thanks for the other option. I will investigate both. I have some performance problems and it will be interesting to see the effect. – Vladimir F Героям слава Nov 15 '15 at 21:59
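
As an illustration of that last trick, a minimal sketch (hypothetical node numbers, assuming a 2-socket machine) that deliberately crosses the CPU and memory bindings to expose the NUMA penalty:

# CPUs on node 0 but memory forced onto node 1
numactl --cpunodebind=0 --membind=1 ./a.out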

Since libgomp is missing an equivalent of the respect modifier of Intel's KMP_AFFINITY, you could hack around it by providing a wrapper script that reads the list of allowed CPUs from /proc/PID/status (Linux-specific):

#!/bin/sh

# re-export the CPU set that the MPI launcher has already bound this process to,
# so that libgomp pins its threads within that set
GOMP_CPU_AFFINITY=$(grep ^Cpus_allowed_list /proc/self/status | grep -Eo '[0-9,-]+')
export GOMP_CPU_AFFINITY
exec "$@"

This should work with -bind-to numa then.
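
For completeness, a minimal sketch of how this wrapper (saved here under the hypothetical name gomp_wrap.sh) would slot into the command line from the question:

chmod +x gomp_wrap.sh
export OMP_NUM_THREADS=8
mpiexec.hydra -n 2 -ppn 2 -envall -bind-to numa ./gomp_wrap.sh ./a.out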

Hristo Iliev