
The testing environment is Ubuntu 20.04.3 LTS installed on a machine with dual Intel Xeon E5-2699 v4 CPUs and a Supermicro X10DAi motherboard. I am compiling and testing VASP 6.3.0 with the recent/latest Intel oneAPI Base and HPC Toolkits.

The test commands are as follows:

VASP_TESTSUITE_EXE_STD="mpirun -np $nranks -genv OMP_NUM_THREADS=$nthrds -genv I_MPI_PIN_DOMAIN=omp -genv KMP_AFFINITY=verbose,granularity=fine,compact,1,0 -genv KMP_STACKSIZE=512m /home/werner/Public/hpc/vasp/vasp.6.3.0/testsuite/../bin/vasp_std"
VASP_TESTSUITE_EXE_NCL="mpirun -np $nranks -genv OMP_NUM_THREADS=$nthrds -genv I_MPI_PIN_DOMAIN=omp -genv KMP_AFFINITY=verbose,granularity=fine,compact,1,0 -genv KMP_STACKSIZE=512m /home/werner/Public/hpc/vasp/vasp.6.3.0/testsuite/../bin/vasp_ncl"
VASP_TESTSUITE_EXE_GAM="mpirun -np $nranks -genv OMP_NUM_THREADS=$nthrds -genv I_MPI_PIN_DOMAIN=omp -genv KMP_AFFINITY=verbose,granularity=fine,compact,1,0 -genv KMP_STACKSIZE=512m /home/werner/Public/hpc/vasp/vasp.6.3.0/testsuite/../bin/vasp_gam"
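
Such (nranks, nthrds) sweeps can be scripted with a small bash wrapper along the following lines (a minimal sketch, not my exact script; the pair list simply mirrors the combinations benchmarked below, and the benchmark input files are assumed to be in the current directory):

# Sketch: time one run per (nranks, nthrds) pair; `time` is the bash keyword.
for pair in "4 2" "8 2" "12 2" "16 2" "6 4" "8 5"; do
    set -- $pair; nranks=$1; nthrds=$2
    echo "nranks=$nranks nthrds=$nthrds"
    time mpirun -np $nranks -genv OMP_NUM_THREADS=$nthrds \
        -genv I_MPI_PIN_DOMAIN=omp \
        -genv KMP_AFFINITY=verbose,granularity=fine,compact,1,0 \
        -genv KMP_STACKSIZE=512m \
        /home/werner/Public/hpc/vasp/vasp.6.3.0/testsuite/../bin/vasp_std
done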

I found that the wall-clock time for a specific job can differ greatly across combinations of np (i.e., the number of MPI processes) and OMP_NUM_THREADS. In my tests, the combination of -np 16 and OMP_NUM_THREADS=16 was so time-consuming that I terminated that testing step before it finished. For a summary of the time benchmarks corresponding to the tests here, see this file and the discussion here for more detailed information.

So a natural question is: how do I find the optimal combination of the number of processes and OMP_NUM_THREADS for a particular computing task? Is there a rule of thumb?
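
As a baseline constraint (see Victor Eijkhout's comment below), the product nranks × OMP_NUM_THREADS should not exceed the number of physical cores. A quick way to read the relevant counts off the machine (a sketch using the standard lscpu from util-linux; the commented values are for this box):

# Show socket count, cores per socket, and hardware threads per core:
lscpu | grep -E '^(Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core|CPU\(s\))'
# Here: 2 sockets x 22 cores = 44 physical cores and 88 hardware threads,
# so, for example, nranks=22 with OMP_NUM_THREADS=2 exactly fills the cores.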

The following is supplementary information in reply to the comments from Victor Eijkhout, Homer512, and Jérôme Richard:

  1. See the related info given by inxi:
werner@X10DAi-00:~$ inxi -Cxxx
CPU:       Topology: 2x 22-Core model: Intel Xeon E5-2699 v4 bits: 64 type: MT MCP SMP arch: Broadwell rev: 1 
           L2 cache: 110.0 MiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 387287 
           Speed: 1200 MHz min/max: 1200/3600 MHz Core speeds (MHz): 1: 1200 2: 1202 3: 1202 4: 1202 5: 1200 
           6: 1202 7: 1203 8: 1201 9: 1204 10: 1201 11: 1654 12: 2007 13: 2204 14: 2200 15: 1245 16: 1202 
           17: 1202 18: 1202 19: 1203 20: 1202 21: 1203 22: 1202 23: 1202 24: 1201 25: 1202 26: 1202 27: 1201 
           28: 1202 29: 1202 30: 1202 31: 2066 32: 1202 33: 1202 34: 1202 35: 1203 36: 1202 37: 1202 38: 1202 
           39: 1202 40: 1202 41: 1200 42: 1516 43: 1200 44: 1200 45: 1200 46: 1202 47: 1200 48: 1200 49: 1200 
           50: 1200 51: 1201 52: 1201 53: 1201 54: 1201 55: 1200 56: 1201 57: 1204 58: 1200 59: 1200 60: 1609 
           61: 1871 62: 2200 63: 1251 64: 1201 65: 1201 66: 1201 67: 1200 68: 1203 69: 1200 70: 1201 71: 1201 
           72: 1201 73: 1201 74: 1201 75: 1200 76: 1200 77: 1200 78: 1201 79: 1203 80: 1523 81: 1201 82: 1200 
           83: 1200 84: 1201 85: 1201 86: 1200 87: 1200 88: 1204 
werner@X10DAi-00:~$ inxi -Mxxx
Machine:   Type: Desktop System: Supermicro product: X10DAi v: 123456789 serial: <superuser/root required> 
           Mobo: Supermicro model: X10DAI v: 1.02 serial: <superuser/root required> UEFI: American Megatrends 
           v: 3.2 date: 12/16/2019 
werner@X10DAi-00:~$ inxi -Sxxx
System:    Host: X10DAi-00 Kernel: 5.8.0-43-generic x86_64 bits: 64 compiler: N/A Desktop: GNOME 3.36.9 
           tk: GTK 3.24.20 wm: gnome-shell dm: GDM3 3.36.3 Distro: Ubuntu 20.04.3 LTS (Focal Fossa) 
  2. I re-ran the test discussed here. The time baselines and the corresponding option combinations are as follows:
nranks=4 nthrds=2
real    0m13.666s
user    1m20.643s
sys 0m4.314s

nranks=8 nthrds=2
real    0m11.908s
user    2m9.973s
sys 0m7.549s

nranks=12 nthrds=2
real    0m11.043s
user    2m55.062s
sys 0m11.161s

nranks=16 nthrds=2
real    0m11.087s
user    3m45.074s
sys 0m15.343s


nranks=4 nthrds=2
real    0m13.511s
user    1m19.949s
sys 0m4.185s

nranks=6 nthrds=4
real    0m13.736s
user    3m38.704s
sys 0m12.471s

nranks=8 nthrds=5
real    0m12.378s
user    5m13.113s
sys 0m18.022s

It seems that the above results are consistent with the comments given by Homer512:

Typical setups to test are one process per core (1-2 threads) or one per LLC with as many threads as appropriate.
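
On this machine (2 sockets × 22 cores, with the L3/LLC shared per socket, which `lstopo` from the hwloc package can confirm), those two typical setups would translate into something like the following (a sketch using Intel MPI pinning domains; these exact lines are untested):

# One MPI rank per physical core, one OpenMP thread each (44 x 1 = 44 cores):
mpirun -np 44 -genv OMP_NUM_THREADS=1 -genv I_MPI_PIN_DOMAIN=core \
    /home/werner/Public/hpc/vasp/vasp.6.3.0/testsuite/../bin/vasp_std

# One MPI rank per LLC (i.e., per socket here), threads filling the socket (2 x 22):
mpirun -np 2 -genv OMP_NUM_THREADS=22 -genv I_MPI_PIN_DOMAIN=socket \
    /home/werner/Public/hpc/vasp/vasp.6.3.0/testsuite/../bin/vasp_std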

Regards, HZ

  • You're not saying how many cores you have. In general, the product of MPI processes and OMP threads should not be more than the number of cores. If you have hyperthreads, they may help, but need not. – Victor Eijkhout Feb 19 '22 at 04:42
  • What system are you using? How many cores, threads, sockets? What is the hierarchy of the last-level cache (LLC)? You can use `lstopo` from OpenMPI's hwloc package to find out. Typical setups to test are one process per core (1-2 threads) or one per LLC with as many threads as appropriate. – Homer512 Feb 19 '22 at 09:43
  • The `-np 16` should run 16 MPI processes and `OMP_NUM_THREADS=16` should run 16 threads per process, resulting in 256 threads. The thing is, your processor has 22 cores and 44 hardware threads. The number of sockets is not provided, but even with 4 sockets per node (quite unusual), there would not be enough places for the threads to run fully in parallel. You need to care about this because it matters a lot. Thread binding matters too: you need to place threads on specific cores to get better performance in most cases. This is highly dependent on the target application. – Jérôme Richard Feb 19 '22 at 11:30
  • It looks like the application clearly does not scale with threads/processes... This is quite surprising for an HPC application. I advise you to check the affinity with `KMP_AFFINITY=verbose`. If threads are sharing the same hardware thread, then there is definitely a problem. If they share the same core, then it is certainly sub-optimal. By the way, NUMA effects can also cause issues for some applications, so it is probably wise to check this by running the application on only one socket (i.e., using 1 MPI process on a node + `numactl` + the previous OpenMP env variables). – Jérôme Richard Feb 19 '22 at 13:51
  • @JérômeRichard 1. As for `KMP_AFFINITY=verbose`, as you can see, the following setting has been used: `KMP_AFFINITY=verbose,granularity=fine,compact,1,0`. 2. What is the concrete setting corresponding to "running the application on only one socket"? I'm a newbie with OpenMP. – Hongyi Zhao Feb 20 '22 at 01:59
  • @HongyiZhao Ah, sorry, I missed it for `KMP_AFFINITY`. It would be good to have the output from the application that specifies the binding of threads at runtime (that is the purpose of using `verbose`). For the binding, see the doc of [`OMP_PLACES`](https://www.openmp.org/spec-html/5.0/openmpse53.html): you can use `OMP_PLACES="sockets(1)"`. This is not the best thing to do, nor sufficient, but it is simple. As for [`numactl`](https://linux.die.net/man/8/numactl), use `--membind=0` and `--cpunodebind=0`. The best is to bind each thread manually, but this is a bit cumbersome/tricky. – Jérôme Richard Feb 20 '22 at 12:48
  • @JérômeRichard VASP is an HPC application, but it's very complicated. The scaling behavior depends very much on what you're modeling. – Victor Eijkhout Feb 20 '22 at 13:27
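
For concreteness, the single-socket check suggested by Jérôme Richard above could look like the following (a sketch, assuming NUMA node 0 coincides with socket 0 and that the environment is inherited by the ranks in a single-node run; untested here):

# Confine one rank, its OpenMP threads, and its memory allocation to socket 0
# only, to rule out NUMA effects:
OMP_NUM_THREADS=22 OMP_PLACES="sockets(1)" \
    numactl --cpunodebind=0 --membind=0 \
    mpirun -np 1 /home/werner/Public/hpc/vasp/vasp.6.3.0/testsuite/../bin/vasp_std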
