
I have been evaluating an Open MPI program that runs a matrix multiplication algorithm. The code scales very well on the single-thread-per-core machines in our laboratory (close to ideal speedup with 48 and 64 cores). However, on some other machines, which are hyper-threaded, there is strange behavior: as you can see in the htop screenshot, the CPU utilization is unexpectedly different when I run the same experiment with the same command. I executed the program with

mpirun --bind-to hwthread --use-hwthread-cpus -n 2 ...

[htop screenshot showing CPU utilization]

Here I bind the MPI workers to hardware threads, and with -n 2 I restrict the execution to two processors (here, hwthreads). However, it seems that another hwthread is also used, at roughly 50% utilization! I find this strange because there is no such extra CPU utilization on the other machines. I have repeated this experiment many times, and I am sure it is not some temporary check by the OS; it is due to the execution model of Open MPI.
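To verify exactly which hwthreads the two ranks are pinned to, Open MPI's --report-bindings option prints each rank's binding at launch (a minimal check; ./matmul is a placeholder for the actual binary):

# Report the binding of each rank before the computation starts;
# "./matmul" stands in for the real executable name.
mpirun --bind-to hwthread --use-hwthread-cpus --report-bindings -n 2 ./matmul

If both ranks land on the two hwthreads of the same physical core, they compete for that core's execution units, which alone can distort per-thread utilization figures.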

I would appreciate it if someone could explain this behavior and the extra CPU utilization when I run this on a hyper-threaded machine.

The output of lscpu is as follows:

lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           1
Model name:                      AMD Ryzen Threadripper 1950X 16-Core Processor
Stepping:                        1
Frequency boost:                 enabled
CPU MHz:                         2200.000
CPU max MHz:                     3400.0000
CPU min MHz:                     2200.0000
BogoMIPS:                        6786.36
Virtualization:                  AMD-V
L1d cache:                       512 KiB
L1i cache:                       1 MiB
L2 cache:                        8 MiB
L3 cache:                        32 MiB

The version of Open MPI is the same on all machines: 2.1.1. Maybe hyper-threading is not the cause and I was misled by it, but the only big differences between these environments are 1) hyper-threading and 2) the clock frequency of the processors, which ranges from 2200 MHz to 4.8 GHz depending on the CPU.
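Since the clock frequency is one of the suspects, the frequency governor and the SMT sibling layout can be checked through sysfs (a quick sketch using the standard Linux cpufreq/topology paths):

# Show the active frequency governor of every logical CPU
# (e.g. "performance" vs. "ondemand"/"schedutil").
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

# List the hwthreads that share a physical core with CPU 0.
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list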

Elephant88
  • You may need to supply some additional info to get a good answer (also check your usage of MHz vs. GHz). What Linux kernel version are you using? Is it possible you are saturating your hard drive read/write bandwidth, or even your memory bandwidth? Are you using any additional programs/drivers to alter voltage or CPU states? What CPU state are your processors in most of the time? Note that [aggressive monitoring](https://www.reddit.com/r/Amd/comments/cbls9g/the_final_word_on_idle_voltages_for_3rd_gen_ryzen/?utm_source=share&utm_medium=ios_app&utm_name=iossmf) of Ryzen CPUs affects performance. – Parker Sep 12 '21 at 13:49
  • Do you expect 50% or 100% CPU usage in htop? Try to pin a CPU-intensive program on a single hwthread and double-check that htop displays what you expect. When running the program, `grep Cpus_allowed_list /proc/<pid>/status` to figure out on which hwthread each MPI task is bound. Do the hwthreads belong to the same core? – Gilles Gouaillardet Sep 12 '21 at 14:08
  • @Parker Thanks for the comment. The Linux kernel is ```5.11.0-27-generic```, and I did some experiments with the "bandwidth" benchmarking tool: the memory bandwidth is about the same. I monitored the load of the system with the ```w``` command; it is usually around 0.7 with a monitor attached and 0.3 when I detach the monitor. The Ryzen family point you mentioned is a good one; I will look into the monitoring aspect of the Ryzen family. – Elephant88 Sep 12 '21 at 16:52
  • @GillesGouaillardet, I expect that when I execute a pure Open MPI program with the ```-n 2``` and ```--use-hwthread-cpus``` options, it runs two processes on two hwthreads, no more and no less. The program is already CPU-intensive, and each process uses exactly 100% of a core (on the other machines, which are not hyper-threaded). As for your last question: yes, the two hwthreads are on the same core. But as you can see in the htop output, another hwthread on another core is used at about 50% as well. This is not the correct execution! I don't know what is wrong; the machine is not in **performance mode**, could that be the cause? – Elephant88 Sep 12 '21 at 17:22
  • Let me see if I get it right: your two MPI processes are correctly pinned on the two hwthreads of the first core and hence run at 100%; so far so good. The issue is that one or more other processes run at 50% utilization on the second core. Am I correct so far? If so, your next step is to figure out what is running on core 2. The first suspects would be `mpirun` and `htop`. To figure this out, you can manually migrate processes to another core, for example via the `taskset -cp ...` command (a sketch follows this thread). – Gilles Gouaillardet Sep 13 '21 at 00:42
  • Hi @GillesGouaillardet, no, the problem is that the two MPI processes execute with a total of 250% CPU usage: hwthread_1 at 100%, hwthread_2 at 100%, plus 50% of another core, which I realized is some kernel activity that comes up when I run the MPI program. I don't know what this huge kernel activity is that takes so many resources (close to 50% of one hwthread). – Elephant88 Sep 13 '21 at 06:18
  • try `mpirun --mca pml ob1 ...` and see if it changes the behavior – Gilles Gouaillardet Sep 13 '21 at 08:10
  • @GillesGouaillardet No, that option doesn't change the behavior either. I'm getting closer to the conclusion that hyper-threading doesn't work ideally with pure MPI, since it makes the kernel do extra work to manage the shared resources between the threads on a core. The program scales linearly on a very small and cheap x86 board (4 cores) too, but not on a very expensive PC with the same Open MPI version, etc. – Elephant88 Sep 13 '21 at 15:13
  • Consider updating Open MPI to a supported version (e.g. 4.1.1) when running on the latest hardware. – Gilles Gouaillardet Sep 13 '21 at 23:12
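Putting Gilles Gouaillardet's suggestions together, here is a minimal sketch for inspecting the binding of each running rank and for migrating a suspect process to another core (the process name matmul and <pid> are placeholders):

# Check which hwthreads each MPI rank is allowed to run on;
# "matmul" is a placeholder for the actual process name.
for pid in $(pgrep -f matmul); do
    grep -H Cpus_allowed_list /proc/$pid/status
done

# Migrate a suspect process (e.g. mpirun or htop) to hwthreads 30-31
# and watch in htop whether the ~50% load follows it to the new core.
taskset -cp 30,31 <pid>   # replace <pid> with the PID from htop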

0 Answers