
I just installed Linux and Intel MPI on two machines:

(1) Quite old (~8 years old) SuperMicro server, which has 24 cores (Intel Xeon X7542 X 4). 32 GB memory. OS: CentOS 7.5

(2) New HP ProLiant DL380 server, which has 32 cores (Intel Xeon Gold 6130 X 2). 64 GB memory. OS: OpenSUSE Leap 15

After installing the OS and Intel MPI, I compiled the Intel MPI Benchmarks and ran them:

$ mpirun -np 4 ./IMB-EXT

Surprisingly, I get the same error when running IMB-EXT and IMB-RMA on both machines, even though the OS and everything else differ (even the GCC version used to compile the benchmark differs: GCC 6.5.0 on CentOS and GCC 7.3.1 on OpenSUSE).

On the CentOS machine, I get:

#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
#    MODE: AGGREGATE
#
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         0.05         0.00
            4         1000        30.56         0.13
            8         1000        31.53         0.25
           16         1000        30.99         0.52
           32         1000        30.93         1.03
           64         1000        30.30         2.11
          128         1000        30.31         4.22

and on the OpenSUSE machine, I get

#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
#    MODE: AGGREGATE
#
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         0.04         0.00
            4         1000        14.40         0.28
            8         1000        14.04         0.57
           16         1000        14.10         1.13
           32         1000        13.96         2.29
           64         1000        13.98         4.58
          128         1000        14.08         9.09

When I don't use mpirun (i.e., only one process runs IMB-EXT), the benchmark completes, but Unidir_Put needs >= 2 processes, so that doesn't help much. I also find that the benchmarks using MPI_Put and MPI_Get are far slower than I would expect from experience. Using MVAPICH on the OpenSUSE machine did not help either; its output is:

#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 6 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
#    MODE: AGGREGATE
#
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         0.03         0.00
            4         1000        17.37         0.23
            8         1000        17.08         0.47
           16         1000        17.23         0.93
           32         1000        17.56         1.82
           64         1000        17.06         3.75
          128         1000        17.20         7.44

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 49213 RUNNING AT iron-0-1
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Update: I tested Open MPI, and it runs through smoothly (although my application does not recommend using Open MPI, and I still don't understand why Intel MPI or MVAPICH doesn't work...):

#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
#    MODE: AGGREGATE
#
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         0.06         0.00
            4         1000         0.23        17.44
            8         1000         0.22        35.82
           16         1000         0.22        72.36
           32         1000         0.22       144.98
           64         1000         0.22       285.76
          128         1000         0.30       430.29
          256         1000         0.39       650.78
          512         1000         0.51      1008.31
         1024         1000         0.84      1214.42
         2048         1000         1.86      1100.29
         4096         1000         7.31       560.59
         8192         1000        15.24       537.67
        16384         1000        15.39      1064.82
        32768         1000        15.70      2086.51
        65536          640        12.31      5324.63
       131072          320        10.24     12795.03
       262144          160        12.49     20993.49
       524288           80        30.21     17356.93
      1048576           40        81.20     12913.67
      2097152           20       199.20     10527.72
      4194304           10       394.02     10644.77

Is there any chance that I am missing something when installing MPI, or when installing the OS on these servers? Actually, I suspect the OS is the problem, but I am not sure where to start...

Thanks a lot in advance,

Jae


1 Answer


Although this question is well written, you were not explicit about the following (a small version-reporting sketch follows this list):

  • the Intel MPI Benchmarks version (please add the header of the benchmark output)
  • the Intel MPI version
  • the Open MPI version
  • the MVAPICH version
  • the supported host network fabrics, for each MPI distribution
  • the fabric selected while running the MPI benchmark
  • the compilation settings
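
If you are not sure which library each binary actually picks up at run time, a small program along the lines of the sketch below can report it. It relies only on the standard MPI-3 call MPI_Get_library_version; the file name is my own. Compile it once with each wrapper (mpiicc for Intel MPI, mpicc from Open MPI or MVAPICH) and run it with mpirun -np 1.

/* version_report.c -- print which MPI library this binary is linked against */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len;

    MPI_Init(&argc, &argv);
    /* MPI-3.0: fills 'version' with the library's own version string */
    MPI_Get_library_version(version, &len);
    printf("%s\n", version);
    MPI_Finalize();
    return 0;
}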

Debugging this kind of trouble across disparate host machines, multiple Linux distributions, and compiler versions can be quite hard. Remote debugging via Stack Overflow is even harder.

First of all, ensure reproducibility; that seems to be the case here. Of the many debugging approaches, the one I would recommend is to reduce the complexity of the system as a whole, test smaller sub-systems, and start shifting responsibility to third parties. You may replace self-compiled executables with software packages provided by the distribution's package repositories or by third parties such as Conda.
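
In that spirit, for the RMA path specifically, a minimal self-contained test like the sketch below (written here for illustration, not taken from the benchmark sources; file and variable names are my own) mimics the Unidir_Put pattern: rank 0 puts a buffer into a window exposed by rank 1. If this also crashes under Intel MPI and MVAPICH but not under Open MPI, the problem is more likely in the selected fabric/RMA layer of those libraries than in your benchmark build.

/* put_min.c -- minimal MPI one-sided (RMA) test.
 * Build and run, e.g.: mpicc put_min.c -o put_min && mpirun -np 2 ./put_min */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int nbytes = 128;
    char buf[128];
    char *winbuf;
    MPI_Win win;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank exposes a 128-byte window backed by MPI-allocated memory. */
    MPI_Win_allocate(nbytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &winbuf, &win);
    memset(winbuf, 0, nbytes);
    memset(buf, rank == 0 ? 'x' : 0, nbytes);

    MPI_Win_fence(0, win);
    if (rank == 0)
        /* Rank 0 writes its buffer into rank 1's window. */
        MPI_Put(buf, nbytes, MPI_CHAR, 1, 0, nbytes, MPI_CHAR, win);
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 window now starts with: %.8s\n", winbuf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Running such a test with an explicit fabric selection (for example via Intel MPI's I_MPI_FABRICS environment variable) then narrows things down further.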

Intel recently started to provide its libraries through YUM/APT repositories as well as through Conda and PyPI. I have found that this helps a lot with reproducible deployments of HPC clusters and even of runtime/development environments. I recommend using it for CentOS 7.5.

  • YUM/APT repositories for Intel MKL, Intel IPP, Intel DAAL, and the Intel Distribution for Python (for Linux)
  • Conda package / Anaconda Cloud support (Intel MKL, Intel IPP, Intel DAAL, Intel Distribution for Python)
  • Installation from the Python Package Index (PyPI) using pip (Intel MKL, Intel IPP, Intel DAAL)

I do not know much about OpenSUSE Leap 15.

Sascha Gottfried