
I run a small cluster for MPI computing, and we recently acquired some EDR InfiniBand equipment. I am testing it with two computers connected through an unmanaged switch, and I am able to run a test program with 30 processes across both nodes. Monitoring the InfiniBand counters in

/sys/class/infiniband/mlx5_0/ports/1/counters

I can see data is flowing, so far so good. Now I would like to run the same test over the Ethernet TCP connection in order to measure the real improvement InfiniBand provides, but I am unable to make it work. I have tried all sorts of MCA parameters, such as:

/usr/lib64/openmpi3/bin/mpirun --hostfile hostfile.txt --mca btl ^openib -np 30 nek5000
/usr/lib64/openmpi3/bin/mpirun --hostfile hostfile.txt --mca btl tcp,self,vader -np 30 nek5000
/usr/lib64/openmpi3/bin/mpirun --hostfile hostfile.txt --mca btl tcp,self,vader --mca btl_openib_if_exclude mlx5_0  -np 30 nek5000
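Commands like the ones above can be checked against the counters directly. A small helper along these lines (a sketch: `counter_delta` is a made-up name, and the counter path is the one from this setup) prints how much an InfiniBand counter grew while a command ran — a near-zero delta means the traffic did not go over InfiniBand:

```shell
# Snapshot an InfiniBand port counter, run a command, and print how much
# the counter grew during the run. CNT is the path used in this setup;
# port_xmit_data is one of the per-port counters under that directory.
CNT=/sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data

counter_delta() {
    # $1 = counter file, remaining arguments = command to run
    local file=$1; shift
    local before after
    before=$(cat "$file")
    "$@"
    after=$(cat "$file")
    echo $((after - before))
}

# Usage sketch (flags as in the attempts above):
# counter_delta "$CNT" /usr/lib64/openmpi3/bin/mpirun --hostfile hostfile.txt \
#     --mca btl tcp,self,vader -np 30 nek5000
```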

All without luck. When the program runs, the InfiniBand counters keep increasing and the running time is no different from the InfiniBand run. Any thoughts? Both computers run CentOS 7, and this is the ompi_info output:

                 Package: Open MPI mockbuild@x86-02.bsys.centos.org
                          Distribution
                Open MPI: 3.1.3
  Open MPI repo revision: v3.1.3
   Open MPI release date: Oct 29, 2018
                Open RTE: 3.1.3
  Open RTE repo revision: v3.1.3
   Open RTE release date: Oct 29, 2018
                    OPAL: 3.1.3
      OPAL repo revision: v3.1.3
       OPAL release date: Oct 29, 2018
                 MPI API: 3.1.0
            Ident string: 3.1.3
                  Prefix: /usr/lib64/openmpi3
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: x86-02.bsys.centos.org
           Configured by: mockbuild
           Configured on: Thu Aug 22 16:51:48 UTC 2019
          Configure host: x86-02.bsys.centos.org
  Configure command line: '--prefix=/usr/lib64/openmpi3'
                          '--mandir=/usr/share/man/openmpi3-x86_64'
                          '--includedir=/usr/include/openmpi3-x86_64'
                          '--sysconfdir=/etc/openmpi3-x86_64'
                          '--disable-silent-rules' '--enable-builtin-atomics'
                          '--enable-mpi-cxx' '--with-sge' '--with-valgrind'
                          '--enable-memchecker' '--with-hwloc=/usr'
                          '--with-ucx' 'CC=gcc' 'CXX=g++'
                          'LDFLAGS=-Wl,-z,relro ' 'CFLAGS= -O2 -g -pipe -Wall
                          -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
                          -fstack-protector-strong --param=ssp-buffer-size=4
                          -grecord-gcc-switches   -m64 -mtune=generic'
                          'CXXFLAGS= -O2 -g -pipe -Wall
                          -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
                          -fstack-protector-strong --param=ssp-buffer-size=4
                          -grecord-gcc-switches   -m64 -mtune=generic'
                          'FC=gfortran' 'FCFLAGS= -O2 -g -pipe -Wall
                          -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
                          -fstack-protector-strong --param=ssp-buffer-size=4
                          -grecord-gcc-switches   -m64 -mtune=generic'
                Built by: mockbuild
                Built on: Thu Aug 22 16:55:56 UTC 2019
              Built host: x86-02.bsys.centos.org
              C bindings: yes
            C++ bindings: yes
             Fort mpif.h: yes (all)
            Fort use mpi: yes (limited: overloading)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: no
 Fort mpi_f08 compliance: The mpi_f08 module was not built
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
      C compiler version: 4.8.5
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /usr/bin/gfortran
         Fort ignore TKR: no
   Fort 08 assumed shape: no
      Fort optional args: no
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: no
      Fort BIND(C) (all): no
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): no
       Fort TYPE,BIND(C): no
 Fort T,BIND(C,name="a"): no
            Fort PRIVATE: no
          Fort PROTECTED: no
           Fort ABSTRACT: no
       Fort ASYNCHRONOUS: no
          Fort PROCEDURE: no
         Fort USE...ONLY: no
           Fort C_FUNLOC: no
 Fort f08 using wrappers: no
         Fort MPI_SIZEOF: no
             C profiling: yes
           C++ profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: no
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
          MPI extensions: affinity, cuda
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v3.1.3)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v3.1.3)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA btl: openib (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA btl: self (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA btl: tcp (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA btl: usnic (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA btl: vader (MCA v2.1.0, API v3.0.0, Component v3.1.3)
            MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v3.1.3)
            MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA crs: none (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v3.1.3)
               MCA event: libevent2022 (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
               MCA hwloc: external (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v3.1.3)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v3.1.3)
          MCA memchecker: valgrind (MCA v2.1.0, API v2.0.0, Component v3.1.3)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v3.1.3)
               MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v3.1.3)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v3.1.3)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                MCA pmix: pmix2x (MCA v2.1.0, API v2.0.0, Component v3.1.3)
               MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v3.1.3)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v3.1.3)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v3.1.3)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v3.1.3)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v3.1.3)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v3.1.3)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA dfs: app (MCA v2.1.0, API v1.0.0, Component v3.1.3)
                 MCA dfs: orted (MCA v2.1.0, API v1.0.0, Component v3.1.3)
                 MCA dfs: test (MCA v2.1.0, API v1.0.0, Component v3.1.3)
              MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component
                          v3.1.3)
              MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component
                          v3.1.3)
              MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component
                          v3.1.3)
              MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component
                          v3.1.3)
              MCA errmgr: dvm (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA ess: env (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component
                          v3.1.3)
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v3.1.3)
               MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v3.1.3)
             MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v3.1.3)
            MCA notifier: syslog (MCA v2.1.0, API v1.0.0, Component v3.1.3)
                MCA odls: default (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA oob: ud (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v3.1.3)
                MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v3.1.3)
               MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v3.1.3)
               MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v3.1.3)
               MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
               MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
               MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
               MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA rml: ofi (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v3.1.3)
              MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v3.1.3)
              MCA routed: debruijn (MCA v2.1.0, API v3.0.0, Component v3.1.3)
              MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v3.1.3)
              MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v3.1.3)
              MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v3.1.3)
              MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v3.1.3)
              MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v3.1.3)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v3.1.3)
               MCA state: app (MCA v2.1.0, API v1.0.0, Component v3.1.3)
               MCA state: dvm (MCA v2.1.0, API v1.0.0, Component v3.1.3)
               MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v3.1.3)
               MCA state: novm (MCA v2.1.0, API v1.0.0, Component v3.1.3)
               MCA state: orted (MCA v2.1.0, API v1.0.0, Component v3.1.3)
               MCA state: tool (MCA v2.1.0, API v1.0.0, Component v3.1.3)
                 MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
                MCA coll: self (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                MCA coll: spacc (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v3.1.3)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v3.1.3)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
               MCA fcoll: static (MCA v2.1.0, API v2.0.0, Component v3.1.3)
               MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                  MCA io: romio314 (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA mtl: psm (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA mtl: psm2 (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v3.1.3)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v3.1.3)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v3.1.3)
                 MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v3.1.3)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v3.1.3)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v3.1.3)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v3.1.3)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v3.1.3)

Thank you!

xaviote
  • Try `mpirun --mca btl tcp,self,vader --mca pml ob1 ...` – Gilles Gouaillardet May 31 '20 at 23:07
  • Thank you! Now it doesn't use InfiniBand, but it doesn't use TCP either. One of the nodes is dual-homed, and both nodes have virbr0 interfaces from QEMU, which seems to be a problem for MPI communication. I'm on it... – xaviote Jun 01 '20 at 13:22
  • After applying Gilles' suggestion, the program doesn't run at all. I cannot make it work even if I use only one node. Increasing the debug level shows a lot of: `btl:tcp: path from 192.168.192.100 to 192.168.192.100: IPV4 PRIVATE SAME NETWORK`, but the program hangs. – xaviote Jun 03 '20 at 06:24
  • If you run on a single node, `btl/tcp` should not be used with the previous command, so this is really odd. Can you confirm `mpirun --mca btl vader,self --mca pml ob1 -np 1 ...` is working on a single node? Also, do you have some kind of NAT between nodes? – Gilles Gouaillardet Jun 03 '20 at 06:33
  • No, I don't have any NAT (not that I know of). On a single node, I can run `/usr/lib64/openmpi3/bin/mpirun --host got02 --mca btl self,vader --oversubscribe --mca pml ob1 -np 30 ./hello_c` and `ring_c` (examples from openmpi3), but not the program I'm testing, nek5000 (which worked over InfiniBand without problems). It hangs at varying points very early in the run. Am I missing something? Could it be an issue with vader? – xaviote Jun 03 '20 at 09:38
  • It could be, what if you use `btl/tcp` on one node: `mpirun --mca pml ob1 --mca btl tcp,self,...`? – Gilles Gouaillardet Jun 03 '20 at 09:41
  • In fact, running `/openmpi3/bin/mpirun --host got02 --mca btl ^vader --oversubscribe --mca pml ob1 -np 30 ./nek5000` worked! – xaviote Jun 03 '20 at 09:45
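Putting the thread's findings together: this Open MPI build has UCX (`--with-ucx`, and `pml: ucx` in the component list above), and the UCX PML drives InfiniBand directly without going through the BTL layer, which is why `--mca btl` alone had no effect. A command along these lines combines the fixes discussed above — ob1 PML, no vader, and the QEMU bridge kept out of the TCP BTL (virbr0 comes from the comments; this exact combination is an assumption, not something tested in the thread):

```shell
# Force inter-node traffic onto TCP: select the ob1 PML so pml/ucx cannot
# grab InfiniBand, drop vader (which hung nek5000 here), and exclude the
# libvirt bridge from btl/tcp. Note: setting btl_tcp_if_exclude replaces
# the default, so lo must be listed explicitly.
/usr/lib64/openmpi3/bin/mpirun --hostfile hostfile.txt \
    --mca pml ob1 \
    --mca btl tcp,self \
    --mca btl_tcp_if_exclude virbr0,lo \
    -np 30 nek5000
```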

0 Answers