
Hello, community!

For instance, we have two nodes interconnected for MPI, each with the following interfaces: ib0 (InfiniBand), eth10 (Ethernet) and lo.

To run MPI on the mlx4 device with RDMA, we use the following command:

mpirun --allow-run-as-root --host host1,host2 --mca btl openib,self,vader --mca btl_openib_allow_ib true --mca btl_openib_if_include mlx4_0:1 ~/hellompi
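(For reference, ~/hellompi is just a compiled MPI test binary; its actual source isn't shown here, but it is assumed to be a minimal hello-world along these lines, built with "mpicc -o ~/hellompi hellompi.c":)

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */

        char host[MPI_MAX_PROCESSOR_NAME];
        int len;
        MPI_Get_processor_name(host, &len);     /* node this rank runs on */

        printf("Hello from rank %d of %d on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }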

Now we want to compare the RDMA and non-RDMA versions. The most obvious command to run in TCP mode is:

 mpirun --allow-run-as-root --host host1,host2 --mca btl "^openib" ~/hellompi
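(Here "^openib" tells Open MPI to exclude the openib BTL so it should fall back to TCP; an equivalent, more explicit variant, assuming the default self and vader components are available, would be:)

 mpirun --allow-run-as-root --host host1,host2 --mca btl tcp,self,vader ~/hellompi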

However, it returns the message below:

WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: h2
  PID:        3219
  Message:    connect() to 12.12.12.3:1024 failed
  Error:      Operation now in progress (115)

According to ifconfig, eth10 has the inet addresses 12.12.12.2 and 12.12.12.3 on the two hosts.

Let's add the --mca btl_tcp_if_include eth10 parameter to the mpirun options... But no progress, still the same connection error!
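(For completeness, the full command tried was presumably along these lines:)

 mpirun --allow-run-as-root --host host1,host2 --mca btl "^openib" --mca btl_tcp_if_include eth10 ~/hellompi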

So what's the correct way to run it without the ib0 interface and the mlx4 device? In other words, how do I run MPI over the TCP interface only, without the RDMA feature?

Thanks.

  • it looks like there is a firewall running on/between the hosts! – Gilles Gouaillardet Sep 07 '22 at 13:14
  • You can quickly try creating a TCP server on one of the nodes using "sudo nc -l 1024" and connect to it from the other node using "nc 12.12.12.2 1024" and see if that works to rule out any firewall issues and such. – Ankush Jain Sep 23 '22 at 16:48

0 Answers