
I have the following setup: seven nodes, named gauss1 through gauss7. The connections among gauss1 through gauss6 are stable; only gauss7 is causing trouble.

# ibnodes
Ca  : 0x0002c90300f2eef0 ports 2 "gauss1 mlx4_0"
Ca  : 0x0002c90300f2ef20 ports 2 "gauss2 mlx4_0"
Ca  : 0x7cfe900300be5350 ports 1 "gauss3 mlx4_0"
Ca  : 0x7cfe900300be5170 ports 1 "gauss4 mlx4_0"
Ca  : 0x7cfe900300be51a0 ports 1 "gauss5 mlx4_0"
Ca  : 0x248a070300d8f5c0 ports 1 "gauss6 mlx4_0"
Ca  : 0xec0d9a03002baf50 ports 1 "gauss7 mlx4_0"

So all nodes appear to be registered on the switch. The port state is ACTIVE on gauss1 through gauss6; only on gauss7 is the port state INIT.

ibv_devinfo on gauss7 says:

hca_id: mlx4_0
    transport:          InfiniBand (0)
    fw_ver:             2.42.5000
    node_guid:          ec0d:9a03:002b:af50
    sys_image_guid:         ec0d:9a03:002b:af53
    vendor_id:          0x02c9
    vendor_part_id:         4099
    hw_ver:             0x0
    board_id:           MT_1100120019
    phys_port_cnt:          1
        port:   1
            state:          PORT_INIT (2)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         3
            port_lid:       9
            port_lmc:       0x00
            link_layer:     InfiniBand
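
To check this on several nodes without reading the output by eye, the `ibv_devinfo` text can be parsed for ports that are not in `PORT_ACTIVE`. The following is only a minimal sketch: the function names `port_states` and `ports_not_active` are made up for illustration, and the parser assumes the output layout matches the excerpt above.

```python
import re


def port_states(devinfo_text):
    """Return {(hca_id, port_number): state} parsed from ibv_devinfo text.

    Assumes the layout shown above: an 'hca_id:' line, then 'port:' lines,
    each followed by a 'state:' line.
    """
    states = {}
    hca = None
    port = None
    for line in devinfo_text.splitlines():
        m = re.match(r"\s*hca_id:\s*(\S+)", line)
        if m:
            hca, port = m.group(1), None
            continue
        # Matches 'port:   1' but not 'port_lid:' or 'phys_port_cnt:'.
        m = re.match(r"\s*port:\s*(\d+)", line)
        if m:
            port = int(m.group(1))
            continue
        m = re.match(r"\s*state:\s*(\w+)", line)
        if m and hca is not None and port is not None:
            states[(hca, port)] = m.group(1)
    return states


def ports_not_active(devinfo_text):
    """Return only the ports whose state is not PORT_ACTIVE."""
    return {
        key: state
        for key, state in port_states(devinfo_text).items()
        if state != "PORT_ACTIVE"
    }


# Abbreviated copy of the gauss7 output from the question.
SAMPLE = """\
hca_id: mlx4_0
    transport:          InfiniBand (0)
    phys_port_cnt:          1
        port:   1
            state:          PORT_INIT (2)
            sm_lid:         3
            port_lid:       9
"""

print(ports_not_active(SAMPLE))
```

On a live node the text could come from running `ibv_devinfo` and capturing its standard output instead of using the hard-coded `SAMPLE` string.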

I have installed opensm on gauss7 as well, and it reports that it is in STANDBY:

Feb 02 20:15:36 gauss7 opensm-launch[355306]: Using default GUID 0xec0d9a03002baf51
Feb 02 20:15:36 gauss7 OpenSM[355309]: Entering DISCOVERING state
Feb 02 20:15:36 gauss7 opensm-launch[355306]: Entering DISCOVERING state
Feb 02 20:15:36 gauss7 OpenSM[355309]: Entering STANDBY state
Feb 02 20:15:36 gauss7 opensm-launch[355306]: Entering STANDBY state
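
The state that a subnet manager instance finally settles in can be read from the last "Entering ... state" line of its log. The sketch below is only an illustration: the helper name `last_sm_state` is hypothetical, and the pattern is taken from the log excerpt above, so other OpenSM versions may word these messages differently.

```python
import re

# Pattern assumed from the journal lines shown above.
ENTERING = re.compile(r"Entering (\w+) state")


def last_sm_state(log_text):
    """Return the most recent state named in 'Entering <STATE> state' lines,
    or None if no such line is present."""
    state = None
    for line in log_text.splitlines():
        m = ENTERING.search(line)
        if m:
            state = m.group(1)
    return state


# Copy of the gauss7 log lines from the question.
LOG = """\
Feb 02 20:15:36 gauss7 opensm-launch[355306]: Using default GUID 0xec0d9a03002baf51
Feb 02 20:15:36 gauss7 OpenSM[355309]: Entering DISCOVERING state
Feb 02 20:15:36 gauss7 opensm-launch[355306]: Entering DISCOVERING state
Feb 02 20:15:36 gauss7 OpenSM[355309]: Entering STANDBY state
Feb 02 20:15:36 gauss7 opensm-launch[355306]: Entering STANDBY state
"""

print(last_sm_state(LOG))
```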

My Question: How can I set the port on gauss7 to ACTIVE and have a connection between all 7 nodes?

Thomas
Shibumi
  • Double-check the cabling by swapping the connections of e.g. `gauss6` and `gauss7`. If the problem moves to `gauss6`, it is a bad cable or a bad port on the switch; simply reseating the cable may also help. If the error stays with `gauss7`, investigate on the server and power it off and on again. `opensm` does not need to run on each node, and only one instance can be active; the other `opensm` instances are in STANDBY. – Thomas Feb 02 '18 at 19:26
  • The physical port says `LinkUp`. Doesn't that mean the physical connection is OK? By the way, the problem is that I can't reach gauss7 from any of the other servers (gauss1 to gauss6). All nodes are connected through a physical InfiniBand switch. – Shibumi Feb 02 '18 at 19:35

1 Answer


A reboot of gauss7 solved the issue.

Shibumi