So I'm trying to set up an Infiniband network alongside my Ethernet network.

I have 10 compute nodes and one conductor node. All 11 machines are running CentOS and are plugged in to an Infiniband switch and an Ethernet switch.

Ethernet: 192.168.1.0/24
Infiniband: 192.168.2.0/24

The conductor node is 192.168.1.125 (Ethernet) and 192.168.2.125 (Infiniband). Compute node X is 192.168.1.10X (Ethernet) and 192.168.2.10X (Infiniband). All IP addresses are assigned statically.
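
To make the addressing scheme concrete, here is a minimal Python sketch that derives both addresses for a machine from its last octet. The names `ETHERNET`, `INFINIBAND`, `node_addresses` and `compute_node_addresses` are hypothetical helpers made up for illustration. The sketch only covers compute nodes 1 to 9, since the .10X pattern does not say which address compute node 10 gets.

```python
import ipaddress

# The two subnets described above.
ETHERNET = ipaddress.ip_network("192.168.1.0/24")
INFINIBAND = ipaddress.ip_network("192.168.2.0/24")


def node_addresses(last_octet):
    """Return (ethernet_ip, infiniband_ip) for a machine with this last octet."""
    if not 1 <= last_octet <= 254:
        raise ValueError("last octet must be a usable host number in a /24")
    return (
        ETHERNET.network_address + last_octet,
        INFINIBAND.network_address + last_octet,
    )


def compute_node_addresses(x):
    """Return the addresses of compute node X, following the .10X pattern."""
    # Assumption: only single-digit X is handled, because the pattern is
    # ambiguous for compute node 10.
    if not 1 <= x <= 9:
        raise ValueError("only compute nodes 1-9 are covered by this sketch")
    return node_addresses(100 + x)


if __name__ == "__main__":
    print("conductor:", *node_addresses(125))
    print("compute-7:", *compute_node_addresses(7))
```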

So I log in to one of the compute nodes (compute-7):

Here is /etc/sysconfig/network-scripts/ifcfg-em1

DEVICE=em1
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=none
IPV6INIT=no
USERCTL=no

IPADDR=192.168.1.107
NETMASK=255.255.255.0
NETWORK=192.168.1.0
BROADCAST=192.168.1.255
GATEWAY=192.168.1.125   #via conductor node
DNS1=192.168.1.125   #via conductor node

Here is /etc/sysconfig/network-scripts/ifcfg-ib0

DEVICE=ib0
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=none
IPV6INIT=no
USERCTL=no
TYPE=InfiniBand

IPADDR=192.168.2.107
NETMASK=255.255.255.0
NETWORK=192.168.2.0
BROADCAST=192.168.2.255

When I run sudo service network restart on this compute node, here is the output of ifconfig -a:

em1       Link encap:Ethernet  HWaddr xx:xx:xx:xx:3A:FB  
          inet addr:192.168.1.107  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1236641045 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1239585124 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1561224959733 (1.4 TiB)  TX bytes:1560979085053 (1.4 TiB)
          Memory:91220000-91240000 

ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
          inet addr:192.168.2.107  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:256 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

And route -nn gives:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.2.0     0.0.0.0         255.255.255.0   U     0      0        0 ib0
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 em1
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 em1
169.254.0.0     0.0.0.0         255.255.0.0     U     1004   0        0 ib0
0.0.0.0         192.168.1.125   0.0.0.0         UG    0      0        0 em1

This is not what I want! I want 192.168.2.107 (compute node 7) to be able to talk to 192.168.2.108 (compute node 8) via the 192.168.2.x network. The above route is incorrect!

My Infiniband nodes can't talk to each other in this case. As far as I can tell, requests to the 192.168.2.0/24 subnet are routed via 192.168.1.125 (Ethernet), which is very slow.
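
For reference, here is a rough Python sketch of how a routing table like the one above is usually evaluated: the most specific matching prefix wins, and the lower metric breaks ties. It is a simplified model under those assumptions, not the kernel's actual implementation, and `ROUTES` and `lookup` are hypothetical names. The entries are copied from the route -nn output on compute-7.

```python
import ipaddress

# (destination, gateway, metric, interface), copied from the route output above.
# A gateway of None stands for 0.0.0.0, i.e. a directly connected network.
ROUTES = [
    ("192.168.2.0/24", None, 0, "ib0"),
    ("192.168.1.0/24", None, 0, "em1"),
    ("169.254.0.0/16", None, 1002, "em1"),
    ("169.254.0.0/16", None, 1004, "ib0"),
    ("0.0.0.0/0", "192.168.1.125", 0, "em1"),
]


def lookup(destination):
    """Return (interface, gateway) for the best matching route."""
    addr = ipaddress.ip_address(destination)
    matches = [r for r in ROUTES if addr in ipaddress.ip_network(r[0])]
    if not matches:
        raise LookupError("no route to " + destination)
    # Longest prefix first; among equal prefixes, prefer the lowest metric.
    best = max(
        matches,
        key=lambda r: (ipaddress.ip_network(r[0]).prefixlen, -r[2]),
    )
    return best[3], best[1]


if __name__ == "__main__":
    for target in ("192.168.2.108", "192.168.1.108", "8.8.8.8"):
        print(target, "->", lookup(target))
```

Under this model, a packet for 192.168.2.108 matches both the 192.168.2.0/24 entry and the default route, and the /24 entry on ib0 is the more specific of the two.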

I have been trying to set the files

/etc/sysconfig/network-scripts/route-em1

and

/etc/sysconfig/network-scripts/route-ib0

With lines like:

192.168.1.0 netmask 255.255.255.0 gw 192.168.1.125 dev em1

192.168.2.0 netmask 255.255.255.0 gw 192.168.2.125 dev ib0

But every time I restart the network, I get the wrong routing...

Can anyone please help me get the correct routing?

I'm afraid I don't have a complete network understanding and am finding I'm "hacking" a lot here...

All I want is to be able to run ssh ostrich@compute-8-ib (Infiniband) the way I can currently run ssh ostrich@compute-8 (Ethernet).

Once I have a static network figured out, I'll do it all using DHCP and named, but for now I'm just focussing on getting it right statically.

@Frederik Nielsen:

Here is the routing table on the conductor node:

192.168.2.0     0.0.0.0         255.255.255.0   U     0      0        0 ib0
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 em1
137.43.92.0     0.0.0.0         255.255.254.0   U     0      0        0 em2
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 em1
169.254.0.0     0.0.0.0         255.255.0.0     U     1003   0        0 em2
169.254.0.0     0.0.0.0         255.255.0.0     U     1004   0        0 ib0
0.0.0.0         187.42.92.1     0.0.0.0         UG    0      0        0 em2
ostrich
  • The routing table seems correct, as requests for the `192.168.2.0` network go out through the `ib0` interface. Maybe you should look somewhere else? – Frederik Feb 16 '15 at 16:15
  • When I do `ssh 192.168.2.101` from the conductor node, I can't login? "ssh: connect to host 192.168.2.101 port 22: No route to host" – ostrich Feb 17 '15 at 17:10
  • What does the routing table from the conductor node look like? – Frederik Feb 17 '15 at 18:37
  • 192.168.2.107 and 192.168.2.108 are on the same network - you shouldn't need a route to talk. Are they arping properly? – MaQleod Feb 17 '15 at 22:44
  • @FrederikNielsen Thank you for your comment. I just updated the question with the conductor node's routing table. – ostrich Feb 18 '15 at 20:05
  • Hmm, it looks like it should.. What does the arp table say? – Frederik Feb 18 '15 at 20:07
  • @MaQleod Thank you for your comment. When I do `arp -a` I see nothing on 192.168.2.X. I can see all the 192.168.1.Xs though... – ostrich Feb 18 '15 at 20:07
  • I should say that I can ibping between all GUIDs connected to the switch. – ostrich Feb 18 '15 at 20:18
  • ipoib requires standard arp - you should see ipoib entries in the arp table that will have the GUID and IP of the ib devices on the same network. – MaQleod Feb 18 '15 at 20:24
  • @MaQleod I'm afraid I can't see anything to do with Infiniband with `arp -a` Do you have any suggestions? – ostrich Feb 19 '15 at 09:41
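
The ARP check discussed in the comments can be sketched in Python by grouping ARP entries per interface, using the column layout of /proc/net/arp on Linux. This is only an illustrative sketch: `SAMPLE` is made-up data that mirrors the situation described (entries on 192.168.1.x, nothing on 192.168.2.x), the MAC addresses are placeholders, and `arp_entries_by_device` is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up sample in the /proc/net/arp column layout (placeholder MACs).
SAMPLE = """\
IP address       HW type     Flags       HW address            Mask     Device
192.168.1.125    0x1         0x2         00:11:22:33:44:55     *        em1
192.168.1.108    0x1         0x2         00:11:22:33:44:66     *        em1
"""


def arp_entries_by_device(text):
    """Group the IP addresses in an ARP table dump by network interface."""
    by_device = defaultdict(list)
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 6:
            continue  # ignore blank or malformed lines
        ip, _hw_type, _flags, _hw_addr, _mask, device = fields
        by_device[device].append(ip)
    return dict(by_device)


if __name__ == "__main__":
    entries = arp_entries_by_device(SAMPLE)
    for device in ("em1", "ib0"):
        print(device, entries.get(device, []))
```

On a real node, the same function could be fed the contents of /proc/net/arp instead of `SAMPLE`. An empty list for ib0 would correspond to what `arp -a` showed.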
