So I'm trying to set up an Infiniband network alongside my Ethernet network.
I have 10 compute nodes and one conductor node. All 11 machines are running CentOS and are plugged in to an Infiniband switch and an Ethernet switch.
Ethernet: 192.168.1.0/24 Infiniband: 192.168.2.0/24
The conductor node is 192.168.1.125 (Ethernet) and 192.168.2.125 (Infiniband). Compute node X is 192.168.1.10X (Ethernet) and 192.168.2.10X (Infiniband). All IP addresses are assigned statically.
So I log in to one of the compute nodes (compute-7):
Here is /etc/sysconfig/network-scripts/ifcfg-em1
DEVICE=em1
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=none
IPV6INIT=no
USERCTL=no
IPADDR=192.168.1.107
NETMASK=255.255.255.0
NETWORK=192.168.1.0
BROADCAST=192.168.1.255
GATEWAY=192.168.1.125 #via conductor node
DNS1=192.168.1.125 #via conductor node
Here is /etc/sysconfig/network-scripts/ifcfg-ib0
DEVICE=ib0
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=none
IPV6INIT=no
USERCTL=no
TYPE=InfiniBand
IPADDR=192.168.2.107
NETMASK=255.255.255.0
NETWORK=192.168.2.0
BROADCAST=192.168.2.255
When I do:
sudo service network restart
on this compute node, here is the output of ifconfig -a:
em1 Link encap:Ethernet HWaddr xx:xx:xx:xx:3A:FB
inet addr:192.168.1.107 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1236641045 errors:0 dropped:0 overruns:0 frame:0
TX packets:1239585124 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1561224959733 (1.4 TiB) TX bytes:1560979085053 (1.4 TiB)
Memory:91220000-91240000
ib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:192.168.2.107 Bcast:192.168.2.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
And route -nn gives:
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
192.168.2.0 0.0.0.0 255.255.255.0 U 0 0 0 ib0
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 em1
169.254.0.0 0.0.0.0 255.255.0.0 U 1002 0 0 em1
169.254.0.0 0.0.0.0 255.255.0.0 U 1004 0 0 ib0
0.0.0.0 192.168.1.125 0.0.0.0 UG 0 0 0 em1
This is not what I want! I want 192.168.2.107 (compute node 7) to be able to talk to 192.168.2.108 (compute node 8) via the 192.168.2.x network. The above route is incorrect!
In this case my Infiniband nodes can't talk to each other directly: requests to the 192.168.2.0/24 subnet are routed via 192.168.1.125 (Ethernet), which is very slow.
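For reference, here is a minimal sketch of longest-prefix route selection applied to the compute-7 table above. It is only a rough model of what the kernel does: metrics, policy routing and source-address selection are ignored, and the function name lookup is just a name chosen for this illustration.

```python
import ipaddress

# Routing table copied from the route output on compute-7:
# (destination, genmask, gateway, interface)
ROUTES = [
    ("192.168.2.0", "255.255.255.0", "0.0.0.0", "ib0"),
    ("192.168.1.0", "255.255.255.0", "0.0.0.0", "em1"),
    ("169.254.0.0", "255.255.0.0", "0.0.0.0", "em1"),
    ("169.254.0.0", "255.255.0.0", "0.0.0.0", "ib0"),
    ("0.0.0.0", "0.0.0.0", "192.168.1.125", "em1"),
]


def lookup(dest, routes=ROUTES):
    """Return (interface, gateway) of the most specific route that contains dest.

    Simplification: ties between equally specific routes go to the first
    entry listed; the Metric column is not modelled.
    """
    addr = ipaddress.ip_address(dest)
    best = None
    for network, mask, gateway, iface in routes:
        net = ipaddress.ip_network(f"{network}/{mask}")
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface, gateway)
    if best is None:
        raise LookupError(f"no route to {dest}")
    return best[1], best[2]


if __name__ == "__main__":
    for target in ("192.168.2.108", "192.168.1.108", "8.8.8.8"):
        iface, gateway = lookup(target)
        print(f"{target}: dev {iface} via {gateway}")
```

Under this simplified model, 192.168.2.108 matches the 192.168.2.0/24 entry on ib0 (gateway 0.0.0.0, i.e. directly connected) rather than the default route via 192.168.1.125.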
I have been trying to set the files
/etc/sysconfig/network-scripts/route-em1
and
/etc/sysconfig/network-scripts/route-ib0
With lines like:
192.168.1.0 netmask 255.255.255.0 gw 192.168.1.125 dev em1
192.168.2.0 netmask 255.255.255.0 gw 192.168.2.125 dev ib0
But every time I restart the network, I get the wrong routing...
Can anyone please help me as to how I might get the correct routing?
I'm afraid I don't have a complete network understanding and am finding I'm "hacking" a lot here...
Can anyone help me? All I want to do is to be able to do ssh ostrich@compute-8-ib (Infiniband) the way I currently can do ssh ostrich@compute-8 (Ethernet).
Once I have a static network figured out, I'll do it all using DHCP and named, but for now I'm just focussing on getting it right statically.
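Until named is in place, names such as compute-8-ib need to resolve some other way, for example through /etc/hosts. Below is a minimal sketch that generates such entries from the addressing scheme described at the top. It assumes that "10X" means a last octet of 100 + X (so node 10 would be .110) and that every node follows the compute-N / compute-N-ib naming seen in the ssh commands. The helper name hosts_lines is hypothetical.

```python
def hosts_lines(node_count=10):
    """Build /etc/hosts-style lines for the Ethernet and Infiniband names.

    Assumption: compute node X uses last octet 100 + X on both subnets.
    """
    lines = []
    for x in range(1, node_count + 1):
        last_octet = 100 + x
        lines.append(f"192.168.1.{last_octet} compute-{x}")     # Ethernet name
        lines.append(f"192.168.2.{last_octet} compute-{x}-ib")  # Infiniband name
    return lines


if __name__ == "__main__":
    # Print the entries so they can be reviewed before being added to /etc/hosts.
    print("\n".join(hosts_lines()))
```

The output is only printed, not written anywhere, so it can be checked against the real addresses first.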
@Frederic Nielsen:
Here is the routing table on the conductor node:
192.168.2.0 0.0.0.0 255.255.255.0 U 0 0 0 ib0
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 em1
137.43.92.0 0.0.0.0 255.255.254.0 U 0 0 0 em2
169.254.0.0 0.0.0.0 255.255.0.0 U 1002 0 0 em1
169.254.0.0 0.0.0.0 255.255.0.0 U 1003 0 0 em2
169.254.0.0 0.0.0.0 255.255.0.0 U 1004 0 0 ib0
0.0.0.0 187.42.92.1 0.0.0.0 UG 0 0 0 em2