I'm running a Rocks 6.2 cluster on CentOS 6.8. It consists of a head node, compute nodes and file servers. They're all connected through a local 10Gb switch, and the head node and file servers are also on the datacenter 1Gb switch (the compute nodes are only on the local switch and use the head node as a gateway to the datacenter switch). So, as you can imagine, I mount my file servers over the local switch.
I recently added a new volume to one of my file servers (cslim), rebooted it, and made a couple of changes to get file ownerships showing properly over NFSv4 (changes to /etc/idmapd.conf and /etc/default/nfs-common, which I've done successfully on all the other servers).
The two exported volumes mount correctly onto the head node and a couple of other servers I tried in the cluster. However, I can't get the compute nodes to mount the volumes over the local switch. I just get "mount.nfs: Connection timed out". Mounting onto the compute nodes over the datacenter switch does work.
I can't say for sure whether the compute nodes were ever able to mount from this server, because until recently the only volume on it was for archive and admin files that I handled through the head node.
Additionally, the compute nodes cannot ping or ssh to cslim over the local switch, but the head node and other servers can do so over the local switch. The compute nodes can ping and ssh to cslim over the datacenter switch, and to other servers on the local switch. Trying ssh yields "ssh: connect to host cslim-local port 22: No route to host".
The compute nodes resolve cslim's local hostname (cslim-local) correctly, and everything also fails when I use the IP address directly.
traceroute from a compute node to cslim is timing out, if I'm interpreting it correctly:
[root@compute-0-0 ~]# traceroute cslim-local
traceroute to cslim-local (10.1.1.11), 30 hops max, 60 byte packets
1 compute-0-0.local (10.1.255.254) 3000.757 ms !H 3000.755 ms !H 3000.752 ms !H
[root@compute-0-0 ~]# traceroute picsl-local
traceroute to picsl-local (10.1.1.16), 30 hops max, 60 byte packets
1 picsl-local.local (10.1.1.16) 0.212 ms 0.209 ms 0.204 ms
I've disabled the firewall on cslim, but to no avail. I've rebooted cslim, restarted nfs and rpcidmapd services. cslim is exporting to the compute nodes at 10.1.0.0/255.255.0.0:
[root@cslim ~]# exportfs
<snip>
/mnt/data/archive 10.1.0.0/255.255.0.0
/mnt/data-jux 10.1.0.0/255.255.0.0
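To double-check that the export range really covers the compute nodes, here is a small Python sketch using the standard `ipaddress` module. The `covered` helper and `EXPORT_NET` name are only illustrative; the network is the one printed by exportfs above, and 10.1.255.254 is compute-0-0's address from the traceroute output.

```python
import ipaddress

# Export network exactly as printed by `exportfs` on cslim.
EXPORT_NET = ipaddress.ip_network("10.1.0.0/255.255.0.0")


def covered(client_ip):
    """Return True if client_ip falls inside the exported network."""
    return ipaddress.ip_address(client_ip) in EXPORT_NET


if __name__ == "__main__":
    # compute-0-0 (address taken from the traceroute output)
    print("10.1.255.254 covered:", covered("10.1.255.254"))
```

As far as the export definition goes, compute-0-0 is inside the exported range, so the export line itself does not appear to be what excludes it.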
There's nothing in /var/log/messages or /var/log/secure on cslim or compute nodes when the mount fails.
Does anyone have any ideas?
Update:
The traceroute to cslim-local shown above is timing out with 'host unreachable' (the !H flags), while the traceroute to picsl-local, another server on the same switch, shows it as reachable.
SELinux was set to enforcing on cslim. Setting it to permissive has not helped.
The firewall has also been stopped on the compute node, and that hasn't helped either.
netstat output
On compute node:
[root@compute-0-0 ~]# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
255.255.255.255 0.0.0.0         255.255.255.255 UH        0 0          0 p1p1
170.212.169.128 10.1.1.1        255.255.255.255 UGH       0 0          0 p1p1
224.0.0.0       0.0.0.0         255.255.255.0   U         0 0          0 p1p1
10.1.0.0        0.0.0.0         255.255.0.0     U         0 0          0 p1p1
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 p1p1
0.0.0.0         10.1.1.1        0.0.0.0         UG        0 0          0 p1p1
Note that 10.1.1.1 is the head node.
On cslim:
[root@cslim ~]# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
170.212.169.0   0.0.0.0         255.255.255.0   U         0 0          0 eth0
10.1.1.0        0.0.0.0         255.255.255.0   U         0 0          0 bond0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 bond0
0.0.0.0         170.212.169.2   0.0.0.0         UG        0 0          0 eth0
On picsl (picsl-local shown above in traceroute test. This server can mount cslim volumes over the local switch):
[root@picsl-cluster ~]# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
170.212.169.225 10.1.1.1        255.255.255.255 UGH       0 0          0 eth0
170.212.169.0   0.0.0.0         255.255.255.0   U         0 0          0 em1
192.168.122.0   0.0.0.0         255.255.255.0   U         0 0          0 virbr0
10.1.0.0        0.0.0.0         255.255.0.0     U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 em1
0.0.0.0         170.212.169.2   0.0.0.0         UG        0 0          0 em1
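To make the routing tables easier to compare, here is a rough Python sketch that does a longest-prefix lookup against cslim's routes as posted above. It is only an approximation: the `lookup` helper and `CSLIM_ROUTES` list are names I made up for illustration, and real kernel routing also takes metrics and policy rules into account, which this ignores.

```python
import ipaddress

# Routes copied from the `netstat -rn` output on cslim:
# (destination, genmask, gateway, interface)
CSLIM_ROUTES = [
    ("170.212.169.0", "255.255.255.0", "0.0.0.0", "eth0"),
    ("10.1.1.0", "255.255.255.0", "0.0.0.0", "bond0"),
    ("169.254.0.0", "255.255.0.0", "0.0.0.0", "eth0"),
    ("169.254.0.0", "255.255.0.0", "0.0.0.0", "bond0"),
    ("0.0.0.0", "0.0.0.0", "170.212.169.2", "eth0"),
]


def lookup(routes, address):
    """Return (gateway, interface) of the most specific matching route, or None."""
    addr = ipaddress.ip_address(address)
    best = None
    for dest, mask, gateway, iface in routes:
        net = ipaddress.ip_network(f"{dest}/{mask}")
        # Keep the matching route with the longest prefix (most specific).
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, gateway, iface)
    return None if best is None else (best[1], best[2])


if __name__ == "__main__":
    # compute-0-0 and picsl-local, addresses taken from the traceroute output
    for target in ("10.1.255.254", "10.1.1.16"):
        print(target, "->", lookup(CSLIM_ROUTES, target))
```

With the routes as posted, this lookup sends 10.1.255.254 (compute-0-0) to the default route via 170.212.169.2 on eth0, whereas 10.1.1.16 (picsl-local) matches the 10.1.1.0/24 route on bond0.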
ifconfig
On compute node:
[root@compute-0-0 ~]# ifconfig -a
em1 Link encap:Ethernet HWaddr 90:B1:1C:28:D8:27
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Interrupt:16
em2 Link encap:Ethernet HWaddr 90:B1:1C:28:D8:28
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Interrupt:17
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:360953790 errors:0 dropped:0 overruns:0 frame:0
TX packets:360953790 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1108715304547 (1.0 TiB) TX bytes:1108715304547 (1.0 TiB)
p1p1 Link encap:Ethernet HWaddr 00:10:18:F0:31:A0
inet addr:10.1.255.254 Bcast:10.1.255.255 Mask:255.255.0.0
inet6 addr: fe80::210:18ff:fef0:31a0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1654711736 errors:0 dropped:0 overruns:0 frame:0
TX packets:2560600760 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2054533957261 (1.8 TiB) TX bytes:3252638973302 (2.9 TiB)
Interrupt:80 Memory:d0000000-d07fffff
p1p2 Link encap:Ethernet HWaddr 00:10:18:F0:31:A2
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Interrupt:84 Memory:d1000000-d17fffff
On cslim:
[root@cslim ~]# ifconfig -a
bond0 Link encap:Ethernet HWaddr 00:21:28:3D:6D:03
inet addr:10.1.1.11 Bcast:10.1.1.255 Mask:255.255.255.0
inet6 addr: fe80::221:28ff:fe3d:6d03/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:351143643 errors:0 dropped:0 overruns:0 frame:0
TX packets:22812517 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:495999344326 (461.9 GiB) TX bytes:1721189388 (1.6 GiB)
eth0 Link encap:Ethernet HWaddr 00:21:28:3D:6D:02
inet addr:170.212.169.151 Bcast:170.212.169.255 Mask:255.255.255.0
inet6 addr: fe80::221:28ff:fe3d:6d02/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:22690383 errors:1152 dropped:0 overruns:1150 frame:2
TX packets:2716530 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:33135278971 (30.8 GiB) TX bytes:227883477 (217.3 MiB)
eth1 Link encap:Ethernet HWaddr 00:21:28:3D:6D:03
inet6 addr: fe80::221:28ff:fe3d:6d03/64 Scope:Link
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:2766456 errors:0 dropped:0 overruns:0 frame:0
TX packets:22803974 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:380681543 (363.0 MiB) TX bytes:1720423086 (1.6 GiB)
eth2 Link encap:Ethernet HWaddr 00:21:28:3D:6D:04
inet6 addr: fe80::221:28ff:fe3d:6d03/64 Scope:Link
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:345621248 errors:444 dropped:0 overruns:444 frame:0
TX packets:8492 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:495244880097 (461.2 GiB) TX bytes:757968 (740.2 KiB)
eth3 Link encap:Ethernet HWaddr 00:21:28:3D:6D:05
inet6 addr: fe80::221:28ff:fe3d:6d03/64 Scope:Link
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:2755939 errors:444 dropped:0 overruns:444 frame:0
TX packets:51 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:373782686 (356.4 MiB) TX bytes:8334 (8.1 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:3512 errors:0 dropped:0 overruns:0 frame:0
TX packets:3512 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:517649 (505.5 KiB) TX bytes:517649 (505.5 KiB)
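For comparing the interface settings across hosts, here is a minimal Python sketch that pulls the address and mask out of an old-style ifconfig "inet addr" line and derives the network it implies. The `interface_network` helper and the regular expression are my own assumptions about this output format; the two sample lines are copied from p1p1 on the compute node and bond0 on cslim above.

```python
import ipaddress
import re

# Matches lines such as:
#   inet addr:10.1.1.11 Bcast:10.1.1.255 Mask:255.255.255.0
# The Bcast field is optional (loopback has none).
INET_RE = re.compile(r"inet addr:(\S+)\s+(?:Bcast:\S+\s+)?Mask:(\S+)")


def interface_network(inet_line):
    """Return the IPv4 network implied by an ifconfig 'inet addr' line, or None."""
    match = INET_RE.search(inet_line)
    if match is None:
        return None
    address, mask = match.groups()
    return ipaddress.ip_interface(f"{address}/{mask}").network


# Lines copied from the ifconfig output above.
P1P1 = "inet addr:10.1.255.254 Bcast:10.1.255.255 Mask:255.255.0.0"
BOND0 = "inet addr:10.1.1.11 Bcast:10.1.1.255 Mask:255.255.255.0"

if __name__ == "__main__":
    print("compute-0-0 p1p1:", interface_network(P1P1))
    print("cslim bond0:    ", interface_network(BOND0))
```

For the two lines shown, this gives 10.1.0.0/16 for p1p1 on the compute node and 10.1.1.0/24 for bond0 on cslim.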