File server unavailable to part of cluster

Question

I'm running a Rocks 6.2 cluster on CentOS 6.8. It consists mainly of a head node, compute nodes, and file servers. They are all connected through a local 10Gb switch, and the head node and file servers are also on the datacenter 1Gb switch (the compute nodes are only on the local switch and use the head node as a gateway to the datacenter switch). As you might expect, I mount my file servers over the local switch.

I recently added a new volume to one of my file servers (cslim), rebooted and made a couple changes to get file ownerships showing properly over NFSv4 (changes to /etc/idmapd.conf and /etc/default/nfs-common which I've done successfully for all other servers).

The two exported volumes mount correctly onto the head node and a couple other servers I tried in the cluster. However I can't get the compute nodes to mount the volumes over the local switch. I just get mount.nfs: Connection timed out. Mounting onto the compute nodes over the datacenter switch does work.

I can't say for sure if the server was previously mounting to the compute nodes, because the only volume on there until recently was for archive and admin files that I handled through the head node.

Additionally, the compute nodes cannot ping or ssh to cslim over the local switch, but the head node and other servers can do so over the local switch. The compute nodes can ping and ssh to cslim over the datacenter switch, and to other servers on the local switch. Trying ssh yields ssh: connect to host cslim-local port 22: No route to host.

The compute nodes resolve cslim's local hostname (cslim-local) correctly, and everything also fails when using the IP address directly.

traceroute from compute node to cslim is timing out, if I'm interpreting it correctly:

[root@compute-0-0 ~]# traceroute cslim-local
traceroute to cslim-local (10.1.1.11), 30 hops max, 60 byte packets 
1 compute-0-0.local (10.1.255.254) 3000.757 ms !H 3000.755 ms !H 3000.752 ms !H 

[root@compute-0-0 ~]# traceroute picsl-local 
traceroute to picsl-local (10.1.1.16), 30 hops max, 60 byte packets 
1 picsl-local.local (10.1.1.16) 0.212 ms 0.209 ms 0.204 ms

I've disabled the firewall on cslim, but to no avail. I've rebooted cslim, restarted nfs and rpcidmapd services. cslim is exporting to the compute nodes at 10.1.0.0/255.255.0.0:

[root@cslim ~]# exportfs
<snip>
/mnt/data/archive 10.1.0.0/255.255.0.0
/mnt/data-jux     10.1.0.0/255.255.0.0

There's nothing in /var/log/messages or /var/log/secure on cslim or compute nodes when the mount fails.

Does anyone have any ideas?

Update:

traceroute is timing out, with 'host unreachable':

[root@compute-0-0 ~]# traceroute cslim-local
traceroute to cslim-local (10.1.1.11), 30 hops max, 60 byte packets
 1  compute-0-0.local (10.1.255.254)  3000.757 ms !H  3000.755 ms !H  3000.752 ms !H

This shows that another server on the same switch is reachable:

[root@compute-0-0 ~]# traceroute picsl-local
traceroute to picsl-local (10.1.1.16), 30 hops max, 60 byte packets
 1  picsl-local.local (10.1.1.16)  0.212 ms  0.209 ms  0.204 ms

SELinux was set to enforcing on cslim. Setting to permissive has not helped.

Firewall has been stopped on compute node, and that hasn't helped either.

netstat output

On compute node:

[root@compute-0-0 ~]# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
255.255.255.255 0.0.0.0         255.255.255.255 UH        0 0          0 p1p1
170.212.169.128 10.1.1.1        255.255.255.255 UGH       0 0          0 p1p1
224.0.0.0       0.0.0.0         255.255.255.0   U         0 0          0 p1p1
10.1.0.0        0.0.0.0         255.255.0.0     U         0 0          0 p1p1
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 p1p1
0.0.0.0         10.1.1.1        0.0.0.0         UG        0 0          0 p1p1

Note that 10.1.1.1 is the head node.

On cslim:

[root@cslim ~]# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
170.212.169.0   0.0.0.0         255.255.255.0   U         0 0          0 eth0
10.1.1.0        0.0.0.0         255.255.255.0   U         0 0          0 bond0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 bond0
0.0.0.0         170.212.169.2   0.0.0.0         UG        0 0          0 eth0

On picsl (picsl-local in the traceroute test above; this server can mount cslim volumes over the local switch):

[root@picsl-cluster ~]# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
170.212.169.225 10.1.1.1        255.255.255.255 UGH       0 0          0 eth0
170.212.169.0   0.0.0.0         255.255.255.0   U         0 0          0 em1
192.168.122.0   0.0.0.0         255.255.255.0   U         0 0          0 virbr0
10.1.0.0        0.0.0.0         255.255.0.0     U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 em1
0.0.0.0         170.212.169.2   0.0.0.0         UG        0 0          0 em1

ifconfig

On compute node:

[root@compute-0-0 ~]# ifconfig -a
em1       Link encap:Ethernet  HWaddr 90:B1:1C:28:D8:27  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Interrupt:16 

em2       Link encap:Ethernet  HWaddr 90:B1:1C:28:D8:28  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Interrupt:17 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:360953790 errors:0 dropped:0 overruns:0 frame:0
          TX packets:360953790 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1108715304547 (1.0 TiB)  TX bytes:1108715304547 (1.0 TiB)

p1p1      Link encap:Ethernet  HWaddr 00:10:18:F0:31:A0  
          inet addr:10.1.255.254  Bcast:10.1.255.255  Mask:255.255.0.0
          inet6 addr: fe80::210:18ff:fef0:31a0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1654711736 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2560600760 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2054533957261 (1.8 TiB)  TX bytes:3252638973302 (2.9 TiB)
          Interrupt:80 Memory:d0000000-d07fffff 

p1p2      Link encap:Ethernet  HWaddr 00:10:18:F0:31:A2  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Interrupt:84 Memory:d1000000-d17fffff

On cslim:

[root@cslim ~]# ifconfig -a
bond0     Link encap:Ethernet  HWaddr 00:21:28:3D:6D:03  
          inet addr:10.1.1.11  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::221:28ff:fe3d:6d03/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:351143643 errors:0 dropped:0 overruns:0 frame:0
          TX packets:22812517 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:495999344326 (461.9 GiB)  TX bytes:1721189388 (1.6 GiB)

eth0      Link encap:Ethernet  HWaddr 00:21:28:3D:6D:02  
          inet addr:170.212.169.151  Bcast:170.212.169.255  Mask:255.255.255.0
          inet6 addr: fe80::221:28ff:fe3d:6d02/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:22690383 errors:1152 dropped:0 overruns:1150 frame:2
          TX packets:2716530 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:33135278971 (30.8 GiB)  TX bytes:227883477 (217.3 MiB)

eth1      Link encap:Ethernet  HWaddr 00:21:28:3D:6D:03  
          inet6 addr: fe80::221:28ff:fe3d:6d03/64 Scope:Link
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:2766456 errors:0 dropped:0 overruns:0 frame:0
          TX packets:22803974 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:380681543 (363.0 MiB)  TX bytes:1720423086 (1.6 GiB)

eth2      Link encap:Ethernet  HWaddr 00:21:28:3D:6D:04  
          inet6 addr: fe80::221:28ff:fe3d:6d03/64 Scope:Link
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:345621248 errors:444 dropped:0 overruns:444 frame:0
          TX packets:8492 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:495244880097 (461.2 GiB)  TX bytes:757968 (740.2 KiB)

eth3      Link encap:Ethernet  HWaddr 00:21:28:3D:6D:05  
          inet6 addr: fe80::221:28ff:fe3d:6d03/64 Scope:Link
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:2755939 errors:444 dropped:0 overruns:444 frame:0
          TX packets:51 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:373782686 (356.4 MiB)  TX bytes:8334 (8.1 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:3512 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3512 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:517649 (505.5 KiB)  TX bytes:517649 (505.5 KiB)

Start with the simplest problem - getting `ping` working between compute nodes and cslim-local. What does `traceroute cslim-local` show when run on one of the compute nodes? Are there any firewall restrictions on the compute nodes? Have you tried disabling the firewall there? — Paul Haldane, Apr 21 '17 at 10:46
traceroute is timing out, with 'host unreachable': [root@compute-0-0 ~]# traceroute cslim-local traceroute to cslim-local (10.1.1.11), 30 hops max, 60 byte packets 1 compute-0-0.local (10.1.255.254) 3000.757 ms !H 3000.755 ms !H 3000.752 ms !H [root@compute-0-0 ~]# traceroute picsl-local traceroute to picsl-local (10.1.1.16), 30 hops max, 60 byte packets 1 picsl-local.local (10.1.1.16) 0.212 ms 0.209 ms 0.204 ms — Michael S, Apr 21 '17 at 14:44
Yup that's timing out but it also tells us that the compute node expects to be able to reach the file server directly. Would be useful to see the output from `netstat -rn` and `ifconfig -a` on both machines (since a routing problem could be at either end). — Paul Haldane, Apr 21 '17 at 14:53
Sorry for the formatting mess in my comment above, I've moved new info to my original post instead. I'll add netstat and ifconfig output there too. — Michael S, Apr 21 '17 at 14:57

score 1 · Accepted Answer · answered Apr 23 '17 at 09:01

I think the problem is the netmask on the file server's 10.x interface. Here's my understanding of the current setup ...

|   machine   |      IP      |    netmask    | cidr |
|-------------|--------------|---------------|------|
| compute-0-0 | 10.1.255.254 |   255.255.0.0 | /16  |
| picsl       |    10.1.1.16 |   255.255.0.0 | /16  |
| cslim       |    10.1.1.11 | 255.255.255.0 | /24  |

This means that compute-0-0 and picsl both think they can reach cslim directly. cslim's /24, however, covers only 10.1.1.0 to 10.1.1.255, so it can reach picsl directly but has to go through a gateway to reach compute-0-0 (10.1.255.254). That's probably not what you expect, and it won't work.
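To make the asymmetry concrete, here is a minimal sketch using Python's standard `ipaddress` module. The addresses and netmasks are the ones from the table; the `interfaces` dictionary and the `on_link` helper are hypothetical names introduced only for illustration.

```python
import ipaddress

# Addresses and netmasks taken from the table above.
interfaces = {
    "compute-0-0": ipaddress.ip_interface("10.1.255.254/255.255.0.0"),
    "picsl": ipaddress.ip_interface("10.1.1.16/255.255.0.0"),
    "cslim": ipaddress.ip_interface("10.1.1.11/255.255.255.0"),
}


def on_link(src: str, dst: str) -> bool:
    """True if src considers dst's address part of its own local subnet."""
    return interfaces[dst].ip in interfaces[src].network


pairs = [
    ("compute-0-0", "cslim"),
    ("cslim", "compute-0-0"),
    ("picsl", "cslim"),
    ("cslim", "picsl"),
]
for src, dst in pairs:
    verdict = "direct" if on_link(src, dst) else "via gateway"
    print(f"{src} -> {dst}: {verdict}")
```

Only the cslim to compute-0-0 direction comes out as "via gateway", which matches the symptom that picsl works while the compute nodes do not.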

Based on the information I've seen so far, the fix is to change the netmask for the file server's 10.x interface (bond0) to 255.255.0.0. However, there may be reasons for the current setup, so check with your local network team if you have one.
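As a rough check of what the netmask change should do, the sketch below models cslim's routing table from the `netstat -rn` output in the question and picks a route by longest prefix match. It is a simplified model, not how the kernel is implemented: it ignores metrics and policy routing, and it leaves out the 169.254.0.0 link-local entries. The function names `pick_route` and `cslim_routes` are hypothetical.

```python
import ipaddress


def pick_route(dest, routes):
    """Return the (network, gateway, iface) entry with the longest matching prefix."""
    addr = ipaddress.ip_address(dest)
    matches = [r for r in routes if addr in r[0]]
    return max(matches, key=lambda r: r[0].prefixlen)


def cslim_routes(bond0_net):
    """cslim's routes from the question, with the bond0 network as a parameter."""
    net = ipaddress.ip_network
    return [
        (net("170.212.169.0/24"), None, "eth0"),
        (net(bond0_net), None, "bond0"),
        (net("0.0.0.0/0"), "170.212.169.2", "eth0"),  # default route
    ]


compute = "10.1.255.254"

# Current setup: bond0 is 10.1.1.0/24, so only the default route matches.
before = pick_route(compute, cslim_routes("10.1.1.0/24"))

# Proposed fix: bond0 becomes 10.1.0.0/16, which covers the compute node.
after = pick_route(compute, cslim_routes("10.1.0.0/16"))

print("before:", before[2], "gateway", before[1])
print("after: ", after[2], "gateway", after[1])
```

In this model, traffic from cslim to the compute node leaves through eth0 toward the datacenter gateway before the change, and goes out bond0 directly once the netmask is 255.255.0.0.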

Yes! That was it. I'm the local network team for the scope of this cluster, so there's no one to blame or ask but myself. Thanks very much for the help, Paul. — Michael S, Apr 24 '17 at 16:48
