
So we have some Dell blades and chassis (the blades are M600s, the chassis M1000s) and other systems (an R710 with an MD3000 array). The R710 exports a source tree via NFS for the blades to build and test with.
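
For context, here is roughly what the export and client mount look like; the paths and options below are illustrative assumptions, not the exact configs:

# /etc/exports on the R710 (illustrative path and options)
/export/src  10.1.1.0/24(rw,sync,no_subtree_check)

# /etc/fstab entry on a blade (again, illustrative)
svnwatch-data:/export/src  /mnt/src  nfs  rw,hard,intr,tcp  0 0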

The problem is that the blades lose the NFS mounts. Blades in the same chassis, with what seem like identical configurations, have their connections hang; they cannot even ping the server. They eventually come back.

The hardware is mostly Dell. In fact, we have one cable running from the R710 to a switch in one of the chassis, and another running to a separate switch and from there to the chassis; both paths can have issues.

The blades are running CentOS 5 or Fedora Core release 5 (Bordeaux). The NFS server is running CentOS release 5.4 (Final).

Any thoughts or troubleshooting tips?

These are all to the same host, but via different routes:

Through a switch:

[root@b053 ~]# ping svnwatch-data
PING storage.rack1.rinera.int (10.1.1.54) 56(84) bytes of data.

--- storage.rack1.rinera.int ping statistics ---
9 packets transmitted, 0 received, 100% packet loss, time 7999ms

Routed through another host:

[root@b053 ~]# ping svnwatch-data2
PING storage2.rack1.rinera.int (172.16.100.25) 56(84) bytes of data.
64 bytes from 172.16.100.25: icmp_seq=1 ttl=64 time=0.260 ms
64 bytes from 172.16.100.25: icmp_seq=2 ttl=64 time=0.217 ms
64 bytes from 172.16.100.25: icmp_seq=3 ttl=64 time=0.201 ms
64 bytes from 172.16.100.25: icmp_seq=4 ttl=64 time=0.264 ms

--- storage2.rack1.rinera.int ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2999ms
rtt min/avg/max/mdev = 0.201/0.235/0.264/0.031 ms

With the host connected to a different chassis's switch (they are daisy-chained):

[root@b053 ~]# ping svnwatch-data-eth2
PING svnwatch-data-eth2.rack1.rinera.int (10.1.1.56) 56(84) bytes of data.
64 bytes from 10.1.1.56: icmp_seq=1 ttl=64 time=0.598 ms
64 bytes from 10.1.1.56: icmp_seq=2 ttl=64 time=0.096 ms
64 bytes from 10.1.1.56: icmp_seq=3 ttl=64 time=0.168 ms

--- svnwatch-data-eth2.rack1.rinera.int ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.096/0.287/0.598/0.222 ms
[root@b053 ~]#
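
When the direct path is down like that, a few quick checks from the blade can show whether it is an ARP/link problem or something further along; the interface name eth0 below is an assumption:

[root@b053 ~]# arp -n | grep 10.1.1.54        # is the server's entry present, or marked incomplete?
[root@b053 ~]# arping -c 3 -I eth0 10.1.1.54  # does ARP itself get an answer on the failing path?
[root@b053 ~]# traceroute -n 10.1.1.54        # where does the direct route stop?
[root@b053 ~]# ethtool eth0 | grep -i "link detected"   # has the blade's own link flapped?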
Ronald Pottol
  • It would be interesting if you did continuous monitoring. – 3molo May 07 '11 at 04:31
  • Well, it can be done. Better people than I have been beating their heads against it. It is possible that there is some horrible, weird, wrong config somewhere; while there have been some sharp people before the current group, someone (who no longer works here) had the wifi handing out 1.x addresses (yes, a publicly routable class A), and the internal network uses .int (no, we are not an international treaty organization). – Ronald Pottol May 07 '11 at 05:10
  • I would take a close look at the modules responsible for the network cards in the blades. The second suspect would be the bonding configuration on the blades (if there is any). – Paweł Brodacki May 07 '11 at 05:41
  • Oh, and it worked fairly well for years, only in the past month or so has it gone to hell. – Ronald Pottol May 07 '11 at 06:40
  • We never did track it down; we migrated the files to another server, and it has been fine for almost a week now, versus several machines a day having their mounts hang. We changed from a 64-bit to a 32-bit Linux (one Fedora Core, the other CentOS, of similar old vintage), from 16 to 4 GB of RAM, and to a much nicer, better-configured disk array. – Ronald Pottol Jul 14 '11 at 20:34

1 Answer


Here is what I would check.

  • route tables: ip route show
  • route cache: ip route show cache
  • check for any weird iptables rules: iptables -t nat -L -n -v; iptables -L -n -v; iptables -t mangle -L -n -v
  • check log files.
  • check kernel version.
  • check sysctl/proc settings such as rp_filter, which is important in routed/multi-interface configurations (see the sketch after this list)
  • check ARP tables for IP conflicts etc.
  • and of course: tcpdump and tcpflow...
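
For the rp_filter and tcpdump points in particular, a minimal sketch; eth0 is an assumed interface name and 10.1.1.54 is the server address from the question:

# 1 = strict reverse-path filtering, which can silently drop packets
# in asymmetric / multi-homed setups like this one
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth0.rp_filter

# watch the NFS traffic itself (port 2049) on the flaky path
tcpdump -ni eth0 host 10.1.1.54 and port 2049

# or just ARP, since the blade cannot even ping the server during an outage
tcpdump -ni eth0 arp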
Wim Kerkhoff