SLES11 random unreachability from a single machine

Question

I'm experiencing a very weird problem and I'm completely lost. I have set up several SUSE SLES 11 SP2 machines since my company is trying to switch to SUSE, and every single SLES 11 machine has this specific issue:

Once installed, everything works fine. However, when connecting via SSH from a Debian machine (Squeeze and Wheezy) or PuTTY (latest version), the connection occasionally drops and the server stays unreachable from that client only. If I then connect to it via some other server, it works, while I cannot even ping the server from my own machine.

More Details:

tcpdump on the machine sees my own ping attempts but no reply is being sent
SSH simply times out while that happens
restarting the network interface or rebooting resolves the issue temporarily
occurs randomly, anywhere between 1 minute and several hours in
all machines are on the same subnet
all machines are connected to a Cisco switch, no VLAN configured on this subnet
checked for IP theft (e.g. a laptop sleeping and waking up at random), found nothing
to complete the mess, connections from a RedHat 6 machine (exactly the same hardware) never show this issue
the e1000e module is used on all these machines (except for Windows with PuTTY, of course); updating to the latest firmware on one or both sides did not help
network cables have also been swapped, no success
the eeprom_fix_82574_or_82583 did not fix this issue, even though the problem it addresses was present on some of these machines
installing Debian on the problematic machines resolves the issue, but that is not wanted for company reasons...

So here I am, completely clueless... Does anyone have even the slightest idea what is wrong here?

Okay, I was wrong about one thing: my machine is NOT on the same subnet as the other machines. A buddy who is on the same subnet as the servers said he never had these problems. My machine is in a different subnet, which should be able to connect without trouble. Traceroute shows one hop from my machine to the server, but two hops(!) from the server to my machine. So after some time the router sends a RIP packet suggesting a more direct route. My machine accepts it, the SUSE machines don't. How can I fix that? – , Feb 28 '13 at 09:35

score 0 · Answer 1 · answered Feb 21 '13 at 12:13

tcpdump on the machine sees my own ping attempts but no reply is being sent

On GNU/Linux, tcpdump sees incoming packets even when the local iptables firewall drops them afterwards. So if you see the requests but no reply being sent, either inbound ICMP is being blocked or the machine has no ARP entry for the source.

To diagnose such issues, tcpdump's -e switch is helpful, as it prints link-layer addresses. While analysing the traffic, make sure to capture ARP traffic as well.

To rule out ARP (the easy check), add static/permanent ARP entries and see whether the issue goes away. If it does, then probably someone is stealing ARPs or some ARP filtering is going on.
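The two checks above could be scripted roughly as follows. This is only a sketch: the interface name, peer IP and MAC address are placeholder assumptions, and the helper names (`run`, `capture_arp_icmp`, `pin_arp_entry`) are made up for illustration. By default the commands are only printed, so nothing is changed until `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Placeholder values (assumptions, not from the thread); substitute your own.
IFACE="${IFACE:-eth0}"
PEER_IP="${PEER_IP:-192.0.2.10}"
PEER_MAC="${PEER_MAC:-00:11:22:33:44:55}"

# Print the command by default; set DRY_RUN=0 to really execute it (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Capture ARP and ICMP traffic, showing link-layer (MAC) addresses (-e)
# and skipping name resolution (-n).
capture_arp_icmp() {
    run tcpdump -e -n -i "$IFACE" 'arp or icmp'
}

# Pin a permanent neighbour (ARP) entry for the peer, so a wrong or missing
# ARP reply can no longer change the mapping.
pin_arp_entry() {
    run ip neigh replace "$PEER_IP" lladdr "$PEER_MAC" nud permanent dev "$IFACE"
}
```

Calling `pin_arp_entry` with `DRY_RUN=0` on both ends and then watching `capture_arp_icmp` while reproducing the drop should show whether ARP is involved at all.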

abbe

Thanks. iptables is not involved, as it is empty and allows everything. Things like AppArmor and SELinux are disabled. I have tried tcpdump -e and discovered TCP retransmissions occurring exactly when the connection drops. After 5 minutes or so, ping started working again, but SSH won't (it gets stuck at debug2: channel 0: open confirm rwindow 0 rmax 32768). Also, a lot of ARP requests and replies are piling up. Still working my way into this... – Feb 21 '13 at 13:47
Also lots of TCP Retransmissions when trying SSH now AFTER it stopped working. – Feb 21 '13 at 13:54
Did you try adding static ARP entries? Also look in switch's CAM table for the MACs of affected hosts, if switch is aware of them. – abbe Feb 21 '13 at 14:57
I think my description of the issue was wrong, as I overlooked a routing factor, described in the added info above... my bad. Still, it helped me find these lost packets, which led to that possible cause. – Feb 28 '13 at 09:37
Are you sure that's a `RIP` packet, and not an [ICMP Redirect](http://en.wikipedia.org/wiki/Internet_Control_Message_Protocol#Redirect) packet? There are a couple of [`icmp` related sysctl knobs](http://lxr.linux.no/linux+v3.8.1/Documentation/networking/ip-sysctl.txt#L816) in Linux which you might like to play with. – abbe Feb 28 '13 at 15:55
I've tracked down random connection loss problems with SLES 11 machines to a network issue definitely related to SLES 11 only. The Issue: Connections to SLES 11 Servers using SSH from a different subnet stop working at random. Even Ping fails from my machine. A different machine in my subnet however CAN ping while I cannot. The issue does **not** happen when the machine connecting is on the same subnet as the SLES server. – Mar 04 '13 at 14:36
Because of packet loss (found using tcpdump) at the exact moment the connection drops, our net admin suggested that SLES 11 probably doesn't handle RIP packets correctly by default. Indeed, RedHat, Debian, Windows etc. do not appear to have this issue in the very same setup. – Mar 04 '13 at 14:36
Routing Infos while it works: http://pastebin.com/HkjsaG45 – Mar 04 '13 at 14:37
Okay, this is DEFINITELY a RIP issue. Made an Update to the pastebin info. http://pastebin.com/mLWFJ1FJ – Mar 05 '13 at 10:49

score 0 · Accepted Answer · answered Mar 05 '13 at 12:34

Thanks to abbe, I've found a solution to this problem:

Simply. Disable. Iptables.

Completely, that is: first disable the firewall in YaST, then prevent the kernel modules from loading at boot time.

Create this file and reboot, then check with lsmod whether iptables is still loaded:

    nano /etc/modprobe.d/netfilter.conf

    alias ip_tables off
    alias iptable off
    alias iptable_nat off
    alias iptable_filter off
    alias x_tables off
    alias nf_nat off
    alias nf_conntrack_ipv4 off
    alias nf_conntrack off
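
For the lsmod check after the reboot, a small filter along these lines might help. It is a sketch only: the function name `loaded_netfilter_modules` is made up for illustration, and the module names are simply the ones from the file above.

```shell
#!/bin/sh
# Module names copied from the netfilter.conf listing above.
NF_MODULES="ip_tables iptable iptable_nat iptable_filter x_tables nf_nat nf_conntrack_ipv4 nf_conntrack"

# Reads lsmod-style output on stdin and prints every listed module
# that shows up in the first column, i.e. is still loaded.
loaded_netfilter_modules() {
    awk -v wanted="$NF_MODULES" '
        BEGIN {
            n = split(wanted, names, " ")
            for (i = 1; i <= n; i++) want[names[i]] = 1
        }
        ($1 in want) { print $1 }
    '
}
```

Run it as `lsmod | loaded_netfilter_modules`; empty output would mean none of the listed modules is loaded.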

Once the modules were disabled, the routing issue resolved itself: the unnecessary intermediate hop is gone, so the path is now a single hop instead of two right from the start.

Source: http://backstage.soundcloud.com/2012/08/shoot-yourself-in-the-foot-with-iptables-and-kmod-auto-loading/

From looking at your pastes, it seems like your routing is messed up. Disabling `iptables` just lets the machine see the ICMP redirect messages, which were probably being blocked by the firewall. For `RIP`, you would need a RIP daemon (to update the local routing table based on RIP updates), which seems unlikely to be running. A solution would be to audit/fix both forward and reverse routing between the hosts, so that they do not rely on ICMP redirect messages. – abbe, Mar 05 '13 at 15:18
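
As a starting point for such an audit, one could list how each interface is configured to treat ICMP redirects. The sketch below reads the `accept_redirects` knobs from the sysctl tree; the function name `show_accept_redirects` and the optional root-directory argument are assumptions made for illustration, and `CLIENT_IP` in the trailing comment is a placeholder.

```shell
#!/bin/sh
# Prints "<interface> <value>" for every accept_redirects knob found under
# the given sysctl root (defaults to the live /proc/sys tree).
show_accept_redirects() {
    root="${1:-/proc/sys}"
    for f in "$root"/net/ipv4/conf/*/accept_redirects; do
        [ -r "$f" ] || continue
        iface=$(basename "$(dirname "$f")")
        printf '%s %s\n' "$iface" "$(cat "$f")"
    done
}

# To see which next hop the kernel would currently use for replies to the
# client in the other subnet (CLIENT_IP is a placeholder):
#   ip route get "$CLIENT_IP"
```

Comparing the output of `show_accept_redirects` and `ip route get` on a SLES host and on a host that behaves correctly may show where the return path differs.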
