6

I'm seeing some bizarre behavior with Ubuntu Server 10.04 64-bit on two of our new servers (both fresh installs). I have Ubuntu Server (same version) deployed on 4-5 other servers without this issue.

Initially I cannot ssh into a fresh server install until I manually set the address that the ssh server is listening on in /etc/ssh/sshd_config. Once I've connected, I seem to be kicked out at random intervals with the following error:

Write failed: Broken pipe

Using "ssh -vv" doesn't show any other information. When I'm kicked out in this manner, I cannot reconnect for another seemingly random period of time. Sometimes a few seconds, others a few minutes. If I run "netstat -nap|grep :22", I can see that my connection still exists after the write failed error. I can't seem to re-connect until that connection drops.

After one of these errors, if I hop onto the server from the console, ssh into another machine, and then attempt to ssh back into the server, everything works fine.

Using "-o TCPKeepAlive=yes" client side doesn't seem to affect anything. I've disabled both iptables and ufw on the server. AppArmor is not showing any enforced profiles and SELinux isn't installed.

My logs aren't reporting any errors and I don't have any custom configs. This is a box-stock install. Note that when I try to get back in after the broken pipe error, this is the error I get:

ssh: connect to host 172.22.50.92 port 22: Connection refused

And nmap no longer shows port 22 as being open, though netstat on the server says it's still listening on port 22.

EDIT - I'm not sure if it means anything, but I've installed KVM on these hosts and I can ssh into the guests (ubuntu server 64bit as well) without any issue.

UPDATE - I've tried purging openssh and re-installing with apt. I've also purged and installed openssh from source with no luck. Traceroutes and pings run overnight show no packet loss whatsoever.

YET ANOTHER UPDATE - Dell seems to think that we've got a bad motherboard in the server. Having that replaced to see if it resolves the issue.

cmhobbs
  • Try including the options below in your /root/.ssh/config file on the client side: Host hostnameoftheserver, User root, Hostname ipoftheserver, ServerAliveInterval 240, ServerAliveCountMax 4. It might help. – Ramesh Kumar Nov 18 '10 at 14:49
  • I'm not using ssh as root. Will this apply for my local user account? – cmhobbs Nov 18 '10 at 16:56
  • Check the secure log file in /var/log/secure; it may give some clue. – Sri Mar 23 '11 at 13:53
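
The keepalive options suggested in the comments can also be passed per invocation, which works for any local account and not only root (the same settings could equally go in that user's own ~/.ssh/config). A minimal sketch: `ssh_keepalive` is a hypothetical wrapper name and the host in the usage line is a placeholder, neither comes from the thread.

```shell
# Hypothetical wrapper: run ssh with the client-side keepalive options
# from the comment above (a probe every 240 s, give up after 4 misses).
ssh_keepalive() {
    ssh -o ServerAliveInterval=240 -o ServerAliveCountMax=4 "$@"
}

# Usage (placeholder user and host):
#   ssh_keepalive user@server.example.com
```

These options only keep an idle session alive; they would not be expected to help if packets are actually being dropped on the path.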

4 Answers

4

Use mtr to check the network. Try a command like mtr -i 15 remotehost. Leave this running in a window, or use screen so you can detach. It should catch any problems with the network. Packet loss is typically 0% on most of my systems.
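
For an unattended run, mtr's report mode may be easier to review afterwards than the interactive display. A rough sketch, assuming mtr is installed; `mtr_report` is a hypothetical helper name, and the default cycle count and the log file name are arbitrary choices rather than part of this answer.

```shell
# Hypothetical helper: run mtr in report mode against one host.
#   $1 = target host
#   $2 = number of probe cycles (default 240, i.e. about one hour at 15 s)
mtr_report() {
    host="$1"
    cycles="${2:-240}"
    # -r: report mode, -n: skip DNS lookups,
    # -i 15: 15 seconds between probes, -c: number of cycles
    mtr -r -n -i 15 -c "$cycles" "$host"
}

# Usage (remotehost is a placeholder):
#   mtr_report remotehost 240 > mtr-remotehost.log
```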

EDIT: What does the output of arp -n show for your IP address before and after ssh drops? You may want to try this on another server on the same subnet. There should be only one HW address for the IP address and it should not change. If it does, you have an IP address conflict.
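
One way to watch for a changing HW address is to log arp -n output over time and then list the distinct addresses seen for the server's IP. A minimal sketch, assuming the net-tools column layout (IP in the first column, HW address in the third); `macs_for_ip` and the log file name are hypothetical.

```shell
# Hypothetical helper: read `arp -n` output on stdin and print each
# distinct hardware address listed for the IP given as $1.
macs_for_ip() {
    # Keep rows for this IP whose third column looks like a MAC address,
    # normalise case, then de-duplicate.
    awk -v ip="$1" '$1 == ip && $3 ~ /:/ { print tolower($3) }' | sort -u
}

# Usage, from another host on the same subnet:
#   while true; do arp -n; sleep 10; done >> arp.log   # stop with Ctrl-C
#   macs_for_ip 172.22.50.92 < arp.log
```

More than one line of output for the same IP would point toward the address conflict described above.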

BillThor
3

This post resolved the issue: massive packet loss when servers are brought online

cmhobbs
2

Ok.. sooo from what i can assume from glancing at this...

you're basically getting extended drop outs..

1.) You have a bad network connection..

2.) The network the server is on, has a bad network connection / bad router / bad something :P

3.) Your servers have conflicting addresses / problem hardware.

My solution..

Run a ping overnight.. and see how many packets you lose in the morning :D (just to see if I was heading in the right direction)
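
A small sketch of that overnight check, assuming the usual Linux ping summary line ("... N% packet loss ..."); `ping_loss` is a hypothetical helper and the interval and count are arbitrary values.

```shell
# Hypothetical helper: read ping output on stdin and print only the
# packet-loss percentage from the summary line.
ping_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage: one probe every 5 s, 7200 probes (about 10 hours):
#   ping -i 5 -c 7200 172.22.50.92 | ping_loss
```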

Hope this helps..

Arenstar
  • I've already attempted this. 0% packet loss and the other functioning servers are on the same network (same switches, even). – cmhobbs Nov 18 '10 at 14:57
  • boom.. im stumped... any complaints in dmesg??? on the problem servers? – Arenstar Nov 18 '10 at 15:00
  • unless you changed something in sshd_config (the standard install config would not give you these problems :/). Are they brand new machines?? – Arenstar Nov 18 '10 at 15:01
  • Just got them a week ago, fresh install on new machines. The other ones that work are new machines as well. That's the real head scratcher here: these two boxes are the same vendor (not same model) as the other machines, and these are different models from one another, so I don't think it's a NIC issue. I think failing hardware would complain in syslog somewhere. – cmhobbs Nov 18 '10 at 15:17
  • Are you running an IPMI/LOM type of management on the same port? – Arenstar Nov 18 '10 at 17:11
  • None that I'm aware of, unless it shipped that way. They're all Dells. I'll dig through the BIOS. – cmhobbs Nov 18 '10 at 17:13
  • Additionally, DRAC isn't configured. – cmhobbs Nov 18 '10 at 17:13
1

You can get flaky connections with certain NIC/switch combos when autonegotiation is turned on and the link negotiates to half-duplex.

Use "ethtool eth0" to verify that the speed and duplex settings are correct, and to change them if you need to.
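
To pull out just the relevant lines, the ethtool output can be filtered. A minimal sketch, assuming the usual "Speed:", "Duplex:" and "Auto-negotiation:" labels in the output; `link_summary` is a hypothetical helper name and eth0 is taken from the command above.

```shell
# Hypothetical helper: read `ethtool <iface>` output on stdin and print
# only the speed, duplex and autonegotiation settings as key=value.
link_summary() {
    awk -F': *' '/^[ \t]*(Speed|Duplex|Auto-negotiation):/ {
        gsub(/^[ \t]+/, "", $1)   # strip the leading indentation
        print $1 "=" $2
    }'
}

# Usage:
#   ethtool eth0 | link_summary
# To force a setting (the switch port must be configured to match):
#   ethtool -s eth0 speed 100 duplex full autoneg off
```

A result showing Duplex=Half on a link that should be full-duplex would fit the failure mode described above.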

Bob