SSH connection with key is unreliable

Question

I'm an intern sysadmin at a small company. There's no real sysadmin here for me to ask when I run into problems. Thanks for the help.

The company uses Nagios to monitor its web server. They use connect_by_ssh to do so, with public and private keys. The problem is that sometimes it works and sometimes it doesn't. Anyone can always log in using a name and password; it's just the keys that don't always work.

Some log output:

Jan 16 13:23:10 localhost nagios3: SERVICE ALERT: Server02;SSH;CRITICAL;SOFT;1;Connection timed out
Jan 16 13:24:10 localhost nagios3: SERVICE ALERT: Server02;SSH;CRITICAL;SOFT;2;Connection timed out
Jan 16 13:24:50 localhost nagios3: SERVICE ALERT: Server02;SSH;OK;SOFT;3;SSH OK - OpenSSH_5.3 (protocol 2.0)
Jan 16 14:15:10 localhost nagios3: SERVICE ALERT: Server02;SSH;CRITICAL;SOFT;1;Connection timed out
Jan 16 14:15:50 localhost nagios3: SERVICE ALERT: Server02;SSH;OK;SOFT;2;SSH OK - OpenSSH_5.3 (protocol 2.0)

Just to be sure, even though SSH works with user/password:

nmap server02.8p-hosting.com

Starting Nmap 5.00 ( http://nmap.org ) at 2014-01-16 14:16 EST
Interesting ports on abc.domain.com (xxx.xxx.xxx.xxx):
Not shown: 971 closed ports
PORT     STATE    SERVICE
22/tcp   open     ssh

Here's how it looks in a regular week:

[screenshot: ssh this week]

What could it be?

Log/Debug

ssh -vvv root@abc.domain.com
OpenSSH_5.5p1 Debian-6+squeeze4, OpenSSL 0.9.8o 01 Jun 2010
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to abc.domain.com [xxx.xxx.xxx.xxx] port 22.
debug1: connect to address xxx.xxx.xxx.xxx port 22: Connection timed out
ssh: connect to host abc.domain.com port 22: Connection timed out

It's hard to say without seeing the command it executes. Is there a reason they opted to use connect_by_ssh over NRPE? – CIA Jan 16 '14 at 19:43
I actually prefer check_by_ssh for a lot of things. It has the advantage of being able to execute an event handler on the target system. – dmourati Jan 16 '14 at 19:47
@CIA I asked my boss, who's the one who installed it a while ago. He didn't even know he was using connect_by_ssh... but he doesn't want me to change it. When I look at last year's data for SSH, it's down about 22% of the time. Here's what I got when running it on the command line with the -vvv option: – littleadmin Jan 16 '14 at 19:47
ssh -vvv root@abc.domain.com OpenSSH_5.5p1 Debian-6+squeeze4, OpenSSL 0.9.8o 01 Jun 2010 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug2: ssh_connect: needpriv 0 debug1: Connecting to abc.domain.com [xxx.xxx.xxx.xxx] port 22. debug1: connect to address xxx.xxx.xxx.xxx port 22: Connection timed out ssh: connect to host abc.domain.com port 22: Connection timed out – littleadmin Jan 16 '14 at 19:49
It's probably the web server's SSH setup. By default, sshd wants to do a reverse DNS lookup, which can take a long time. – hookenz Jan 16 '14 at 20:34
@Matt They didn't give me access to the web server; I think they're afraid I'll set it on fire. But I'll ask my supervisor (after explaining to him why I want to do that) to let me see the web server's SSH setup. – littleadmin Jan 16 '14 at 20:38
Or, if you can add a reverse DNS entry for your Nagios server, that may help. – hookenz Jan 16 '14 at 20:40
Slow connections to SSH are 99% of the time due to the reverse DNS lookup, which is on by default. – hookenz Jan 16 '14 at 20:40
The log screenshot shows the issues appearing around normal business hours. This could indicate the server is having I/O issues during normal business hours, and is therefore causing SSH to report a slow/dead connection. This is a common issue if the web server is also acting as the everything-server. – CIA Jan 16 '14 at 20:56

rfelsburg · Answer 1 · 2014-01-16T20:28:39.697

0

Unfortunately, it could be any number of things. The first thing I'd do is turn up the logging on the SSH server to 'DEBUG'.

Also, I'm assuming you mean that they're using check_ssh to monitor the SSH server on the boxes. Inside Nagios, there are a couple of things you can do to see exactly what command is being executed. If you have SSH access to the Nagios server, you can just log in and look at the Nagios services.cfg to find exactly which plugin is being called, with exactly which switches.

Then look at commands.cfg to see what it's executing. Then try using that command to test the SSH server manually from the command line.

The other way is to use the Nagios web interface. At the bottom of the nav bar on the left is a configuration link. Click on it, then use the dropdown to go to services and find exactly which plugin is being called for that service. Next, use the dropdown to go to command expansion and get the command that way. Then check manually.
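Whichever way you get the command definition, it will still contain Nagios macros such as $HOSTADDRESS$ that have to be filled in before you can run it by hand. Here is a minimal Python sketch of that substitution. The helper names expand_macros and to_argv are hypothetical, 192.0.2.10 is a placeholder address, and real Nagios macro handling covers more cases than this does.

```python
import re
import shlex


def expand_macros(command_line, macros):
    """Replace $NAME$ macros with values from `macros`.

    Unknown macros are left untouched so they are easy to spot.
    This is only an approximation of what Nagios itself does.
    """
    def substitute(match):
        return macros.get(match.group(1), match.group(0))

    return re.sub(r"\$([A-Z0-9_]+)\$", substitute, command_line)


def to_argv(command_line, macros):
    """Expand macros, then split into an argument list using POSIX shell quoting rules."""
    return shlex.split(expand_macros(command_line, macros))
```

The list returned by to_argv can be handed to subprocess.run on the Nagios server, so the plugin runs with the same switches Nagios uses.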

Lastly, check whether SELinux is enabled. If so, the SELinux context on the file probably needs to be changed. If you're using something like Puppet or Chef, it's possible it's fighting over the file, so that it gets fixed and then broken again.

UPDATE:

I would try adding -E and/or -S to the check_by_ssh command. Sometimes odd SSH output can mess up the connection if it thinks it should be waiting. Also, adding -v will give you an indication of what's going on.

edited Jan 16 '14 at 20:28

answered Jan 16 '14 at 19:43

rfelsburg

check_ssh /usr/lib/nagios/plugins/check_ssh -t 30 '$HOSTADDRESS$' is the command being used. Since it's working at the moment, I can't give you the failing output from running it on the command line. – littleadmin Jan 16 '14 at 20:00
Exactly which check is failing? For what service? Are all Nagios checks failing intermittently, or just a specific service check? – rfelsburg Jan 16 '14 at 20:17
Just the SSH connection, and the checks that need that connection to work, like check_remote_disk (since we are using connect_by_ssh instead of NRPE). – littleadmin Jan 16 '14 at 20:23
So it sounds like you have multiple services failing, and they're all using check_by_ssh, correct? I didn't think there was a connect_by_ssh. Also, with check_by_ssh, the following has worked for me with intermittent failures in the past: adding -E to the check_by_ssh command ignores the stderr output, which it may be hanging on; -S ignores stdout in the same way. Also, add the -v option to the check_by_ssh command. – rfelsburg Jan 16 '14 at 20:27
You're right, it's check_by_ssh, my mistake; I'm a bit under pressure here, lol. The -E is already there; I'll add -v and -S. Thanks :) – littleadmin Jan 16 '14 at 20:35
No problem, report back and let us know if it worked, and if not, we'll keep working on it :-) – rfelsburg Jan 16 '14 at 23:50
Well, after my boss sent a ticket to the company hosting our web server, everything went back to normal... we are close to 24 hours without any interruption. Sadly, I have no clue what they did on their side, so I can't help further with the resolution of this problem. Thanks a lot for the help. – littleadmin Jan 17 '14 at 18:47

score 0 · Answer 2 · answered Jan 16 '14 at 19:46

0

This looks more like a timeout issue than anything to do with SSH itself.

Take a look at your nagios checks.

You probably want to add a -t option to check_by_ssh:

 -t, --timeout=INTEGER
    Seconds before connection times out (default: 10)

You should probably also check service_check_timeout in your nagios.cfg.

Mine is set to 60s.

http://nagios.sourceforge.net/docs/nagioscore/3/en/configmain.html
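To tell a plugin timeout apart from a network-level timeout, it can help to time the bare TCP connection to port 22 from the Nagios host, with no SSH or keys involved. Here is a minimal Python sketch. The helper names probe_tcp and summarize are hypothetical, and the 10-second default only mirrors the plugin default quoted above.

```python
import socket
import time


def probe_tcp(host, port=22, timeout=10.0):
    """Attempt one TCP connection; return (ok, seconds_elapsed, detail)."""
    start = time.monotonic()
    try:
        # Only the TCP handshake is tested here; no SSH protocol, no keys.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, "connected"
    except socket.timeout:
        return False, time.monotonic() - start, "timed out"
    except OSError as exc:
        # Refused connections, unreachable hosts, DNS failures, etc.
        return False, time.monotonic() - start, str(exc)


def summarize(results):
    """Return (failures, total, slowest_seconds) for a list of probe_tcp results."""
    failures = sum(1 for ok, _, _ in results if not ok)
    slowest = max((elapsed for _, elapsed, _ in results), default=0.0)
    return failures, len(results), slowest
```

Calling probe_tcp on a schedule (say once a minute) and passing the collected results to summarize would show whether plain connections fail as well. If they do, a larger -t value is unlikely to help on its own.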

answered Jan 16 '14 at 19:46

dmourati

The timeout is 30 seconds, with 4 tries. – littleadmin Jan 16 '14 at 20:07

score 0 · Answer 3 · answered Feb 15 '14 at 00:31

I've seen this before as a DNS issue.

Perhaps the rDNS lookup times out (as noted in comments above) or perhaps the server is actually several servers using round-robin DNS (multiple A records for one domain name) and only a subset of the servers is offline, not running SSH, or otherwise fails the test.
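Both possibilities can be checked with a few lines of Python; this is a rough sketch, not a definitive test. The helper names resolve_ipv4, timed_reverse_lookup and check_each_address are hypothetical. Since the reverse lookup is done by the SSH server on the connecting client's address, timed_reverse_lookup is most meaningful when given the Nagios server's IP and run from the web server's side.

```python
import socket
import time


def resolve_ipv4(hostname, port=22):
    """Return the sorted, de-duplicated IPv4 addresses behind `hostname`."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def timed_reverse_lookup(ip):
    """Return (ptr_name_or_None, seconds) for a reverse DNS lookup of `ip`."""
    start = time.monotonic()
    try:
        name = socket.gethostbyaddr(ip)[0]
    except OSError:
        # No PTR record, or the resolver failed.
        name = None
    return name, time.monotonic() - start


def check_each_address(hostname, port=22, timeout=10.0):
    """Try a TCP connection to every resolved address; return {ip: True/False}."""
    status = {}
    for ip in resolve_ipv4(hostname, port):
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                status[ip] = True
        except OSError:
            status[ip] = False
    return status
```

If resolve_ipv4 returns more than one address and check_each_address reports a mix of True and False, round-robin DNS with a partly broken pool would explain checks that only fail some of the time.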
