
We have a client-server setup where the client sets up an SSH tunnel and uses port forwarding to send data to the server:

ssh -N -L 5000:localhost:5500 user@serveraddress

The normal number of SSH connections at the server is ~150, and under normal conditions the server software processes incoming connections quickly (a few seconds at most).

However, we have recently noticed that the number of SSH connections sometimes rises to 900+. At that point, the server software sees and accepts incoming connections, but no data comes in.

Has anyone seen such symptoms with SSH before? Any ideas on what the issue could be?

Server OS: Red Hat Linux 5.5
Firewall: Disabled
Key Exchange: Tested

EDIT: Adding parts of log data from /var/log/secure on the server side

There seems to be a lot of the following in the log file.

Apr 10 00:07:33 myserver sshd[15038]: fatal: Write failed: Connection timed out
Apr 10 00:12:01 myserver sshd[5259]: fatal: Read from socket failed: Connection reset by peer
Apr 10 00:44:48 myserver sshd[17026]: fatal: Write failed: No route to host
Apr 10 02:09:16 myserver sshd[10398]: fatal: Read from socket failed: Connection reset by peer
Apr 10 02:22:47 myserver sshd[24581]: fatal: Read from socket failed: Connection reset by peer
Apr 10 03:05:57 myserver sshd[12003]: fatal: Read from socket failed: Connection reset by peer
Apr 10 03:23:19 myserver sshd[22421]: fatal: Write failed: Connection timed out
Apr 10 08:13:43 myserver sshd[31993]: fatal: Read from socket failed: Connection reset by peer
Apr 10 08:36:39 myserver sshd[7759]: fatal: Read from socket failed: Connection reset by peer
Apr 10 09:02:32 myserver sshd[12470]: fatal: Write failed: Broken pipe
Apr 10 12:08:05 myserver sshd[728]: fatal: Write failed: Connection reset by peer
Apr 10 12:35:53 myserver sshd[6184]: fatal: Read from socket failed: Connection reset by peer
Apr 10 12:43:14 myserver sshd[2663]: fatal: Write failed: Connection timed out

NOTE: After about 10-15 minutes at 900+ connections, the system recovers by itself: the connection count drops back to the normal range and the server starts receiving data again. It sounds like a DoS/DDoS, but this is on an internal network.

ADDENDUM: Checked the connection states based on @kranteg's question. We just had another outage, and these are the results from a script I wrote that tallies all incoming SSH connections:

===                                                        
Tue Apr 15 12:22:07 EDT 2014 -> Total SSH connections: 996 
===                                                        
0 SYN_SENT                                             
1 SYN_RECV                                             
0 FIN_WAIT1                                            
0 FIN_WAIT2                                            
15 TIME_WAIT                                            
0 CLOSED                                               
760 CLOSE_WAIT                                           
143 ESTABLISHED                                          
77 LAST_ACK                                             
0 LISTEN                                               
0 CLOSING                                              
0 UNKNOWN                                              
===                                                        
===
Tue Apr 15 12:22:17 EDT 2014 -> Total SSH connections: 977
===
0 SYN_SENT
2 SYN_RECV
1 FIN_WAIT1
0 FIN_WAIT2
15 TIME_WAIT
0 CLOSED
756 CLOSE_WAIT
127 ESTABLISHED
76 LAST_ACK
0 LISTEN
0 CLOSING
0 UNKNOWN
===
===
Tue Apr 15 12:22:26 EDT 2014 -> Total SSH connections: 979
===
0 SYN_SENT
2 SYN_RECV
1 FIN_WAIT1
0 FIN_WAIT2
12 TIME_WAIT
0 CLOSED
739 CLOSE_WAIT
148 ESTABLISHED
77 LAST_ACK
0 LISTEN
0 CLOSING
0 UNKNOWN
===

It looks like there is a jump in the number of connections in CLOSE_WAIT. During "normal" operation, the number in CLOSE_WAIT is either 0 or very close to it.
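The tallying script itself isn't shown; a minimal sketch of the same per-state count (written here in Python with made-up netstat lines; the original script may differ) could look like this:

```python
from collections import Counter

# The twelve TCP states reported above, in the same order.
STATES = [
    "SYN_SENT", "SYN_RECV", "FIN_WAIT1", "FIN_WAIT2", "TIME_WAIT", "CLOSED",
    "CLOSE_WAIT", "ESTABLISHED", "LAST_ACK", "LISTEN", "CLOSING", "UNKNOWN",
]

def count_ssh_states(netstat_lines, port=22):
    """Tally TCP connection states for a given local port.

    Expects lines in `netstat -ant` format:
    proto recv-q send-q local-address foreign-address state
    """
    tally = Counter()
    for line in netstat_lines:
        fields = line.split()
        if len(fields) >= 6 and fields[3].endswith(f":{port}"):
            tally[fields[5]] += 1
    return tally

# Demo on two hypothetical captured lines (addresses made up):
sample = [
    "tcp        0      0 10.1.1.5:22      10.2.3.100:40112    CLOSE_WAIT",
    "tcp        0      0 10.1.1.5:22      10.4.5.100:40388    ESTABLISHED",
]
tally = count_ssh_states(sample)
print(f"Total SSH connections: {sum(tally.values())}")
for state in STATES:
    print(f"{tally[state]:5d} {state}")
```

On the server this would be fed from `netstat -ant` on a timer, printing a stanza like the ones above every few seconds.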

Sagar
  • Please let me know if you need any more information, or if this belongs on another site (stackoverflow, for example) – Sagar Apr 10 '14 at 14:58
  • Do your logs have any relevant information ? – user9517 Apr 10 '14 at 15:11
  • The server software logs show nothing except the incoming connections and them being accepted. /var/log/messages shows nothing unusual. I've asked the server owner to check/send me the /var/log/secure log to check for SSH issues. – Sagar Apr 10 '14 at 15:40
  • Maybe not relevant, but I had a similar problem when I was programmatically creating tunnels from a Perl script. The issue was that some tunnels failed to shut down properly, so I rewrote it to use `IPC::Run` as per this question: http://stackoverflow.com/questions/13668235/capturing-output-from-a-subscript-with-a-timeout – Jarmund Apr 10 '14 at 20:27
  • Did you look at the client side? I'm thinking that if you have a network error, the client creates a new ssh connection. Did you check that the old connection is closed? If not, the new one will not be able to bind the local port because it's already in use. I've had this kind of trouble with ssh tunneling before, and it fits here: you have 900 ssh connections, which is 6×150; you would get 6 ssh connections per client after 6 network errors. – kranteg Apr 11 '14 at 12:15
  • @Jarmund thanks! I will have to look into this. For us, it is a C application that is calling ssh to create the tunnel – Sagar Apr 11 '14 at 19:26
  • @kranteg the number is not always 900. It varies. Sometimes it's 700+, and at least once it went up to 1100+. However, after about 10-15 minutes it recovered by itself. Also, if the client side is trying to forward the same port twice, it will fail the second time - this is not the case. I will have to look into the closing of the connections, though. Thanks! – Sagar Apr 11 '14 at 19:28
  • Maybe running out of memory? That would explain that sockets get established (as handled by kernel) but application not working. – LatinSuD Apr 11 '14 at 21:33
  • Can you check two things with netstat: the source of the ssh connections (are they from known clients?) and the state of the ssh connections (ESTABLISHED, TIME_WAIT, CLOSE_WAIT?). Your last edit shows errors that look like network errors. – kranteg Apr 14 '14 at 13:44
  • @kranteg I shall check that asap and let you know. I'll edit my question with the results. – Sagar Apr 14 '14 at 14:11
  • @kranteg just had an "outage" and I've recorded and posted the answer to your question. – Sagar Apr 15 '14 at 15:58
  • CLOSE_WAIT is often due to network trouble (like a gateway being temporarily unreachable). You told us that you are on an internal network, but are your clients on the same network as the server, or do they use different networks/gateways? I ask because your sshd logs show messages like "no route to host". – kranteg Apr 16 '14 at 11:44
  • @kranteg The clients are on various subnets, 10.X.Y.100. The X and Y range between 1 and 100: therefore 10.1.1.100 through 10.100.100.100. We've been wondering the same thing, but are out of ideas on how to prove it. – Sagar Apr 16 '14 at 13:20
  • How many gateways have you got on your server? If you have more than one, you have to find which one is failing. You can use netstat to find which network has trouble (the one with connections in CLOSE_WAIT). Have you ever had one of your own ssh connections (an admin one) hang or freeze? – kranteg Apr 16 '14 at 13:35
  • @Sagar Have you fixed the problem? Could you follow up with an answer to your issue? – randunel Aug 27 '14 at 19:47
  • @randunel we only recently found a "solution", and even then I don't know if it is the right thing, but I'll post it below. – Sagar Aug 29 '14 at 14:04

2 Answers


I don't know if this is the correct solution, but it worked for us. Hopefully it will at least point others in the right direction, even if it doesn't solve it completely.

We noticed that every time we had an outage, CPU usage was near 100%. This, in turn, was caused by another application batch-processing certain files and using most of the CPU. We turned that process off and have not had a single outage since. I honestly don't know whether this was the root cause, but it has helped us.

Sagar

It sounds like your client application initiating the tunnels may not be closing the connections properly after finishing its write operation.
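That would also match the CLOSE_WAIT pile-up in the question: at the TCP level, a socket sits in CLOSE_WAIT on the side that has received the peer's FIN but has not yet called close() itself, so a pile-up points at an application holding sockets open rather than at the network. A minimal self-contained loopback sketch (illustrative only, not the actual application code):

```python
import socket

# One side closes; the other side's socket then sits in CLOSE_WAIT
# until its owner also calls close().
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # ephemeral port
listener.listen(1)
port = listener.getsockname()[1]

client = socket.create_connection(("127.0.0.1", port))
conn, _ = listener.accept()

client.close()        # peer sends FIN
data = conn.recv(1)   # returns b"" (EOF); `conn` is now in CLOSE_WAIT
# While the program holds `conn` here, `netstat -ant` shows it as CLOSE_WAIT.
# An application that forgets the next call leaks one such socket per connection:
conn.close()
listener.close()
```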

kalikid021
  • We checked this. The clients creating the tunnels should be closing the connections based on the code we reviewed. – Sagar Apr 15 '14 at 15:59