
Follow-Up: It looks like the rapid series of disconnects after a few months of running each server was probably coincidental and simply revealed the actual problem. The failure to reconnect is almost certainly due to the AliveInterval values (see kasperd's answer). Using the ExitOnForwardFailure option should allow the timeout to occur properly before reconnecting, which should solve the problem in most cases. MadHatter's suggestion (the kill script) is probably the best way to make sure that the tunnel can reconnect even if everything else fails.

I have a server (A) behind a firewall that initiates a reverse tunnel on several ports to a small DigitalOcean VPS (B) so I can connect to A via B's IP address. The tunnel has been working consistently for about 3 months, but has suddenly failed four times in the last 24 hours. The same thing happened a while back on another VPS provider - months of perfect operation, then suddenly multiple rapid failures.

I have a script on machine A that automatically executes the tunnel command (`ssh -R *:X:localhost:X address_of_B` for each port X), but when it executes, it reports `Warning: remote port forwarding failed for listen port X`.

The sshd log (/var/log/secure) on the server shows these errors:

bind: Address already in use
error: bind: Address already in use
error: channel_setup_fwd_listener: cannot listen to port: X

Solving requires rebooting the VPS. Until then, all attempts to reconnect give the "remote port forwarding failed" message and will not work. It's now to the point where the tunnel only lasts about 4 hours before stopping.

Nothing has changed on the VPS, and it is a single-use, single user machine that only serves as the reverse tunnel endpoint. It's running OpenSSH_5.3p1 on CentOS 6.5. It seems that sshd is not closing the ports on its end when the connection is lost. I'm at a loss to explain why, or why it would suddenly happen now after months of nearly perfect operation.

To clarify, I first need to figure out why sshd refuses to listen on the ports after the tunnel fails, which seems to be caused by sshd leaving the ports open and never closing them. That seems to be the main problem. I'm just not sure what would cause it to behave this way after months of behaving as I expect (i.e. closing the ports right away and allowing the script to reconnect).

jstm88
  • What is your question? How to address the port binding error, or how to find out why ssh is dying, or something else again? – MadHatter May 15 '14 at 14:40
  • I need to figure out why sshd is refusing to open the ports on the VPS (the bind error). The port binding error seems to be the root of the problem, and everything should work if I'm able to solve that. – jstm88 May 15 '14 at 14:47
  • For any late lurkers: instead of manually creating a script to keep the connection open, simply use autossh, which does this for you. http://serverfault.com/questions/598210/prevent-closing-of-ssh-local-port-forwarding/598300#598300 – oligofren May 24 '14 at 11:05

5 Answers


I agree with MadHatter that this is likely to be port forwardings from defunct ssh connections. Even if your current problem turns out to be something else, you can expect to run into such defunct ssh connections sooner or later.

There are three ways such defunct connections can happen:

  • One of the two endpoints got rebooted while the other end of the connection was completely idle.
  • One of the two endpoints closed the connection, but at the time the connection was closed there was a temporary outage on the link. The outage lasted for a few minutes after the connection was closed, so the other end never learned that the connection had been closed.
  • The connection is still completely functional at both endpoints of the ssh connection, but somebody has put a stateful device somewhere between them, which timed out the connection due to idleness. This stateful device would be either a NAT or a firewall; the firewall you already mentioned is a prime suspect.

Figuring out which of the three is happening is not terribly important, because there is one method that addresses all of them: keepalive messages.

You should look into the ClientAliveInterval keyword for sshd_config and the ServerAliveInterval keyword for ssh_config or ~/.ssh/config.
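
As a minimal sketch (the host alias and the interval and count values below are placeholders, not recommendations), the server side could set, in /etc/ssh/sshd_config:

ClientAliveInterval 30
ClientAliveCountMax 3

and the client side, in ~/.ssh/config (or via -o on the command line):

Host vps_B
    ServerAliveInterval 30
    ServerAliveCountMax 3

With settings like these, each end tears down a connection that stops responding after roughly interval × count seconds, so a defunct connection releases its port forwardings instead of holding them indefinitely.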

Running the ssh command in a loop can work fine. It is a good idea to insert a sleep in the loop as well, so that you don't end up flooding the server when the connection fails for some reason.

If the client reconnects before the connection has terminated on the server, you can end up in a situation where the new ssh connection is live but has no port forwardings. In order to avoid that, you need to use the ExitOnForwardFailure keyword on the client side.
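
Putting those pieces together, a sketch of the client-side loop (port 2222, the user, and the hostname are placeholders for your own values) might look like this:

while true
do
    ssh -N \
        -o ExitOnForwardFailure=yes \
        -o ServerAliveInterval=30 \
        -R '*:2222:localhost:2222' user@vps_B
    sleep 60   # don't flood the server if the connection keeps failing
done

ssh exits as soon as the remote side refuses the forwarding, the keepalives make a dead connection terminate on both ends, and the sleep rate-limits the retries.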

kasperd
  • I'm thinking this may be the problem. In particular, my script on A will try to reconnect to B if the ssh process dies (of course since the warning message doesn't kill the ssh process it just hangs when this happens, but that's a problem for another day). But if A tries to reconnect to B too quickly, B may be waiting for A to reconnect. I probably need to make sure B always times out before A reconnects. Combining that with MadHatter's suggestion of killing the sshd processes before reconnecting will probably cover 95% of possible cases. – jstm88 May 15 '14 at 15:29
  • And speaking of the warning message not killing SSH, that got me thinking... and looking at manpages. Turns out `-o ExitOnForwardFailure yes` is exactly what I needed. So that's one less thing I need to figure out. To think, I was going to write a Python script to parse for those warning messages. This is a lot simpler. :D – jstm88 May 15 '14 at 15:34
  • Sorry for forgetting about `ExitOnForwardFailure` when writing my answer. I have added it to the answer now. – kasperd May 15 '14 at 15:40
  • No problem, and it was actually `-o ExitOnForwardFailure=yes` (note the equal sign). So if anyone comes across this, don't copy and paste from my previous comment, it won't work. :P – jstm88 May 15 '14 at 15:42
  • So I've been monitoring the server for about 10 hours and it looks like it's running fine; I'm assuming at this point that this answer is correct (I'm about 99% sure based on what I've seen) and that the series of rapid disconnects was coincidence related to network issues that just happened to appear a few months after starting each service. Thanks to everyone for your help. ;) – jstm88 May 15 '14 at 16:41
  • Could you expand this answer. I have the same problem. None of the words such as ClientAliveInterval occur in the sshd_config file. From your answer I have no idea what I should do to solve the issue. – Kvothe Feb 03 '21 at 18:08

For me, when an ssh tunnel disconnects, it takes a while for the connection to reset, so the ssh process continues to block, leaving me with no active tunnels, and I don't know why. A workaround is to put ssh into the background with -f and spawn new connections without waiting for old connections to reset. -o ExitOnForwardFailure=yes can be used to limit the number of new processes, and -o ServerAliveInterval=60 improves the reliability of the current connection.

You can repeat the ssh command frequently, say in a cron job or in a loop in your script. For example, the following runs the ssh command every 3 minutes:

while true
do
    ssh -f user@hostname -R port:host:hostport -N -o ExitOnForwardFailure=yes -o ServerAliveInterval=60
    sleep 180
done
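
For the cron alternative mentioned above, a sketch of an equivalent crontab entry (the forwarding spec and hostname are the same placeholders as above, and key-based authentication is assumed so no passphrase prompt is needed) could be:

*/3 * * * * ssh -f user@hostname -R port:host:hostport -N -o ExitOnForwardFailure=yes -o ServerAliveInterval=60

ExitOnForwardFailure makes the extra copies exit immediately while a working tunnel still holds the port.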
Stephen Quan

You can find the process that's binding the port on that server with

sudo netstat -apn | grep -w X

It seems very likely to be the half-defunct sshd, but why make assumptions when you can have data? It's also a good way for a script to find a PID to send signal 9 to before trying to bring the tunnel up again.
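
As a rough sketch of such a kill script for the VPS (the port number is a placeholder, and as discussed in the comments you would probably want to add logging and rate-limiting before relying on it), something like this could run on B before the tunnel is brought up again:

#!/bin/sh
# must run as root so netstat reports the owning PID
PORT=2222   # the stuck forwarded port (placeholder)
# find the PID of whatever is still listening on that port (expected: the defunct sshd)
PID=$(netstat -apn | grep -w "$PORT" | grep LISTEN | awk '{print $NF}' | cut -d/ -f1 | head -n 1)
if [ -n "$PID" ]; then
    kill -9 "$PID"   # free the port so the new tunnel can bind
fi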

MadHatter
  • I remember checking that on the previous VPS provider, and I confirmed that sshd was the process listening to those ports. Next time it happens I'll check it here, but as the behavior and setup are exactly the same I don't expect it to be any different. – jstm88 May 15 '14 at 14:57
  • Great, so have your script that reopens the tunnel kill the old tunneller before trying to do so. – MadHatter May 15 '14 at 15:15
  • There's never more than one tunnel script (on A) running at once, if that's what you're saying. On the other hand, if you mean to have the script remotely execute a command on B to kill the stray processes... that's actually not a half bad idea. But one concern is repeatedly killing off all SSH connections while I'm trying to debug: if the script on A keeps killing connections on B due to a glitch, I don't want to be constantly kicked off of B by the rogue A script. :P I'll have to test to make sure it doesn't do that. But like I said, not a half bad idea. ;) – jstm88 May 15 '14 at 15:25
  • I hadn't thought there was. You say there's a script running on the remote server that tries to bring up a tunnel and fails, because of the bind error, and I'm assuming it only runs when you need it to (ie, when the existing tunnel is no good) because you haven't said otherwise. All I'm suggesting is that it kills off the specific process that's holding the port open before it tries to bring up the new tunnel. – MadHatter May 15 '14 at 15:35
  • The script running ssh is only on server A, server B is a plain vanilla server with no extra scripts. What I'll probably do is write a kill script to put on server B, then remotely call it from A if it fails to connect a certain number of times in a row. That way it's less likely to interfere with other SSH connections. And I'll probably have the kill script log each time it's run and exit without doing anything if it's called too many times too quickly. Personally, it seems like rate-limiting any script that kills sshd is probably prudent. :P – jstm88 May 15 '14 at 15:46
  • That all seems very reasonable. At any rate, I think you've got some very good suggestions in the answers (most of them better than mine!). – MadHatter May 15 '14 at 16:00
  • Eh, it was a team effort. ;) I'll probably accept kasperd's answer since it more directly solves the original problem, but your suggestion is really good and might avoid a lot of problems in the long run. It's times like these when I find myself wishing there was a way to accept multiple answers. – jstm88 May 15 '14 at 16:19
  • grin - why thank you, and I think you're right to accept kasperd's answer. – MadHatter May 15 '14 at 16:35

Judging by your description, the other answers here are more useful.
But for others who notice problems with port forwarding to a specific server and end up on this page:

In many cases the problem is that the SSH server is configured to block port forwarding. To change this:

  • Open the sshd configuration file (/etc/ssh/sshd_config on most systems).
  • If there is a line AllowTcpForwarding no, remove it.
    Depending on your version of OpenSSH, you may also have to explicitly add AllowTcpForwarding yes
    (when in doubt, do it; see the example below).
  • Restart the SSH server.
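
For example, after editing /etc/ssh/sshd_config so that it contains:

AllowTcpForwarding yes

restart sshd, e.g. with service sshd restart on the asker's CentOS 6, or systemctl restart sshd (or ssh, depending on the distribution) on systemd-based systems.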
Garo

In my experience, ssh has a slightly irksome habit of not exiting cleanly if 'something' is still running on the remote system, e.g. a process started in the background. You can reproduce this with:

ssh <server>
while true; do  sleep 60; done&
exit

Your ssh will log out, but won't actually close the session until the remote process exits (which it won't, because it's a 'while true' loop). It may be that something similar is happening here: your session has a 'stuck' process that was spawned by ssh. The port remains in use, and therefore it cannot be re-used by your local process.

Sobrique
  • The complete SSH command that executes on the A machine is `ssh -o ConnectTimeout=10 -o BatchMode=yes -gnN -R *:X:localhost:X root@$TUNSRV 1>>tunnel.log 2>&1 &` so there's nothing being executed by SSH except the tunnel itself, specifically due to the -N option. Whatever is being kept open is being done on the remote server B using sshd itself. – jstm88 May 15 '14 at 14:55