What code will recover from ** Removing (timedout) connection **?

Question

I have a gen_server on one system and 4 clients on 4 other systems. The code runs as expected for 3 or 4 days, and then the gen_server reports "** Removing (timedout) connection **". Because the clients can become active before or after the gen_server starts, the clients execute this code prior to every call to the gen_server:

connect_IPDB() ->
    % Try every 5 seconds to connect to the server
    case net_kernel:connect_node(?SERVER) of
        % When connected, wait an additional 5 seconds for stability
        true -> timer:sleep(5000);
        false ->
            timer:sleep(5000),
            connect_IPDB()
    end.

This works as anticipated when bringing up the server or a client, in any order. They all connect and show up in nodes() when executed on the server.

Here is the problem. Some time after the "** Removing (timedout) connection **" error, nodes() shows all of the nodes, implying that the client is not hung and has executed the above code. However, communication with the timed-out node has not resumed. How can I reestablish the connection short of restarting the client? BTW, restarting the client does fix the issue.

Any help is appreciated.

This whole idea is sort of weird to me, but anyway... Try having the clients monitor one of the server processes once they connect, and have the client exit (crash) when it receives the `'DOWN'` message from that monitor. Place the client under a supervisor that restarts it at the point of trying to connect as above. — zxq9, Aug 06 '15 at 15:25
Thanks, for the suggestion. Seems over the top complicated. So, not my first choice, but maybe my only one :-). — Bill Ott, Aug 06 '15 at 18:23
If you're writing Erlang this is *by far* the least complicated of all solutions. Erlang even has a standard behavior built around this (called [`supervisor`](http://www.erlang.org/doc/man/supervisor.html)). Telling it to manage your client is *the normal way things are done*. You can [write your own supervisor/manager](https://github.com/zxq9/erlmud/blob/master/erlmud-0.1/locman.erl) in raw Erlang by hand to get a grip on how things work, if you like, but using the supervisors OTP provides is drastically less code (but a bit of reading the first time around). — zxq9, Aug 06 '15 at 20:07
You are probably right. This was my first outing with supervisors and applications using rebar. I was amazed at how well it worked for the amount of code I wrote. This almost seems like a bug, or my error. When the node times out, will the nodes() command on the server show that it is no longer connected? If so, then my connect() routine is reconnecting the node but communication is still blocked. If not, then it's working as intended and my routine is not really connecting, for reasons I don't understand. I kinda feel that using the supervisor to restart is ignoring the problem. — Bill Ott, Aug 06 '15 at 20:53
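
The monitor-and-crash idea from the comments could look roughly like the sketch below. It is not the poster's code: the module name `server_watch` and the function `watch/1` are hypothetical, and it assumes a supervisor (not shown) restarts the client at its connect step once the watcher exits.

```erlang
-module(server_watch).
-export([watch/1]).

%% Hypothetical helper: blocks until the monitored server process goes
%% away, then exits so a supervisor can restart the client.
%% ServerRef is a pid or a {RegisteredName, Node} tuple.
watch(ServerRef) ->
    MRef = erlang:monitor(process, ServerRef),
    receive
        {'DOWN', MRef, process, _Object, Reason} ->
            %% Reason is noconnection when the connection to the
            %% server's node is lost, noproc if the process never existed.
            exit({server_down, Reason})
    end.
```

A client would typically run `server_watch:watch/1` in a linked process, so that the exit propagates to the client and its supervisor sees a failure.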

Bill Ott · Accepted Answer · 2015-08-11T22:18:15.713

I finally figured out the problem and the solution. The timeout occurred when the clients in question were paused (they are VMs) so they could be backed up. Because they were only paused, the supervisor in each client saw no failure when the VM was unpaused, so it did not restart the program.

The fix was to change connect_IPDB to:

connect_IPDB() ->
    % See if we are connected to the server. Is the server in the list?
    case lists:filter(fun(X) -> string:str(atom_to_list(X), atom_to_list(?SERVER)) == 1 end, nodes(connected)) of
        % If empty, the server is not in the list, so enter the reconnect loop
        [] ->
            connect_IPDB("Reconnect");
        % Anything else means we are connected, so proceed
        _ -> ok
    end.

connect_IPDB(_Reconnect) ->
    case net_kernel:connect_node(?SERVER) of
        % When connected, wait an additional 5 seconds for stability
        true ->
            timer:sleep(5000),
            Ips = gen_server:call({global, ?SERVER}, getall_ips),
            % Re-initialize the iptables
            removechain(),
            createchain(),
            % Load the Ips into the local iptables f2bchain
            load_ips(Ips),
            % Restart ntpd
            os:cmd("service ntpd restart");
        false ->
            timer:sleep(5000),
            connect_IPDB("Reconnect")
    end.
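
The membership test in connect_IPDB/0 compares node names as strings and accepts any connected node whose name starts with the server's name. If ?SERVER holds the full node name (which it must for net_kernel:connect_node/1 to work), an exact match with lists:member/2 may be simpler. This is a minimal sketch with a hypothetical module name `node_check`; note that it is stricter than the prefix test above, and that `string:str/2` is listed as obsolete in newer OTP releases.

```erlang
-module(node_check).
-export([is_connected/1, is_connected/2]).

%% True if Node is among the nodes this node is currently connected to.
is_connected(Node) ->
    is_connected(Node, nodes(connected)).

%% Same check against an explicit list, which makes it easy to test.
is_connected(Node, Connected) when is_atom(Node), is_list(Connected) ->
    lists:member(Node, Connected).
```

With this helper, the first function could branch on `node_check:is_connected(?SERVER)` instead of filtering the node list.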

This has the additional advantage of resetting the client clock (by restarting ntpd) when the client comes out of the pause.

I am leaving the supervisor in place to handle "real" failures versus this self-induced one.
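
For readers who have not set one up, a supervisor that restarts the client on real failures might look roughly like this. It is only a sketch under assumptions: the module names `ipdb_client_sup` and `ipdb_client` are hypothetical (the actual client module is not shown in this thread), the restart limits are arbitrary, and the map-based flags require OTP 18 or later.

```erlang
-module(ipdb_client_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Allow at most 5 restarts within 60 seconds (arbitrary limits).
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 60},
    %% ipdb_client is a hypothetical worker module exporting start_link/0.
    Child = #{id => ipdb_client,
              start => {ipdb_client, start_link, []},
              restart => permanent,
              shutdown => 5000,
              type => worker},
    {ok, {SupFlags, [Child]}}.
```

A supervisor like this only reacts when the worker process actually exits, which is consistent with the paused-VM case above going unnoticed.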
