0

We are using the Apache Zookeeper Client C bindings in our application. Client library version is 3.5.1. When the Zookeeper connection gets disconnected, the application is configured to exit with error code 116.

Systemd is used to automate starting and stopping the application. The unit file does not override the default kill settings (KillMode=control-group, KillSignal=SIGTERM), so stopping the unit sends SIGTERM to the application.

When the process is stopped with the systemctl stop command, the Zookeeper client threads appear to attempt to reconnect to Zookeeper:

2016-04-12 22:34:45,799:4506(0xf14f7b40):ZOO_ERROR@handle_socket_error_msg@2363: Socket [128.0.0.4:61758] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2016-04-12 22:34:45,799:4506(0xf14f7b40):ZOO_INFO@check_events@2345: initiated connection to server [128.0.0.4:61758]
Apr 12 22:34:45 main thread:   zookeeperWatcher: event type ZOO_SESSION_EVENT state ZOO_CONNECTING_STATE path
2016-04-12 22:34:45,801:4506(0xf14f7b40):ZOO_INFO@check_events@2397: session establishment complete on server [128.0.0.4:61758], sessionId=0x40000015b8d0077, negotiated timeout=20000
2016-04-12 22:34:46,476:4506(0xf14f7b40):ZOO_WARN@zookeeper_interest@2191: Delaying connection after exhaustively trying all servers [128.0.0.4:61758]
2016-04-12 22:34:46,810:4506(0xf14f7b40):ZOO_INFO@check_events@2345: initiated connection to server [128.0.0.4:61758]
2016-04-12 22:34:46,811:4506(0xf14f7b40):ZOO_ERROR@handle_socket_error_msg@2382: Socket [128.0.0.4:61758] zk retcode=-112, errno=116(Stale file handle): sessionId=0x40000015b8d0077 h

Because of this, the process exits with an error code. Systemd sees the failure code on exit and does not attempt to restart the application. Does anyone know why the client is getting disconnected?

I am aware that I can work around this by setting SuccessExitStatus=116 in the unit file, but I don't want to mask out genuine errors. I have tried registering a signal handler for SIGTERM and closing the Zookeeper client in the handler. But the handler code never seems to get hit when I issue systemctl stop.

EDIT: The handler wasn't getting called because I had made it asynchronous - it didn't execute immediately upon receiving signal. OTOH the process exits immediately upon Zookeeper disconnect.
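
For reference, a minimal sketch of the flag-based pattern discussed here, written as an assumption about how the application could be structured rather than as its actual code. The handler only records that SIGTERM arrived, because calls such as zookeeper_close are not generally safe inside a signal handler. The names stop_requested, install_sigterm_handler and exit_code_on_disconnect are hypothetical. The disconnect path consults the flag, so exit code 116 is reported only for disconnects that were not preceded by a stop request.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Set by the SIGTERM handler; sig_atomic_t can be written safely from a handler. */
static volatile sig_atomic_t stop_requested = 0;

/* Handler: only record the request. No Zookeeper or other library calls here. */
static void on_sigterm(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Install the handler with sigaction(); returns 0 on success, -1 on error. */
int install_sigterm_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/*
 * Hypothetical helper: called from the code path that currently exits when
 * the Zookeeper connection is lost. A disconnect seen after a stop request
 * is treated as part of a clean shutdown; any other disconnect keeps the
 * application's error code 116.
 */
int exit_code_on_disconnect(void)
{
    return stop_requested ? EXIT_SUCCESS : 116;
}
```

The main loop would poll stop_requested, call zookeeper_close on the client handle from normal (non-handler) context, and then exit with status 0. This only helps if the signal is delivered before the disconnect is observed.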

Bug Killer
  • This is why "no longer reproduced", OP in chat stated "I figured out what was happening in my situation, and it is not related to the description I gave. Basically someone in my org was running a script that was killing a connection, unknown to me", hence also last vote of OP – Petter Friberg Jun 18 '16 at 22:37

3 Answers

0

What happens when you install the handler for SIGTERM and issue systemctl stop? If nothing happens, you may have a signal mask blocking the signal (I guess not). If the application keeps exiting with the same error code, I would suggest making sure that the signal handler is being installed correctly.
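
Both checks can be done from inside the process. The following is a rough sketch; the function names sigterm_is_blocked and sigterm_has_handler are made up for illustration. It uses sigprocmask, which reports the mask of the calling thread on Linux; in a multithreaded program such as one using the Zookeeper C client threads, pthread_sigmask is the portable call and each thread has its own mask.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Returns 1 if SIGTERM is blocked in the caller's signal mask, 0 if not, -1 on error. */
int sigterm_is_blocked(void)
{
    sigset_t current;
    /* With a NULL new set, sigprocmask() only reports the current mask. */
    if (sigprocmask(SIG_BLOCK, NULL, &current) != 0)
        return -1;
    return sigismember(&current, SIGTERM);
}

/* Returns 1 if a custom SIGTERM handler is installed, 0 if the
 * disposition is default or ignore, -1 on error. */
int sigterm_has_handler(void)
{
    struct sigaction current;
    /* With a NULL new action, sigaction() only reports the current one. */
    if (sigaction(SIGTERM, NULL, &current) != 0)
        return -1;
    if (current.sa_flags & SA_SIGINFO)
        return 1; /* a three-argument handler was registered */
    return current.sa_handler != SIG_DFL && current.sa_handler != SIG_IGN;
}
```

Logging both results at startup, after the handler is supposed to have been registered, would show whether the process is in the expected state before systemctl stop is issued.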

Nevado
0

This is working as expected. It is the application writer's responsibility to specify how to shut the service down gracefully. If you don't want the default, which sends SIGTERM, you can use ExecStop to define your own stop command in the unit file:

ExecStart=/usr/bin/app
ExecStop=/usr/bin/app -stop

For details see docs at https://www.freedesktop.org/software/systemd/man/systemd.service.html#ExecStop=

fluter
  • I don't want to change the behavior. My question is, why is the Zookeeper client library causing this error code. This happens even with a handler registered – Bug Killer Apr 13 '16 at 00:37
  • The log you posted shows that the Zookeeper library detected a network failure while receiving, so it initiated a reconnection to the server; there is nothing wrong with that. It says nothing about signals. The Zookeeper library is not "causing" the error; it is handling the network failure as expected. – fluter Apr 13 '16 at 00:50
  • But this "network failure" is happening only when I issue a systemctl stop – Bug Killer Apr 13 '16 at 01:14
  • It could be that the signal interrupted the recv routines and led to the reconnection. Still, there is nothing wrong with the Zookeeper client; it works as expected, and the application needs to take care of this case. – fluter Apr 13 '16 at 01:16
  • There is another way, use SIGKILL in the unit file, so that it will kill your app entirely, or you could catch SIGTERM, and in it, call zookeeper shutdown so that zookeeper will be cleaned up. – fluter Apr 13 '16 at 01:18
  • I did exactly the latter: I caught SIGTERM and called zookeeper_close in my signal handler. I even verified that the signal handler gets called and closes the connection when I do `kill -TERM pid`. But it doesn't get called when I do `systemctl stop app`. Instead, my code that exits the process on Zookeeper connection failure gets called – Bug Killer Apr 13 '16 at 01:26
  • Can you show the code please? – fluter Apr 13 '16 at 01:30
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/108986/discussion-between-fluter-and-bug-killer). – fluter Apr 13 '16 at 01:38
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/108988/discussion-between-bug-killer-and-fluter). – Bug Killer Apr 13 '16 at 02:01
0

The issue is unrelated, someone was running a script that was killing the connection. Thank you all for your help!

Bug Killer