
I have a server application serving thousands of clients for basic TCP operations. It accepts both plain text and TLS connections.

Some of the most heavily loaded servers have been hitting a strange issue for a few months. After a few hours or days, many sockets get stuck in the CLOSE_WAIT state until the server stops serving requests because it runs out of file descriptors.

Socket graphs here: https://ibb.co/WsvS1D9


We added logging, and we can see that the stuck sockets are blocked on the first read operation (for plain-text sockets) or on the handshake (for TLS sockets). When this happens, the async_read_until and async_handshake calls never complete and the associated deadline_timer callback is never called. It's as if the async method were blocking the thread execution!

The connection starts from the acceptor:

Log::writeLine("("+ StringUtils::printPointer(static_cast<void *>(this)) + ") [FD:"+
                       StringUtils::toString(socket().native_handle())+"] "+" Connection opened", Log::LEVEL_INFO);

/// Set a deadline for the read operation.
inout_deadline_.expires_from_now(boost::posix_time::seconds(TIMEOUT));

/// initiate calls to timer check
inout_deadline_.async_wait(
        strand_.wrap(boost::bind(&TcpServerConnection::check_deadline,
                                 shared_from_this(), boost::asio::placeholders::error, &inout_deadline_)));

Log::writeLine("("+ StringUtils::printPointer(static_cast<void *>(this)) + ") [FD:"+
               StringUtils::toString(socket().native_handle())+"] "+" async_wait initialized", Log::LEVEL_INFO);

if (useTLS_) {
    /// Start TLS handshake
    Log::writeLine("("+ StringUtils::printPointer(static_cast<void *>(this)) + ") [FD:"+
                   StringUtils::toString(socket().native_handle())+"] "+" async_handshake initialized", Log::LEVEL_INFO);

    socket_.async_handshake(boost::asio::ssl::stream_base::server,
                            strand_.wrap(
                                    boost::bind(&TcpServerConnection::handle_handshake, shared_from_this(),
                                                boost::asio::placeholders::error)));
} else {
    startReadLine();
}

In startReadLine():

inout_deadline_.expires_from_now(boost::posix_time::seconds(TIMEOUT));

Log::writeLine("("+ StringUtils::printPointer(static_cast<void *>(this)) + ") [FD:"+
               StringUtils::toString(socket().native_handle())+"] "+" async_read_until initialized", Log::LEVEL_INFO);
if(useTLS_) {
    boost::asio::async_read_until(socket_, lineBuffer_, lineDelimiter_,
      strand_.wrap(
      boost::bind(&TcpServerConnection::handle_read_line, shared_from_this(),
              boost::asio::placeholders::error,
              boost::asio::placeholders::bytes_transferred,
              false)));
} else {
    boost::asio::async_read_until(socket_.next_layer(), lineBuffer_, lineDelimiter_,
      strand_.wrap(
      boost::bind(&TcpServerConnection::handle_read_line, shared_from_this(),
              boost::asio::placeholders::error,
              boost::asio::placeholders::bytes_transferred,
              false)));
}
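
check_deadline is not shown above; for illustration, here is a minimal sketch in the usual Asio deadline-handler style (simplified, not our exact implementation). The key point is that when the deadline expires it closes the socket, which should make any pending async_read_until or async_handshake complete at once with operation_aborted:

void TcpServerConnection::check_deadline(const boost::system::error_code& /*ec*/,
                                         boost::asio::deadline_timer* timer)
{
    /// expires_from_now() aborts a pending wait, so this handler also runs
    /// early whenever the deadline is pushed back; only act if it has passed.
    if (timer->expires_at() <= boost::asio::deadline_timer::traits_type::now()) {
        /// Deadline expired: closing the underlying TCP socket forces any
        /// pending async_read_until / async_handshake to complete at once
        /// with boost::asio::error::operation_aborted, freeing the descriptor.
        boost::system::error_code ignored;
        socket_.lowest_layer().close(ignored);
        return;
    }

    /// Deadline not reached yet: re-arm and keep watching.
    timer->async_wait(
            strand_.wrap(boost::bind(&TcpServerConnection::check_deadline,
                                     shared_from_this(),
                                     boost::asio::placeholders::error, timer)));
}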

Below are two examples of the logs we get for sockets stuck in CLOSE_WAIT:

PLAIN SOCKET

Jan  4 18:30:28 INFO: (0x7efcf84865b0) [FD:831] Incoming connection: SOCKET RBUF=262142 SBUF=262142
Jan  4 18:30:28 INFO: (0x7efcf84865b0) [FD:831]  Connection opened
Jan  4 18:30:28 INFO: (0x7efcf84865b0) [FD:831]  async_wait initialized
Jan  4 18:30:28 INFO: (0x7efcf84865b0) [FD:831]  startReadLine() called!
Jan  4 18:30:28 INFO: (0x7efcf84865b0) [FD:831]  async_read_until initialized

handle_read_line() and check_deadline() are never called

TLS SOCKET

Jan  4 17:33:06 INFO: (0x7efcfc14a7f0) [FD:513] Incoming connection: SOCKET RBUF=262142 SBUF=262142
Jan  4 17:33:06 INFO: (0x7efcfc14a7f0) [FD:513]  Connection opened
Jan  4 17:33:06 INFO: (0x7efcfc14a7f0) [FD:513]  async_wait initialized
Jan  4 17:33:06 INFO: (0x7efcfc14a7f0) [FD:513]  async_handshake initialized

handle_handshake() and check_deadline() are never called

We tried updating to the latest versions of Boost (1.75) and OpenSSL (1.1.1i); nothing changed.

Any help would be much appreciated :)

  • I think this is a well-known pattern that arises when a server just drops connections without informing the other end. This is not unusual in high-performance servers. However, when clients do it, servers may become clogged with waiting sockets. I'd investigate the client end to see whether they correctly `shutdown` (both the SSL and the socket) before closing (a minimal shutdown sketch follows these comments). – sehe Jan 05 '21 at 14:31
  • /OT ["It's like the async method is blocking the thread execution!" - clearly you would notice this as a softlock/deadlock when the sockets pile up like this. So, they're not.] – sehe Jan 05 '21 at 14:31
  • The client is the browser (Firefox, Chrome, IE...). I assume there is no leak here... Regarding the deadlock, I'm not sure I understand your comment; could you explain? The issue is really random: sometimes it takes a few hours to reproduce, sometimes a few days, while the server load stays pretty much the same. That is also really strange. – usnake38 Jan 13 '21 at 20:53
  • "I assume there is no leak here..." - the point is not whether they leak, but whether they skip shutdown sequence. I wouldn't be surprised if they do. And perhaps the server is expected to proactively shutdown as well (heeding the `Connection: Close` header and such application protocol specific signals?) – sehe Jan 13 '21 at 20:59
  • The clients generally use keep-alive, and we also work with websockets. Socket disconnection is done on both sides. If there were an issue with socket closing, why would it only appear after days of working fine? – usnake38 Jan 14 '21 at 06:49
  • That's the right question to ask. I'm afraid we don't have any of the necessary context to come up with ideas. (PS. I read "After a few hours or days, there are many sockets blocked in CLOSE_WAIT state till the server stops serving requests" to mean that it always leaves sockets in CLOSE_WAIT, and that it merely _became a problem_ after some period of time. Can you check whether the load patterns are different at that time? In other words: does the behaviour actually change, or is it just the same behaviour escalating because of a different load?) – sehe Jan 14 '21 at 13:39
  • I finally found a workaround fortuitously! I've set up a task to check for CLOSE_WAIT sockets every hour and restart the ioservice if it finds more than usual. In fact, running the `ss` command during the rise of CLOSE_WAIT states (and before reaching the file descriptor limit) makes all the sockets in CLOSE_WAIT state switch to LAST_ACK and then get released... In the end, the ioservice doesn't have to be restarted!!! – usnake38 Jan 23 '21 at 07:08
  • perhaps it is worth writing an answer with your finding so that it can help others find this workaround. Comments aren't indexed and may be cleaned up (deleted) – sehe Jan 23 '21 at 12:12
  • I will. Any idea what the root cause of the issue might be? – usnake38 Jan 24 '21 at 14:26
  • Thanks @usnake38 The same as before, really. In particular, it doesn't make sense to me that _just observing using `ss`_ would get it unstuck (it **can't** because the sockets need to be closed). When I googled around yesterday I [asked this in a comment](https://stackoverflow.com/questions/15912370/how-do-i-remove-a-close-wait-socket-connection/17856372#comment116447717_17856372). – sehe Jan 24 '21 at 16:11
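
For reference, relating to the shutdown discussion in the comments above, here is a minimal sketch of what a graceful TLS + TCP close looks like with Boost.Asio on whichever side initiates it (synchronous calls and a free function for brevity; the name gracefulClose is purely illustrative):

#include <boost/asio.hpp>
#include <boost/asio/ssl.hpp>

/// Illustrative sketch: gracefully close a TLS connection before releasing it.
void gracefulClose(boost::asio::ssl::stream<boost::asio::ip::tcp::socket>& stream)
{
    boost::system::error_code ec;

    /// 1. TLS-level shutdown: send close_notify and wait for the peer's.
    stream.shutdown(ec);

    /// 2. TCP-level shutdown: send FIN. From this point the peer's socket sits
    ///    in CLOSE_WAIT until that process closes its own descriptor.
    stream.lowest_layer().shutdown(boost::asio::ip::tcp::socket::shutdown_both, ec);

    /// 3. Release the file descriptor.
    stream.lowest_layer().close(ec);
}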

1 Answer


I finally found a workaround fortuitously!

I've set up a task that checks for CLOSE_WAIT sockets every hour and restarts the ioservice if it finds more than usual.

Surprisingly, simply running the ss command while the number of CLOSE_WAIT sockets is rising (and before the file descriptor limit is reached) makes all the sockets in CLOSE_WAIT state switch to LAST_ACK and then get released after a few seconds...

As a result, the file descriptor limit is never reached and the ioservice doesn't have to be restarted!!!

The exact command (run by the app process every hour) is: ss -n -p state close-wait | wc -l
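
For anyone who wants to run the same check from inside the process, here is a minimal sketch of such an hourly task (the threshold, the logging and the popen-based counting are illustrative assumptions; only the ss command itself comes from the workaround above):

#include <boost/asio.hpp>
#include <cstdio>
#include <iostream>

/// Count sockets in CLOSE_WAIT by running the ss command (returns -1 on error).
int countCloseWaitSockets()
{
    int count = -1;
    if (FILE* pipe = popen("ss -n -p state close-wait | wc -l", "r")) {
        if (std::fscanf(pipe, "%d", &count) != 1)
            count = -1;
        pclose(pipe);
    }
    return count;
}

/// Schedule an hourly check on the given timer, re-arming itself each time.
void scheduleCloseWaitCheck(boost::asio::deadline_timer& timer)
{
    timer.expires_from_now(boost::posix_time::hours(1));
    timer.async_wait([&timer](const boost::system::error_code& ec) {
        if (ec)
            return;                            /// timer cancelled, stop
        const int closeWait = countCloseWaitSockets();
        std::cout << "CLOSE_WAIT sockets: " << closeWait << std::endl;
        if (closeWait > 100) {                 /// "more than usual": tune for your load
            /// alert here, or restart the io_service as a last resort
        }
        scheduleCloseWaitCheck(timer);         /// schedule the next check
    });
}

int main()
{
    boost::asio::io_service io;
    boost::asio::deadline_timer timer(io);
    scheduleCloseWaitCheck(timer);
    io.run();
}

In the real server this would of course hang off the existing io_service rather than its own main().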

Server graph with the workaround: https://ibb.co/qBTrykn

usnake38