
I have an app running on AIX 6.1 with a single server process that listens for and polls TCP connections and simply forwards messages on to another process using an IPC message queue. It has been in service for years and routinely handles 1000 or more connections without problems. But yesterday we had a situation where connections to this server were timing out a large percentage of the time. At the same time, connections to another instance of this server (listening on a different port number on the same machine) were working fine. Also, data flowing over existing connections was fine, and the processor load was small.
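For context, the structure is roughly like the sketch below - this is not the actual production code; the port number, message queue key, buffer size, and connection cap are placeholders I've made up for illustration:

    /* Rough sketch of the server's shape: a poll()-based TCP listener that
     * forwards whatever it reads to a System V IPC message queue.
     * Error handling is omitted for brevity. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/ipc.h>
    #include <sys/msg.h>
    #include <netinet/in.h>
    #include <poll.h>
    #include <string.h>
    #include <unistd.h>

    #define MAX_CONNS 1024              /* placeholder connection cap */
    #define BACKLOG   5                 /* the small backlog discussed below */

    struct fwd_msg { long mtype; char mtext[4096]; };   /* placeholder size */

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(12345);   /* placeholder port */
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, BACKLOG);

        int qid = msgget(ftok("/tmp", 'Q'), IPC_CREAT | 0600);  /* placeholder key */

        struct pollfd fds[MAX_CONNS];
        int nfds = 1;
        fds[0].fd = lfd;
        fds[0].events = POLLIN;

        for (;;) {
            poll(fds, nfds, -1);

            /* New connection: accept it and add it to the poll set. */
            if ((fds[0].revents & POLLIN) && nfds < MAX_CONNS) {
                int cfd = accept(lfd, NULL, NULL);
                if (cfd >= 0) {
                    fds[nfds].fd = cfd;
                    fds[nfds].events = POLLIN;
                    fds[nfds].revents = 0;         /* not polled yet */
                    nfds++;
                }
            }

            /* Existing connections: read a message and forward it to the queue. */
            for (int i = 1; i < nfds; i++) {
                if (!(fds[i].revents & POLLIN))
                    continue;
                struct fwd_msg m = { .mtype = 1 };
                ssize_t n = recv(fds[i].fd, m.mtext, sizeof(m.mtext), 0);
                if (n <= 0) {                      /* peer closed or error */
                    close(fds[i].fd);
                    fds[i] = fds[--nfds];          /* compact the poll array */
                    i--;                           /* re-check the moved entry */
                } else {
                    msgsnd(qid, &m, (size_t)n, 0); /* forward to the other process */
                }
            }
        }
    }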

This situation persisted for several hours, and continued to happen even after restarting the listener in question. Then all of a sudden this morning, it cleared up and everything's working normally.

What's really odd about this is that internal connections from a client process running on the same server machine were also getting delayed and occasionally timing out.

Is there some way that connection requests can get stuck behind a failing connection from somewhere on the network, in such a way that the service seems unavailable to everyone for an extended period of time - and then just as quickly clears up and starts working normally? And could this affect only new connections, not existing sockets being handled by the same process?

Thanks, Rob

Note - we may have resolved this. It looks like there was a network location that was achieving only partial connections (perhaps inbound packets were routable, but outbound ones were not). In any case, there were a bunch of connections sitting in SYN_RCVD state, and the 'backlog' parameter on the server's listen socket was just 5, so a handful of users trying to connect from the bad location was enough to eat up the entire available backlog and keep everybody else out - I guess until something timed out the partial connections.
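In case it helps anyone else diagnose this, the effect is easy to reproduce with a toy listener (a sketch only; the port is arbitrary). With a backlog of 5 and no accept() calls, only the first handful of connection attempts get queued; later attempts hang and eventually time out, which is exactly how our port behaved while the half-open connections were occupying the queue:

    /* Sketch: a listener that never accepts, to show how a small backlog
     * makes the port look dead once the pending-connection queue fills up. */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in a;
        memset(&a, 0, sizeof(a));
        a.sin_family = AF_INET;
        a.sin_addr.s_addr = htonl(INADDR_ANY);
        a.sin_port = htons(5555);       /* arbitrary port for the test */
        bind(lfd, (struct sockaddr *)&a, sizeof(a));

        listen(lfd, 5);                 /* same small backlog our server used */

        pause();                        /* never accept(): connections pile up */
        return 0;
    }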

So, I guess I'll modify my question and just ask what a good rule of thumb is for setting the 'backlog' parameter to listen(). I've bumped it from 5 to 20 for now - but we lived with 5 for years without a problem.
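For what it's worth, the approach I'm leaning toward (just a sketch of my current understanding, not an authoritative rule): ask for a generous backlog and let the kernel clamp it to its own limit. SOMAXCONN from <sys/socket.h> is the defined ceiling, and as far as I know values above the system limit are silently truncated rather than rejected, so over-asking costs nothing:

    /* Sketch: request a generous backlog, bounded by SOMAXCONN.
     * LISTEN_BACKLOG is a placeholder; size it to the worst-case burst of
     * new connections (plus any half-open ones stuck in SYN_RCVD). */
    #include <sys/socket.h>

    #define LISTEN_BACKLOG 128          /* placeholder value */

    int start_listening(int lfd)
    {
        int backlog = (LISTEN_BACKLOG > SOMAXCONN) ? SOMAXCONN : LISTEN_BACKLOG;
        return listen(lfd, backlog);
    }

The way I'm thinking about it, the backlog only has to cover connections that arrive between accept() calls, plus (as we found out) any half-open connections stuck in SYN_RCVD, so the steady-state connection count doesn't really matter.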

  • Your description is sufficiently curious to be interesting. Is there any possibility that you have a trace, at any level in the stack, that you could review for problems? Restarting the service without resolution suggests a machine-wide problem - which you say did not happen. Is this possibly related to network behaviour where most of the connections were coming from? A TCP-level problem should not single out a specific port, which supports the idea that the problem was not in the server process or network stack, but further away. – Pekka Apr 16 '14 at 18:33

0 Answers