How to diagnose large number of TIME_WAIT connections

Question

We have a production issue with only one of our servers and have correlated slow performance with an abundance of sockets in the TIME_WAIT state. Without drawing this question into a huge backstory, we basically know that every time the server is slow, about 80% of the server's sockets are in the TIME_WAIT state (which we see by running netstat). Because TIME_WAIT sockets time out and go away, when our server is slow we see these TIME_WAITs crop up very frequently (about every 5 - 10 minutes).

I did a little digging and see that TIME_WAITs occur when the server closes an active connection but keeps it around in case any delayed packets come through. Eventually TIME_WAIT times out.

Is there any way to see exactly why an individual socket went into the TIME_WAIT state to begin with? This is CentOS 5. Does Linux log this info anywhere under /var/log, or is there any way to run tcpdump and look for a specific pattern that leads to a TIME_WAIT? Thanks in advance.

It's absurd. You did a little digging, so you should know that TIME_WAIT is just another standard state that every closed connection's socket goes through. So what's the purpose of asking why it went into TIME_WAIT? Because the connection was closed, man. – poige Apr 05 '13 at 15:43
Yes, but *why* was the connection closed, by whom, and how to determine both, man. – Mara Apr 05 '13 at 15:55

score 1 · Accepted Answer · answered Apr 05 '13 at 12:54


Short answer - it is due to an app. The app creates sockets for a short time, closes them, then immediately needs to open another socket. The sluggishness is related to the process(es) running out of sockets to use.

When creating a socket there are options - SO_REUSEADDR and SO_REUSEPORT. They have somewhat similar functions, but I suspect in CentOS 5 SO_REUSEPORT is not available. Anyway, the optional setting on a socket call allows the port to be immediately reused.

So, a commonly used fix is to recode. It is probably a net app that connects for a few seconds then ends the session.
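As a rough illustration of what "recoding" could look like, here is a minimal sketch in Python that sets SO_REUSEADDR before binding. The function name `make_listener` and its default arguments are made up for this example, and it assumes a reasonably modern Python 3 rather than whatever ships with CentOS 5. Note that the option takes effect at bind() time, so it mainly helps sockets that bind to a specific local port.

```python
import socket

def make_listener(host="127.0.0.1", port=0):
    """Create a TCP listening socket with SO_REUSEADDR enabled.

    host/port are illustrative defaults; port 0 asks the kernel
    to pick any free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind(): lets bind() succeed even if an old
    # connection on this address/port is still sitting in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(5)
    return sock
```

Calling `make_listener()` returns a socket that is already bound and listening; the caller is responsible for closing it.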


jim mcnamara


Thanks @jim mcnamara (+1) - several quick followups: (1) is this SO_REUSEADDR a Linux construct, or something set in the hardware/NIC? (2) If we could re-code the violating app to use SO_REUSEADDR, then when the app closes a socket and then immediately tries to open a new one, I assume SO_REUSEADDR just finds the last one that is TIME_WAITed and opens it? Could that cause any problems if there are delayed packets coming in from the previous connection? And (3) how could I diagnose which app is the culprit here? FYI we're not using net app. Thanks again! – Mara Apr 05 '13 at 13:04

score 1 · Answer 2 · answered Apr 05 '13 at 15:18

It sets a property on the socket, which is then allowed/enforced by the kernel.

SO_REUSEADDR is a POSIX-compliant option, set with setsockopt() when creating a socket.

http://pubs.opengroup.org/onlinepubs/009695399/functions/setsockopt.html

Short answer - yes, and yes. So if you are making really slow connections to a lonely remote office on slow DSL, there may be an issue with "tardy" packets. But if these are connections in your LAN, probably not.
One of your apps has to be opening sockets wholesale and then closing them. lsof will show which PID has a socket open. From there you can derive the user and what is being run. It could be something as simple as a bash shell script abusing netcat, for example.

Bottom line: It is either an abuse of network facilities or a code problem. And you do have a net app - this one is eating your system. By 'net app' I mean anything using TCP/UDP sockets, not necessarily a web server.
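One caveat when hunting for the culprit: a socket in TIME_WAIT is no longer attached to a process, so it helps to tally the TIME_WAIT entries by port and then look for the app that talks on the busiest port. Below is a minimal sketch that does this from the text of /proc/net/tcp (IPv4 only), where state code 06 means TIME_WAIT. The function name `count_time_wait` is made up for this example, and it assumes Python 3.

```python
from collections import Counter

TIME_WAIT = "06"  # TCP state code for TIME_WAIT in /proc/net/tcp

def count_time_wait(proc_net_tcp_text):
    """Tally TIME_WAIT sockets by local port and by remote port.

    Takes the full text of /proc/net/tcp and returns two Counters:
    (by_local_port, by_remote_port).
    """
    by_local, by_remote = Counter(), Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip header row
        fields = line.split()
        # fields: sl, local_address, rem_address, st, ...
        if len(fields) < 4 or fields[3] != TIME_WAIT:
            continue
        # Addresses look like HEXIP:HEXPORT; only the port is decoded.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        by_local[local_port] += 1
        by_remote[remote_port] += 1
    return by_local, by_remote
```

To use it, read /proc/net/tcp into a string, pass it to `count_time_wait`, and print `most_common(5)` on each Counter. If one local port dominates, a local service on that port is probably closing incoming connections itself; if one remote port dominates, a local client is probably making many short-lived outgoing connections to that service.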
