
When conducting a stress test on some server code I wrote, I noticed that even though I am calling close() on the descriptor (and checking the result for errors), the descriptor is not released, which eventually causes accept() to return the error "Too many open files".

Now, I understand that this is because of the ulimit; what I don't understand is why I am hitting it when I call close() after each synchronous accept/read/send cycle.
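Roughly, the cycle looks like this (a sketch of the shape of the code, not the actual server source; serve and listen_fd are placeholder names):

#include <cstdio>
#include <sys/socket.h>
#include <unistd.h>

void serve(int listen_fd) {                          // listen_fd: the bound, listening socket
    for (;;) {
        int client = accept(listen_fd, NULL, NULL);  // eventually fails with EMFILE
        if (client == -1)
            break;
        // ... read() the request, send() the response, synchronously ...
        if (close(client) == -1)                     // the result *is* checked
            perror("close");
    }
}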

I am validating that the descriptors are in fact still open by running lsof under watch:

ctsvr  9733 mike 1017u  sock     0,7      0t0 3323579 can't identify protocol
ctsvr  9733 mike 1018u  sock     0,7      0t0 3323581 can't identify protocol
...

And sure enough there are about 1000 or so of them. Furthermore, checking with netstat I can see that there are no connections hanging in any TCP state (nothing in TIME_WAIT, CLOSE_WAIT, or the like).

If I simply do a single connect/send/recv from the client, I notice that the socket still stays listed in lsof, so this is not even a load issue.

The server is running on an Ubuntu Linux 64-bit machine.

Any thoughts?

– user1735067

• Assuming everything you say is true, this sounds like a kernel bug. A successful close absolutely must release the descriptor. – Nemo Oct 10 '12 at 14:05
• Are you calling `shutdown` and/or consuming all data in the socket before closing the socket handles? Do you have hanging `read`s or `write`s when you call `close`? – JimR Oct 10 '12 at 14:05
• Try using strace; if the connect/send/recv calls are as tightly coupled as you say, it should be pretty clear from the diagnostic output. – Gearoid Murphy Oct 10 '12 at 14:17
• `netstat --inet` will tell you the state of the TCP sessions. If they're "established", you haven't called `close()`. – Brian White Oct 10 '12 at 16:55
• Post some code if you would like us to help. If you are unable to do so, use a tool like valgrind to track descriptors. This sounds like an application bug to me. – Sam Miller Oct 11 '12 at 21:26
• In my case it was because the main and secondary threads didn't share a file descriptor table (I created a separate thread for each connection). To fix it, either add the `CLONE_FILES` flag to `clone`, or close the socket in the main thread after creating the secondary thread (but not both); see the sketch after this list. – quant2016 Apr 29 '21 at 11:18
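A minimal sketch of quant2016's point, assuming a clone()-based server (handle_connection and spawn_handler are hypothetical names, not from the question):

#define _GNU_SOURCE
#include <sched.h>
#include <csignal>
#include <cstdlib>
#include <unistd.h>

// Runs in the cloned "thread"; with CLONE_FILES its close() releases the
// descriptor in the parent's table too, because there is only one table.
static int handle_connection(void *arg) {
    int fd = (int)(long)arg;   // descriptor smuggled through the void* argument
    // ... read/send ...
    close(fd);
    return 0;
}

static void spawn_handler(int client_fd) {
    const size_t stack_size = 64 * 1024;
    char *stack = static_cast<char *>(std::malloc(stack_size));  // leaked; sketch only
    // Without CLONE_FILES the child gets a *copy* of the descriptor table,
    // so the parent must close(client_fd) itself; with it, parent and child
    // share one table and exactly one of them should close the socket.
    clone(handle_connection, stack + stack_size,
          CLONE_FILES | CLONE_VM | SIGCHLD,
          (void *)(long)client_fd);
}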

3 Answers


So, using strace (thanks Gearoid), a tool I have no idea how I ever lived without, I noted that I was in fact closing the descriptors.

However, for the sake of posterity, I lay bare my foolish mistake:

Socket::Socket() : impl(new Impl) {
    // Always allocates a fresh descriptor, even when the caller
    // already has one it wants this object to own.
    impl->fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    ....
}

Socket::ptr_t Socket::accept() {
    auto r = ::accept(impl->fd, NULL, NULL);
    ...
    ptr_t s(new Socket);  // constructor opens a descriptor via socket()...
    s->impl->fd = r;      // ...which is overwritten here and leaked, never closed
    return s;
}

As you can see, my constructor allocated a socket immediately, and then I replaced the descriptor with the one returned by accept(), leaking the constructor-allocated descriptor on every connection. I had refactored the accept code from a standalone Acceptor class into the Socket class without changing this.

Using strace I could easily see socket() being run each time, which led to my light-bulb moment.
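For posterity, one way to plug the leak (a sketch, assuming the same Impl layout; the fd-adopting constructor is an addition, not code from the original class):

Socket::Socket(int adopted_fd) : impl(new Impl) {
    impl->fd = adopted_fd;        // adopt an existing descriptor, allocate nothing
}

Socket::ptr_t Socket::accept() {
    auto r = ::accept(impl->fd, NULL, NULL);
    ...
    return ptr_t(new Socket(r));  // no throwaway socket() call to leak
}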

Thanks all for the help!

– user1735067

Have you ever called perror() after close()? I think the printed message will give you some help.
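For example (a minimal sketch; perror() is only meaningful when close() has actually just failed and set errno):

#include <cstdio>
#include <unistd.h>

void close_or_complain(int fd) {   // hypothetical helper name
    if (close(fd) == -1)
        perror("close");           // prints e.g. "close: Bad file descriptor"
}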

– HiJack

You are most probably hanging on a recv() or send() call. Consider setting a timeout using setsockopt.
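For instance, a receive timeout can be set like this (a sketch; the five-second value is arbitrary):

#include <sys/socket.h>
#include <sys/time.h>

// Make recv() on fd give up after five seconds of silence instead of
// blocking forever; it then returns -1 with errno set to EAGAIN/EWOULDBLOCK.
void set_recv_timeout(int fd) {
    struct timeval tv;
    tv.tv_sec  = 5;
    tv.tv_usec = 0;
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
}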

I noticed similar output from lsof when the socket had been closed on the other end but my thread kept it open, hanging on a recv() call waiting for data.

– phininity

  • Blocking in `recv()` or `send()` doesn't prevent closing of the socket or releasing of the FD. Your second paragraph describes an impossible situation. If the peer closed the connection your `recv()` would have ceased blocking and returned a zero return value. You had some other bug altogether. -1 – user207421 Dec 30 '13 at 03:23