We have an application that uses epoll to listen for and process HTTP connections. Sometimes epoll_wait() reports a close event on the same fd twice in a row. That is, epoll_wait() returns a connection fd on which read()/recv() returns 0. This is a problem because I store a malloc'ed pointer in the epoll_event struct (struct epoll_event.data.ptr), and that pointer is freed the first time the fd (socket) is detected as closed. The second time, the application crashes.
This problem occurs very rarely in real use (except at one site, which has around 500-1000 users per server). I can reproduce it using the HTTP load tester siege with >1000 simultaneous connections per second. In that case the application segfaults (because of the invalid pointer) at unpredictable times: sometimes after a few seconds, usually after tens of minutes. I have also been able to reproduce the problem with fewer connections per second, but then I have to run the application for a long time: many days, even weeks.
All new connection fds returned by accept() are set non-blocking and added to epoll as one-shot and edge-triggered, waiting for data to become readable. So for some reason, when the server load is high, does epoll think that my application didn't get the close event and queue a new one?
epoll_wait() runs in its own thread and queues fd events to be handled elsewhere. I noticed the multiple closes with simple code that checks whether an event arrives from epoll twice in a row for the same fd. It did happen, and both events were closes (recv(.., MSG_PEEK) told me this :)).
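A minimal sketch of the kind of duplicate-close check described above. This is not the actual code from the application: the names peer_has_closed, dup_tracker and is_repeated_close are hypothetical, and MSG_DONTWAIT is added here only so the probe cannot block.

```c
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Returns true if a peek sees end-of-file (recv() == 0), i.e. the peer has
 * closed its side. MSG_PEEK leaves any pending data in the socket buffer. */
static bool peer_has_closed(int fd)
{
    char byte;
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
    return n == 0;
}

/* Hypothetical tracker: remembers the previous event's fd and whether
 * that event looked like a close. */
struct dup_tracker {
    int  last_fd;
    bool last_was_close;
};

/* Returns true when this event and the previous one are both closes
 * on the same fd number. */
static bool is_repeated_close(struct dup_tracker *t, int fd)
{
    bool closed   = peer_has_closed(fd);
    bool repeated = closed && t->last_was_close && t->last_fd == fd;

    t->last_fd        = fd;
    t->last_was_close = closed;
    return repeated;
}
```

The tracker would be initialised with last_fd = -1 and consulted once per event returned by epoll_wait(). It only compares fd numbers, so it says nothing about whether the two events belong to the same connection.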
The epoll fd is created like this:
epoll_create(1024);
epoll_wait() is run as follows:
epoll_wait(epoll_fd, events, 256, 300);
The new fd is set non-blocking after accept():
int flags = fcntl(fd, F_GETFL, 0);
err = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
The new fd is added to epoll (client is a pointer to a malloc'ed struct):
static struct epoll_event ev;
ev.events = EPOLLIN | EPOLLONESHOT | EPOLLET;
ev.data.ptr = client;
err = epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client->fd, &ev);
After data from the fd has been received and handled, the fd is re-armed (necessary because of EPOLLONESHOT). At first I wasn't using edge-triggering and non-blocking I/O, but I tested them and got a nice performance boost. This problem existed before I added them, though. By the way, shutdown(fd, SHUT_RDWR) is used on other threads to trigger a proper close event to be received through epoll when the server needs to close the fd because of some HTTP error etc. (I don't actually know if this is the right way to do it, but it has worked perfectly).
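For completeness, a rough sketch of what re-arming and the shutdown-triggered close amount to. The helper names rearm_client and request_close and the struct client layout are hypothetical, and using EPOLL_CTL_MOD with the same event mask is an assumption about how the re-arm is done, since the post does not show that code.

```c
#include <sys/epoll.h>
#include <sys/socket.h>

/* Hypothetical per-connection state; only the fd field is known from the post. */
struct client {
    int fd;
};

/* Re-enables a one-shot fd after its event has been handled.
 * EPOLL_CTL_MOD replaces the whole event, so the mask and pointer are set again. */
static int rearm_client(int epoll_fd, struct client *client)
{
    struct epoll_event ev = {0};
    ev.events   = EPOLLIN | EPOLLONESHOT | EPOLLET;
    ev.data.ptr = client;
    return epoll_ctl(epoll_fd, EPOLL_CTL_MOD, client->fd, &ev);
}

/* Called from a worker thread that wants the connection gone: shuts down both
 * directions so the epoll thread sees the fd become readable with recv() == 0.
 * It does not close the fd or free the client. */
static int request_close(struct client *client)
{
    return shutdown(client->fd, SHUT_RDWR);
}
```

In this sketch the epoll thread remains the only place that closes the fd and frees the client, while worker threads only ever call request_close().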