I am currently implementing a multi-threaded network client application with epoll. My model is simple:

  1. get client_fd & write request to remote server

  2. set fd nonblocking & add it to epfd(EPOLLIN|EPOLLET|EPOLLONESHOT) to wait for response

  3. get EPOLLIN from fd, read the whole response and release the resources
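
Steps 2 and 3 could be sketched roughly as follows. This is only an illustration, not the code from my application: the `io_request` layout and the helper names `arm_for_response` and `drain` are assumptions made up for the example. It shows the one-shot, edge-triggered registration and the read-until-`EAGAIN` loop that edge-triggered mode requires.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical request record; the real io_request has more fields.
struct io_request {
    int fd;
};

// Step 2: make the fd non-blocking and register it for a single
// edge-triggered EPOLLIN notification.
// Returns 0 on success, -1 on failure (errno is set by the failing call).
int arm_for_response(int epfd, io_request* req) {
    assert(req != nullptr);
    int flags = fcntl(req->fd, F_GETFL, 0);
    if (flags == -1 || fcntl(req->fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    struct epoll_event ev = {};
    ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
    ev.data.ptr = req;  // handed back in evts[i].data.ptr by epoll_wait
    return epoll_ctl(epfd, EPOLL_CTL_ADD, req->fd, &ev);
}

// Step 3: with EPOLLET the fd must be read until EAGAIN (or EOF),
// because no further notification arrives for data already queued.
// Returns the number of bytes appended to out, or -1 on a real error.
ssize_t drain(int fd, std::string& out) {
    char buf[4096];
    ssize_t total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            out.append(buf, static_cast<size_t>(n));
            total += n;
            continue;
        }
        if (n == 0) return total;              // peer closed the connection
        if (errno == EINTR) continue;          // interrupted, retry
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                      // nothing more to read for now
        return -1;
    }
}
```

With `EPOLLONESHOT`, the fd stays registered but disabled after the first event, so it should not be reported again unless it is re-armed with `EPOLL_CTL_MOD`.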

The problem is that occasionally I get multiple EPOLLIN events on the same fd, even though it is registered with EPOLLIN|EPOLLET|EPOLLONESHOT. Since I had already released all the resources (including the client_fd) on the first EPOLLIN event, the second event crashed my program.

Any suggestions are greatly appreciated :)

Here is the code snippet:

// All threads wait on the semaphore, since only one thread should be
// in epoll_wait at a time (leader/follower model)
sem_wait(wait_sem); 

int nfds = epoll_wait(epoll_fd,evts,max_evt_cnt,wait_time_out);

// the leader now has the ready fds to process
for(int i =0; i < nfds; ++i){
    io_request* req = (io_request*)evts[i].data.ptr;
    int sockfd = req->fd;
    if(evts[i].events & EPOLLIN){
        ev.data.fd=sockfd;
        if(0!=epoll_ctl(epoll_fd,EPOLL_CTL_DEL,sockfd,&ev)){
            switch(errno){
                case EBADF:
                    // a duplicate EPOLLIN event makes EPOLL_CTL_DEL fail
                    WARNING("delete fd failed for EBADF");
                    break;
                default:
                    WARNING("delete fd failed for %d", errno);
            }
        }
        else{
            // workaround for now: only events whose fd was deleted
            // successfully are processed; the failed ones are ignored
            crt_idx.push_back(i);
        }
    }
}

if(crt_idx.size() != nfds) // just warn when this happens
    WARNING("crt_idx.size():%u != nfds:%d there has been some error!!", crt_idx.size(), nfds);

// the current leader wakes up the next leader and becomes a follower
sem_post(wait_sem);

for(int i = 0; i < crt_idx.size(); ++i)
{
    io_request* req = (io_request*)evts[crt_idx[i]].data.ptr;
    ...do business logic...
    ...release the resources & release the client_fd
}
elvis
  • What kernel version are you testing on? – David Schwartz Sep 10 '12 at 03:28
  • Linux version 2.6.9xenu_7-0-0-0 (gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)) #5 SMP Thu Sep 16 22:15:55 CST 2010 – elvis Sep 10 '12 at 05:16
  • Linux version 2.6.9_5-9-0-0 (gcc version 3.4.4 20050721 (Red Hat 3.4.4-2)) #1 SMP Wed Jun 23 14:03:19 CST 2010 – elvis Sep 10 '12 at 05:16
  • Hi David, thanks for replying. My Linux box info is appended above; is there anything wrong with the kernel? – elvis Sep 10 '12 at 05:18
  • No. The bugs I was suspecting were all fixed by 2.6.4. There's one other bug that you could have been triggering (though your code above is safe) but it was fixed in 2.6.9. (That one is triggered by the last parameter to an `EPOLL_CTL_DEL` modification being NULL, but you pass `&ev` and it was fixed in 2.6.9 anyway.) – David Schwartz Sep 10 '12 at 05:23
  • Any chance another thread is calling `close` on the socket? – David Schwartz Sep 10 '12 at 05:24
  • Currently I am wondering whether the second EPOLLIN event is caused by the peer closing the connection. It seems that a peer close causes EPOLLIN|EPOLLRDHUP, but my kernel is older than 2.6.17, where EPOLLRDHUP was first defined... – elvis Sep 10 '12 at 06:51
  • Another question that troubles me a lot: why, after EPOLL_CTL_DEL has succeeded on a certain fd, does another thread still get an EPOLLIN event for it? Can't the close event be ignored? – elvis Sep 10 '12 at 06:56
  • I suspect you have some kind of bug or race condition in your code somewhere. Look especially at where you close your sockets. – David Schwartz Sep 10 '12 at 21:40
  • You got it, David... A race condition caused the bug... there is nothing wrong with epoll... Thank you again for replying ;-) – elvis Sep 11 '12 at 08:00

1 Answer

I suspect you have some kind of bug or race condition in your code somewhere. Look especially at where you close your sockets.
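
The actual race in the asker's code is not shown, so the following is only a generic sketch of one way to make a double release harmless. It assumes a per-request `released` flag that is not in the original `io_request`, and the helper name `release_once` is hypothetical. The idea is that whichever thread flips the flag first is the only one allowed to deregister and close the socket.

```cpp
#include <atomic>
#include <cassert>
#include <sys/epoll.h>
#include <unistd.h>

// Hypothetical request record; the released flag is an assumption
// added for this sketch.
struct io_request {
    int fd;
    std::atomic<bool> released{false};
};

// Deregisters and closes req->fd at most once, no matter how many
// threads call it. Returns true only for the caller that did the work.
bool release_once(int epfd, io_request* req) {
    assert(req != nullptr);
    if (req->released.exchange(true))
        return false;              // another thread already released it
    struct epoll_event ev = {};    // non-NULL, as pre-2.6.9 kernels require
    // Remove the fd from the epoll set before closing it, so the fd
    // number cannot be reused while it is still registered.
    epoll_ctl(epfd, EPOLL_CTL_DEL, req->fd, &ev);
    close(req->fd);
    req->fd = -1;                  // make accidental reuse easy to spot
    return true;
}
```

This guards the socket only. The `io_request` object itself must still outlive every thread that might hold a pointer to it, otherwise the flag check becomes a use-after-free.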

David Schwartz