
I have a server written in C++ running on Ubuntu 10.04, currently in production, which exhibits a weird bug.

Context :

Each client connecting to the server has one socket and 2 threads

  • 1 thread for writing to the socket,
  • 1 thread for reading from the socket.

The socket is configured via ::setsockopt with SO_RCVTIMEO of 10 seconds.

Each ::send on the socket has the MSG_NOSIGNAL flag set (each ::recvfrom does too, but that should have no impact).
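To make the setup above concrete, here is a minimal sketch of how such a socket might be configured and used. The helper names (set_recv_timeout, send_no_sigpipe, recv_with_timeout) are made up for illustration and are not taken from the real server; the ::recvfrom call passes 0 as flags here for simplicity.

```cpp
#include <sys/socket.h>
#include <sys/time.h>
#include <cassert>
#include <cerrno>
#include <cstddef>

// Hypothetical helper: set a receive timeout (in seconds) on fd.
// Returns true if ::setsockopt succeeded.
bool set_recv_timeout(int fd, int seconds) {
    assert(fd >= 0);
    timeval tv;
    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    return ::setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == 0;
}

// Hypothetical helper: send without raising SIGPIPE. If the peer is gone,
// the call returns -1 with errno == EPIPE instead of killing the process.
ssize_t send_no_sigpipe(int fd, const void* buf, std::size_t len) {
    return ::send(fd, buf, len, MSG_NOSIGNAL);
}

// Hypothetical helper: blocking read. With SO_RCVTIMEO set, it is expected
// to return -1 with errno == EAGAIN or EWOULDBLOCK once the timeout expires.
ssize_t recv_with_timeout(int fd, void* buf, std::size_t len) {
    return ::recvfrom(fd, buf, len, 0, NULL, NULL);
}
```

In this sketch the reader thread would call recv_with_timeout in a loop and the writer thread would call send_no_sigpipe, both on the same descriptor.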

Bug :

I have some evidence (though I am not 100% sure) that the following scenario occurs on rare occasions :

  • ::recvfrom is called and blocks until either data is present or the timeout is reached
  • ::send is called, the write on the socket fails, and it returns an EPIPE (Broken Pipe) error
  • Bug : ::recvfrom is still blocked and never returns, somehow ignoring the SO_RCVTIMEO option

Does the above scenario make sense to you ?

Metrics :

The bug happens approximately once a week. During a week, there are approximately :

  • 20K sockets used
  • 30M ::recvfrom calls
  • 60M ::send calls

Should I use the timeout feature of ::select instead ? (assuming that its timeout implementation is different from the SO_RCVTIMEO one)
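For reference, the ::select variant I have in mind would look roughly like the sketch below. The helper name wait_readable is made up, and I have not verified that this behaves any differently from SO_RCVTIMEO in the failing case.

```cpp
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <cassert>
#include <cerrno>
#include <cstddef>

// Hypothetical helper: wait up to `seconds` for fd to become readable.
// Returns 1 if fd is readable (data or end-of-stream), 0 on timeout,
// and -1 on error (errno is set by ::select).
int wait_readable(int fd, int seconds) {
    assert(fd >= 0 && fd < FD_SETSIZE);
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);

        // Rebuilt on every iteration because ::select may modify it.
        timeval tv;
        tv.tv_sec = seconds;
        tv.tv_usec = 0;

        int rc = ::select(fd + 1, &readfds, NULL, NULL, &tv);
        if (rc < 0 && errno == EINTR) {
            continue;  // interrupted by a signal: retry with a full timeout
        }
        return rc > 0 ? 1 : rc;
    }
}
```

The reader thread would call wait_readable(fd, 10) and only call ::recvfrom when it returns 1, treating a return value of 0 as a missed ping.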

Thanks a lot for any idea on this matter !

R4f
  • Please add a language tag. This looks like C++? – Gray Jun 14 '12 at 17:53
  • Does your metric observe cases when recvfrom times out as expected? That is, do you have a proof that SO_RCVTIMEO works in your server? – Pavel Zdenek Jun 14 '12 at 19:52
  • select() or even epoll() is indeed a more standard way to handle socket communication. – Brady Jun 15 '12 at 07:30
  • The language is C++, but it could be C. – R4f Jun 15 '12 at 08:42
  • SO_RCVTIMEO works : we use it for a ping system similar to what you find in MMOs. If we don't get a ping from the client within 10 seconds, we close the connection. – R4f Jun 15 '12 at 08:43
  • select or epoll won't change the fact that I need to send data asynchronously and at any moment. Would select or epoll unblock if send "detects" a broken pipe ? At least for the old select, it depends on how the timeout is implemented, but is that different from SO_RCVTIMEO, which seems to be some kind of convenience facility ? – R4f Jun 15 '12 at 10:35
