Short version: I get WSA_IO_PENDING when using blocking socket API calls. How should I handle it? The socket has the overlapped I/O attribute and is configured with a timeout.

Long version:

Platform: Windows 10, Visual Studio 2015

The socket is created in a very traditional, simple way.

    s = ::socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

By default the socket has the overlapped I/O attribute enabled. This can be verified with getsockopt / SO_OPENTYPE.

  • I do need the overlapped attribute because I want to use the timeout feature, e.g. SO_SNDTIMEO.
  • I use the socket only in a blocking (i.e., synchronous) manner.
  • Socket reads run only within a single thread.
  • Socket writes can be performed from different threads, synchronized with a mutex.

Timeouts and keep-alive are enabled on the socket with...

    ::setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, ...);

    ::setsockopt(s, SOL_SOCKET, SO_SNDTIMEO, ...);

    ::WSAIoctl(s, SIO_KEEPALIVE_VALS, ...);

The socket operations are done with

    ::send(s, sbuffer, ssize, 0);

and

    ::recv(s, rbuffer, rsize, 0);

I also tried using WSARecv and WSASend with both lpOverlapped and lpCompletionRoutine set to NULL.

[MSDN] ... If both lpOverlapped and lpCompletionRoutine are NULL, the socket in this function will be treated as a non-overlapped socket.

    ::WSARecv(s, &dataBuf, 1, &nBytesReceived, &flags, NULL/*lpOverlapped*/, NULL/*lpCompletionRoutine*/)

    ::WSASend(s, &dataBuf, 1, &nBytesSent, 0, NULL/*lpOverlapped*/, NULL/*lpCompletionRoutine*/)

The Problem:

Those blocking send / recv / WSARecv / WSASend calls return an error with the WSA_IO_PENDING error code!

Questions:

Q0: any reference on overlapped attribute with blocking call and timeout?

How does a socket behave when it has the overlapped "attribute" and the timeout feature enabled, but is used only through the blocking socket API with "non-overlapped I/O semantics"?

I could not find any reference about this yet (e.g. on MSDN).

Q1: is it expected behavior?

I observed this issue (getting WSA_IO_PENDING) after migrating code from Win XP / Win 7 to Win 10.

Here is the client code part (note: the asserts are not used in the real code; they just indicate here that the corresponding error would be handled and a faulty socket would stop the procedure):

    auto s = ::socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    assert(s != INVALID_SOCKET);

    // Winsock interprets SO_RCVTIMEO / SO_SNDTIMEO as a DWORD in milliseconds, not a timeval.
    DWORD timeout = 1500;

    assert(::setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, (const char*)&timeout, sizeof(timeout)) != SOCKET_ERROR);

    assert(::setsockopt(s, SOL_SOCKET, SO_SNDTIMEO, (const char*)&timeout, sizeof(timeout)) != SOCKET_ERROR);

    struct tcp_keepalive
    {
      unsigned long onoff;
      unsigned long keepalivetime;
      unsigned long keepaliveinterval;
    } heartbeat;
    heartbeat.onoff             = (unsigned long)true;                         
    heartbeat.keepalivetime     = (unsigned long)3000;
    heartbeat.keepaliveinterval = (unsigned long)3000;
    DWORD nob = 0;

    assert(0 == ::WSAIoctl(s, SIO_KEEPALIVE_VALS, &heartbeat, sizeof(heartbeat), 0, 0, &nob, 0, 0));

    SOCKADDR_IN connection;
    connection.sin_family = AF_INET;
    connection.sin_port = ::htons(port);
    connection.sin_addr.s_addr = ip;

    assert(::connect(s, (SOCKADDR*)&connection, sizeof(connection)) != SOCKET_ERROR);

    char buffer[100];
    int receivedBytes = ::recv(s, buffer, 100, 0);

    if (receivedBytes > 0)
    {
      // process buffer
    }
    else if (receivedBytes == 0)
    {
      // peer shutdown
      // we will close socket s
    }
    else if (receivedBytes == SOCKET_ERROR)
    {
      const int lastError = ::WSAGetLastError();
      switch (lastError)
      {
      case WSA_IO_PENDING:
          // ... this is the error I get!
          break;
      default:
          break;
      }
    }

Q2: How should I handle it?

Ignore it? Or just close the socket, as in a usual error case?

From what I observed, once I get WSA_IO_PENDING and just ignore it, the socket eventually stops responding.

Q3: How about WSAGetOverlappedResult?

Does it make any sense?

Which WSAOVERLAPPED object should I pass? I do not use one for any of those blocking socket calls.

I have tried creating a new, empty WSAOVERLAPPED and passing it to WSAGetOverlappedResult. It eventually returns success with 0 bytes transferred.

rnd_nr_gen
  • About Q3, it is all absolutely clear: `GetOverlappedResult` only accepts the address of the `OVERLAPPED` used in the I/O request. Passing another address is nonsense. As a result, if you use `send` or `recv`, you cannot call `GetOverlappedResult`. *It will eventually return with success with 0 byte transferred.* Try `ULONG cb, dwFlags; OVERLAPPED ov = { 0, 0x333}; WSAGetOverlappedResult(s,&ov, &cb, 0, &dwFlags);` and you get 0x333 bytes transferred :) The result of the operation and the bytes transferred are stored **inside** the overlapped, which is why you must use exactly the pointer that was used in the I/O request. – RbMm Sep 20 '18 at 08:40
  • About Q1: can you post minimal but **complete** code that gives this error? – RbMm Sep 20 '18 at 08:40
  • If you want to understand `WSAGetOverlappedResult` better, try the following code examples with a valid socket: `ULONG cb, dwFlags; OVERLAPPED ov = { STATUS_RECEIVE_PARTIAL, 0x333}; WSAGetOverlappedResult(s,&ov, &cb, 0, &dwFlags);` gives you 0x333 in cb and `dwFlags == MSG_PARTIAL`, or `ULONG cb, dwFlags; OVERLAPPED ov = { STATUS_CONNECTION_RESET }; ASSERT(!WSAGetOverlappedResult(s,&ov, &cb, 0, &dwFlags)); ASSERT(GetLastError() == WSAECONNRESET);` – RbMm Sep 20 '18 at 08:49
  • Thanks for the explanation! I added the client code part, dead simple. Now I really don't know whether there is any way other than treating WSA_IO_PENDING as a fatal error, since it is not expected in the blocking case. Or is there? I would prefer to avoid the alternative of rewriting it in a proper async way... – rnd_nr_gen Sep 20 '18 at 09:10
  • Arrr... it is just an example. The asserts in the code example only describe the workflow, which is not important to this question. Otherwise I could write out the complete error handling here; I just wanted a short version. – rnd_nr_gen Sep 20 '18 at 09:30
  • Because you do not pass a pointer to your own `OVERLAPPED` to `WSPSend`, it allocates its own on the stack and then sends the request to the kernel via ioctl. The ioctl returns `STATUS_PENDING`. In this case, with a local `OVERLAPPED`, the system waits in place: infinitely if you do not use `SO_RCVTIMEO`, otherwise for exactly the time you set yourself (1500 in your case). *As a test*, I advise removing `SO_RCVTIMEO`; my guess is that the error will disappear. If the wait times out (your case), `CancelIo` is called, and after that the system reads the final operation status from the `OVERLAPPED`. If it is still `STATUS_PENDING` there, you get `WSA_IO_PENDING`... – RbMm Sep 20 '18 at 10:32
  • This is device dependent, but normally when a driver cancels an I/O operation it sets the final status to `STATUS_CANCELLED` or another error. If the driver leaves `STATUS_PENDING` as the final status, that is a device error. I cannot reproduce such an error: I get `STATUS_CANCELLED` from the device, as expected. The Winsock layer changes it to `STATUS_IO_TIMEOUT` and finally returns `WSAETIMEDOUT`. If the error persists on your system, it looks like a device bug. It would be interesting to build a minimal exe for testing and try it on other systems. – RbMm Sep 20 '18 at 10:38
  • Impressive! That does explain the behavior. Feedback on your advice: 1) I can imagine that removing SO_RCVTIMEO would avoid the issue; however, this is legacy code and the logic unfortunately relies on the timeout. I would rather rewrite the whole thing if possible. 2) I initially suspected an internal Win10 change rather than a 'device bug'. Do you mean the NIC device? – rnd_nr_gen Sep 20 '18 at 10:54
  • I am also now playing around with a mock socket server: 1) server/WSASend (with valid overlapped and event) + client/recv, 2) server/send + client/recv. Case 1) produces more WSA_IO_PENDING issues than case 2). – rnd_nr_gen Sep 20 '18 at 10:58
  • I faced the very same issue on a regular basis using an Ada2012 DLL based on GNAT sockets with the timeout feature. I never managed to reproduce the error on my machines (Win 7/Win 10), but on a specific deployment machine (Win 10) it occurred frequently. Maybe this is related to network card driver issues? Or a specific service pack and/or Windows patch? Or another interfering app doing network calls? I could not point to any clear cause... Any ideas since 2018? – LoneWanderer Jun 13 '22 at 16:38

1 Answer

Q3: How about WSAGetOverlappedResult?

With [WSA]GetOverlappedResult we can only use the pointer to the WSAOVERLAPPED that was passed to the I/O request. Using any other pointer is senseless. WSAGetOverlappedResult gets all its information about the I/O operation from lpOverlapped: the final status, the number of bytes transferred, and, if it needs to wait, the event to wait on. In general, every I/O request must pass an OVERLAPPED (really an IO_STATUS_BLOCK) pointer to the kernel. The kernel directly modifies this memory, writing the final status and the information field (usually the bytes transferred). Because of this, the OVERLAPPED must stay valid until the I/O completes, and it must be unique for every I/O request.

[WSA]GetOverlappedResult checks this OVERLAPPED (really IO_STATUS_BLOCK) memory and first of all looks at the status. If it is anything other than STATUS_PENDING, the operation has completed, and the API takes the number of bytes transferred and returns. If it is still STATUS_PENDING, the I/O has not completed yet. If we want to wait, the API waits on hEvent from the overlapped. This event handle is passed to the kernel with the I/O request and is set to the signaled state when the I/O finishes. Waiting on any other event is senseless: how would it relate to this concrete I/O request? It should now be clear why we can call [WSA]GetOverlappedResult only with exactly the overlapped pointer that was passed to the I/O request.
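
To make the status check described above concrete, here is a minimal, portable sketch of the non-waiting path. It is only a model, not the real implementation: `OverlappedModel`, `PollResult` and `pollOverlapped` are hypothetical names, the struct mirrors just the first two members of OVERLAPPED, and the failure test assumes the usual NT_SUCCESS convention (high bit set means failure).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the first two members of OVERLAPPED.
struct OverlappedModel {
    std::uintptr_t Internal;      // final NTSTATUS of the request
    std::uintptr_t InternalHigh;  // information field, usually bytes transferred
};

constexpr std::uint32_t kStatusPending = 0x00000103;  // STATUS_PENDING

enum class PollResult { Completed, StillPending, Failed };

// Decides purely from the memory behind `ov`, which is why only the
// OVERLAPPED that the kernel actually writes to gives a meaningful answer.
inline PollResult pollOverlapped(const OverlappedModel& ov, std::uint32_t& bytes) {
    const auto status = static_cast<std::uint32_t>(ov.Internal);
    if (status == kStatusPending) return PollResult::StillPending;  // I/O not finished
    if (status & 0x80000000u) return PollResult::Failed;            // assumed !NT_SUCCESS
    bytes = static_cast<std::uint32_t>(ov.InternalHigh);
    return PollResult::Completed;
}

// Mirrors the experiment from the comments: a made-up overlapped "succeeds"
// with whatever byte count was written into it by hand.
inline void demoForgedOverlapped() {
    std::uint32_t bytes = 0;
    OverlappedModel forged{0, 0x333};
    assert(pollOverlapped(forged, bytes) == PollResult::Completed);
    assert(bytes == 0x333);
}
```

The model shows why a freshly zeroed WSAOVERLAPPED reports success with 0 bytes: its status field is 0 (success) and its byte count is 0, regardless of what the socket is doing.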

If we do not pass a pointer to an OVERLAPPED ourselves (for example, if we use recv or send), the low-level socket API allocates an OVERLAPPED itself, as a local variable on the stack, and passes a pointer to it with the I/O request. As a result, the API cannot return in this case until the I/O has finished: the overlapped memory must stay valid until the I/O completes (on completion, the kernel writes data to this memory), but a local variable becomes invalid once we leave the function. So the function must wait in place.

Because of all this, we cannot call [WSA]GetOverlappedResult after send or recv. First, we simply have no pointer to the overlapped. Second, the overlapped used in the I/O request is already "destroyed" (more exactly, it lies on the stack below the top, in the trash zone). If the I/O has not completed yet, the kernel will later modify data at a random place on the stack, and when it finally completes this will have an unpredictable effect, from nothing happening to a crash or very unusual side effects. If send or recv returns before the I/O has completed, this will have a fatal effect on the process. This must never happen (if there is no bug in Windows).

Q2: How should I handle it?

As I tried to explain, if WSA_IO_PENDING is really returned by send or recv, this is a system bug. The good case is that the device completed the I/O with this result (even though it must not): then it is simply some unknown (for this situation) error code. Handle it like any general error; it does not require special processing (as it would in the asynchronous I/O case). If the I/O has really not completed yet (after send or recv returned), this means that at a random time (maybe already) your stack can be corrupted. The effect of this is unpredictable, and nothing can be done here. This is a critical system error.
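
A minimal sketch of that advice, assuming the caller only needs to distinguish data, orderly shutdown, timeout and "everything else". `RecvOutcome` and `classifyRecv` are hypothetical names, and the numeric fallbacks exist only so the sketch also compiles off Windows; on Windows the values come from winsock2.h.

```cpp
#include <cassert>

#ifdef _WIN32
#include <winsock2.h>
#else
// Fallback values so the sketch compiles off Windows
// (assumption: same numbers as the Windows headers).
constexpr int SOCKET_ERROR   = -1;
constexpr int WSA_IO_PENDING = 997;
constexpr int WSAETIMEDOUT   = 10060;
#endif

enum class RecvOutcome { Data, PeerClosed, TimedOut, Fatal };

// `received` is the return value of recv(); `lastError` is the value of
// WSAGetLastError() read immediately after a failed call.
inline RecvOutcome classifyRecv(int received, int lastError) {
    if (received > 0)  return RecvOutcome::Data;
    if (received == 0) return RecvOutcome::PeerClosed;
    if (lastError == WSAETIMEDOUT) return RecvOutcome::TimedOut;
    // WSA_IO_PENDING gets no special branch: on a blocking call it is
    // handled like any other unexpected error, i.e. close the socket.
    return RecvOutcome::Fatal;
}

// Usage: the case from the question ends up on the generic error path.
inline void demoClassify() {
    assert(classifyRecv(SOCKET_ERROR, WSA_IO_PENDING) == RecvOutcome::Fatal);
    assert(classifyRecv(SOCKET_ERROR, WSAETIMEDOUT) == RecvOutcome::TimedOut);
}
```

What the caller does on `TimedOut` is left open here, since the question only says that the legacy logic relies on the timeout.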

Q1: is it expected behavior?

No, this is absolutely not expected.

Q0: any reference on overlapped attribute with blocking call and timeout?

First of all, when we create a file handle we set or do not set the asynchronous attribute on it: with CreateFileW it is FILE_FLAG_OVERLAPPED, with WSASocket it is WSA_FLAG_OVERLAPPED, and with NtOpenFile or NtCreateFile it is FILE_SYNCHRONOUS_IO_[NON]ALERT (the reverse effect compared to FILE_FLAG_OVERLAPPED). All this information is stored in FILE_OBJECT.Flags: FO_SYNCHRONOUS_IO (the file object is opened for synchronous I/O) will be set or clear.

The effect of the FO_SYNCHRONOUS_IO flag is the following: the I/O subsystem calls a driver via IofCallDriver, and if the driver returns STATUS_PENDING while FO_SYNCHRONOUS_IO is set in the FILE_OBJECT, it waits in place (in the kernel) until the I/O completes. Otherwise it returns this status, STATUS_PENDING, to the caller, which can wait in place itself or receive a callback via APC or IOCP.

When we use socket, it internally calls WSASocket:

The socket that is created will have the overlapped attribute as a default

This means the file will not have the FO_SYNCHRONOUS_IO attribute, and low-level I/O calls can return STATUS_PENDING from the kernel. But let's look at how recv works:

Internally, WSPRecv is called with lpOverlapped = 0. Because of this, WSPRecv allocates an OVERLAPPED itself on the stack, as a local variable, before doing the actual I/O request via ZwDeviceIoControlFile. Because the file (socket) was created without the FO_SYNCHRONOUS_IO flag, STATUS_PENDING is returned from the kernel. In this case WSPRecv checks whether lpOverlapped == 0. If so, it cannot return until the operation completes. It begins waiting on an event (maintained internally in user mode for this socket) via SockWaitForSingleObject - ZwWaitForSingleObject. The Timeout is the value you associated with the socket via SO_RCVTIMEO, or 0 (infinite wait) if you did not set SO_RCVTIMEO.

If ZwWaitForSingleObject returns STATUS_TIMEOUT (which can only happen if you set a timeout via SO_RCVTIMEO), the I/O operation did not finish in the expected time. In this case WSPRecv calls SockCancelIo (same effect as CancelIo). CancelIo must not return (it waits) until all I/O requests on the file (from the current thread) have completed. After this, WSPRecv reads the final status from the overlapped. It should be STATUS_CANCELLED (although really the concrete driver decides with which status it completes a canceled IRP). WSPRecv converts STATUS_CANCELLED to STATUS_IO_TIMEOUT, then calls NtStatusToSocketError to convert the NTSTATUS code to a Win32 error; STATUS_IO_TIMEOUT, say, is converted to WSAETIMEDOUT.

But if STATUS_PENDING is still in the overlapped after CancelIo, you get WSA_IO_PENDING, and only in this case. It looks like a device bug, but I cannot reproduce it on my own Win 10 (maybe the version plays a role).
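
The flow just described can be condensed into a small simulation. This is a portable model under stated assumptions, not the actual Winsock code: `FakeDevice`, `toSocketError`, `blockingRecvModel` and `kOtherError` are hypothetical, the driver and the wait are replaced by plain fields, and only the two status-to-error mappings named above are reproduced.

```cpp
#include <cassert>
#include <cstdint>

using NtStatus = std::uint32_t;
constexpr NtStatus kStatusSuccess   = 0x00000000;
constexpr NtStatus kStatusPending   = 0x00000103;  // STATUS_PENDING
constexpr NtStatus kStatusCancelled = 0xC0000120;  // STATUS_CANCELLED
constexpr NtStatus kStatusIoTimeout = 0xC00000B5;  // STATUS_IO_TIMEOUT

constexpr int kWsaIoPending = 997;    // WSA_IO_PENDING
constexpr int kWsaETimedOut = 10060;  // WSAETIMEDOUT
constexpr int kOtherError   = -1;     // placeholder for every other mapping

// Hypothetical stand-in for the kernel/driver side of one receive.
struct FakeDevice {
    NtStatus ioctlResult;            // what the initial I/O request returns
    bool     completesBeforeTimeout; // is the wait satisfied in time?
    NtStatus statusOnCompletion;     // status written on normal completion
    NtStatus statusAfterCancel;      // status the driver writes when canceled
};

// Only the conversions mentioned in the answer are modeled.
inline int toSocketError(NtStatus s) {
    if (s == kStatusSuccess)   return 0;
    if (s == kStatusIoTimeout) return kWsaETimedOut;
    if (s == kStatusPending)   return kWsaIoPending;
    return kOtherError;
}

inline int blockingRecvModel(const FakeDevice& dev) {
    NtStatus iosb = dev.ioctlResult;   // status slot of the stack-local overlapped
    if (iosb == kStatusPending) {
        if (dev.completesBeforeTimeout) {
            iosb = dev.statusOnCompletion;          // wait satisfied
        } else {
            iosb = dev.statusAfterCancel;           // timeout, then cancel
            if (iosb == kStatusCancelled) iosb = kStatusIoTimeout;
        }
    }
    return toSocketError(iosb);
}

// Usage: a well-behaved driver yields WSAETIMEDOUT; one that leaves
// STATUS_PENDING behind after the cancel yields WSA_IO_PENDING.
inline void demoTimeoutPaths() {
    assert(blockingRecvModel({kStatusPending, false, kStatusSuccess, kStatusCancelled}) == kWsaETimedOut);
    assert(blockingRecvModel({kStatusPending, false, kStatusSuccess, kStatusPending}) == kWsaIoPending);
}
```

In this model the only route to WSA_IO_PENDING is the timeout branch with a status that was never updated, which matches the "only in this case" claim above.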


What can be done here (if you are sure that you really got WSA_IO_PENDING)? First, try WSASocket without WSA_FLAG_OVERLAPPED: in this case ZwDeviceIoControlFile never returns STATUS_PENDING and you should never get WSA_IO_PENDING. Check whether the error is gone. If yes, restore the overlapped attribute and remove the SO_RCVTIMEO call (all this is for testing, not a solution for a release product) and check whether the error is gone after that. If yes, it looks like the device completes the canceled IRP incorrectly (with STATUS_PENDING?!). The point of all this is to locate more precisely where the error is. In any case, it would be interesting to build a minimal demo exe that reproduces the situation reliably and test it on other systems: does it persist? Only on specific versions? If it cannot be reproduced on other machines, you need to debug on your specific one.

RbMm
  • That is really good advice for me to look and dig deeper! I just want to know one more thing... how do you get such "internal" knowledge (wrt the answer to Q0)? It is pretty helpful for understanding the issue. – rnd_nr_gen Sep 20 '18 at 12:43
  • @rnd_nr_gen - The situation is interesting. If you want to research it, first of all you need to build a minimal exe that reproduces it reliably, say based on your code (I built one, but for me everything works correctly). Try removing everything extra (like `SIO_KEEPALIVE_VALS`, which I think is not related to the problem, whereas `SO_RCVTIMEO` is very important). If you build an exe that reproduces the pending error reliably, first check where it occurs: specific Windows versions? Exactly one machine? Try modifications: disable asynchronous I/O first, then enable it again and remove the timeout. In any case, it will be interesting to research exactly this. – RbMm Sep 20 '18 at 13:10