
The new Windows API SetFileCompletionNotificationModes() with the flag FILE_SKIP_COMPLETION_PORT_ON_SUCCESS is very useful for optimizing an I/O completion port loop, because you get fewer I/O completions for the same HANDLE. But it also disrupts the entire I/O completion port loop, because you have to change a lot of things, so I thought it was better to open a new post about everything that needs to change.

First of all, when you set the flag FILE_SKIP_COMPLETION_PORT_ON_SUCCESS, it means that you won't receive I/O completions anymore for that HANDLE/SOCKET until all of the bytes are read (or written), i.e. until there is no more I/O to do, just like in Unix when you get EWOULDBLOCK. When you receive ERROR_IO_PENDING again (so a new request is pending), it's just like getting EWOULDBLOCK in Unix.
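For reference, enabling the mode itself is a single call. A minimal sketch (Win32-specific, not compiled here; the function name and flag are from the Windows SDK, and the socket is assumed to already be associated with a completion port):

```cpp
#include <winsock2.h>
#include <windows.h>

// Ask the kernel not to queue a completion packet when an overlapped
// call on this socket succeeds immediately.
bool EnableSkipOnSuccess(SOCKET s)
{
    return SetFileCompletionNotificationModes(
               (HANDLE)s, FILE_SKIP_COMPLETION_PORT_ON_SUCCESS) != 0;
}
```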

That said, I encountered some difficulties adapting this behavior to my IOCP event loop, because a normal IOCP event loop simply waits forever until there is some OVERLAPPED packet to process; the OVERLAPPED packet is processed by calling the correct callback, which in turn decrements an atomic counter, and then the loop waits again until the next packet arrives.

Now, if I set FILE_SKIP_COMPLETION_PORT_ON_SUCCESS, when an OVERLAPPED packet is returned to be processed, I process it by doing some I/O (with ReadFile() or WSARecv() or whatever), and it can be pending again (if I get ERROR_IO_PENDING) or not, if my I/O call completes immediately. In the former case I just have to wait for the next pending OVERLAPPED, but what do I do in the latter case?

If I try to do I/O until I get ERROR_IO_PENDING, it goes into an infinite loop: it will never return ERROR_IO_PENDING (until the HANDLE/SOCKET's counterpart stops reading/writing), so other OVERLAPPEDs will wait indefinitely. Since I am testing this with a local named pipe that writes or reads forever, it spins forever.

So I thought I would do I/O only up to a certain amount of X bytes, just like a scheduler assigns time slices, and if I get ERROR_IO_PENDING before X, that's fine, the OVERLAPPED will be queued again in the IOCP event loop. But what if I don't get ERROR_IO_PENDING?

I tried putting an OVERLAPPED that hasn't finished its I/O into a list/queue for later processing, calling the I/O APIs later (always with at most X bytes) after processing the other waiting OVERLAPPEDs, and I set the GetQueuedCompletionStatus[Ex]() timeout to 0, so the loop processes the listed/queued OVERLAPPEDs that haven't finished their I/O while also checking immediately for new OVERLAPPEDs without going to sleep.

When the list/queue of unfinished OVERLAPPEDs becomes empty, I can set the GQCS[Ex] timeout back to INFINITE. And so on.

In theory it should work perfectly, but I have noticed a strange thing: GQCS[Ex] with the timeout set to 0 returns the same OVERLAPPEDs that are still not fully processed (the ones sitting in the list/queue waiting for later processing) again and again.

Question 1: So, if I understand correctly, the OVERLAPPED packet is removed from the system only when all of its data has been processed?

Let's say that's ok: if I get the same OVERLAPPEDs again and again, I don't need to put them in the list/queue; I just process them like any other OVERLAPPED, and if I get ERROR_IO_PENDING, fine; otherwise I will process them again later.

But there is a flaw: when I call the callback that processes OVERLAPPED packets, I decrement the atomic counter of pending I/O operations. With FILE_SKIP_COMPLETION_PORT_ON_SUCCESS set, I don't know whether the callback was called to process a real pending operation, or just an OVERLAPPED waiting for more synchronous I/O.

Question 2: How can I get that information? Do I have to set more flags in the structure I derive from OVERLAPPED?

Basically I increment the atomic counter of pending operations before calling ReadFile() or WSARecv() or whatever, and when I see that the call returned anything other than ERROR_IO_PENDING or success, I decrement it again. With FILE_SKIP_COMPLETION_PORT_ON_SUCCESS set, I also have to decrement it when the I/O call completes successfully, because that means I won't receive a completion.
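That discipline can be sketched portably, with the Win32 calls replaced by a hypothetical `start_io()` stand-in (the names `IoResult`, `issue_read` and `on_completion` are illustrative only, not from any real API):

```cpp
#include <atomic>

enum class IoResult { Pending, CompletedInline, Failed };

std::atomic<int> pending_ops{0};

// Called by the IOCP loop for a queued completion, or directly after
// an inline success when the skip flag is set.
void on_completion()
{
    pending_ops.fetch_sub(1, std::memory_order_acq_rel);
}

// Stand-in for ReadFile()/WSARecv(): reports whether the operation
// pended, completed synchronously, or failed outright.
IoResult start_io(bool completes_inline, bool fails)
{
    if (fails) return IoResult::Failed;
    return completes_inline ? IoResult::CompletedInline : IoResult::Pending;
}

// Increment BEFORE issuing the I/O; undo only on outright failure.
// With FILE_SKIP_COMPLETION_PORT_ON_SUCCESS an inline success gets no
// IOCP packet, so the handler runs (and decrements) immediately.
void issue_read(bool completes_inline, bool fails = false)
{
    pending_ops.fetch_add(1, std::memory_order_acq_rel);
    switch (start_io(completes_inline, fails)) {
    case IoResult::Pending:         break;            // GQCS() delivers it later
    case IoResult::CompletedInline: on_completion();  break;
    case IoResult::Failed:
        pending_ops.fetch_sub(1, std::memory_order_acq_rel);
        break;
    }
}
```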

It's a waste of time incrementing and decrementing an atomic counter when your I/O call will likely complete immediately and synchronously. Can't I simply increment the atomic counter of pending operations only when I receive ERROR_IO_PENDING? I didn't do this before because I thought that if another thread completing my pending I/O gets scheduled before the calling thread can check whether the error is ERROR_IO_PENDING and increment the counter, the atomic counter would get messed up.

Question 3: Is this a real concern? Or can I just skip that and increment the atomic counter only when I get ERROR_IO_PENDING? It would simplify things very much.

Just one flag, and a lot of design to rethink. What are your thoughts?

Marco Pagliaricci
  • It sounds like you have a fundamental misunderstanding of what `FILE_SKIP_COMPLETION_PORT_ON_SUCCESS` actually does, and the way you describe things makes me think you are not using OVERLAPPED/IOCP operations correctly to begin with. And this was a very long description without any actual code to show what you are asking about. – Remy Lebeau Mar 27 '14 at 23:12
  • I'm sorry for the long description (I don't have a testcase to show the code yet, but I'll try to make one). Why did you say that IOCP operations are not used correctly? Having an atomic counter counting the pending I/O operations is quite normal, isn't it? Otherwise how can you keep track of the *last* pending operation? And from what I read on MSDN, `FILE_SKIP_COMPLETION_PORT_ON_SUCCESS` simply skips unnecessary I/O completions when the data is *immediately* available. Doesn't it do that? Can you elaborate on your comments, so I can learn? :) thanks for the reply! – Marco Pagliaricci Mar 28 '14 at 09:10
  • I have never seen a case where IOCP code uses a counter to keep track of pending operations. Almost always, an I/O function is passed a pointer to a record/class that contains the buffer being acted on, and then that pointer is reported by the IOCP event when the requested I/O is finished. Typically, such buffers are dynamically allocated and then freed when the event triggers. – Remy Lebeau Mar 28 '14 at 21:23
  • As for `FILE_SKIP_COMPLETION_PORT_ON_SUCCESS`, let's say you are reading data and data is already available. Normally, the read would return PENDING and schedule the IOCP event immediately, so you still have to wait for that event before you can process the buffer. With `FILE_SKIP_COMPLETION_PORT_ON_SUCCESS` enabled, the read would return SUCCESS and you would process the buffer immediately without waiting for the event first. – Remy Lebeau Mar 28 '14 at 21:26
  • Well, that's exactly what I have in mind; I'm sorry if I couldn't explain that in plain English, and there were some misunderstandings. However, the atomic counter is needed: when you have a class, e.g. Stream, that abstracts a HANDLE (or a SOCKET), how can you call a callback (e.g. onClose()) when the last pending operation has been completed on the socket, if *multiple* threads can do I/O operations on that socket? For instance, 1 thread is completing a read I/O operation, and another one is completing a write operation. After calling onRead(..); and onWrite(..); -> continue -> – Marco Pagliaricci Mar 28 '14 at 22:30
  • which thread should call onClose(); ? An atomic counter is the answer: the 2 different IOCP threads, after calling onRead(..) (and the other thread onWrite(..)), *both* decrement an atomic counter, so the last one that sees counter==0 calls the onClose(); callback, which in turn will free memory for the Stream object and for buffers, and so on. If you don't do that, you cannot know exactly whether there is still a pending OVERLAPPED packet, so *when* to deallocate the Stream object which holds the HANDLE, flags and other data. – Marco Pagliaricci Mar 28 '14 at 22:31
  • Len Holgate has discussed this many times in StackOverflow posts, and he uses this design in his Server Framework: the per-HANDLE atomic counter, and he also uses a per-OVERLAPPED atomic counter. After many tests, I must say that Len's method of using a per-HANDLE atomic counter is very effective, and useful for identifying when you're done with your data structures and there are no more pending I/O operations, so you can free memory. – Marco Pagliaricci Mar 28 '14 at 22:33
  • In my code, I have a thread that creates/accepts a socket, then posts a `WSARecv()` to start reading from it. Other threads are free to write to the socket at any time using `WSASend()`. Each buffer is dynamically allocated. A dedicated thread handles the IOCP events. When the IOCP thread detects a read event, it passes the buffer data to a separate processing thread and then posts another `WSARecv()` using the same buffer. When the IOCP thread detects a write event, it posts another `WSASend()` using the same buffer if not complete, otherwise it frees the buffer... – Remy Lebeau Mar 28 '14 at 22:54
  • ... If the IOCP thread detects an error, it frees the reported buffer and notifies the thread that created/accepted the socket associated with the buffer, so that thread can then close that socket. No counters are used or needed at all. – Remy Lebeau Mar 28 '14 at 22:55
  • I see, this is a possible design. Thanks for sharing. But from what I see, you use a single thread to process IOCP events? You said that other threads (so I guess multiple threads, not only 1) can write what they want by calling WSASend(). Do you use a locking mechanism for that, or just call WSASend()? When it detects an error, it alerts the IOCP loop to close and free the socket, but how can you know that there aren't still pending OVERLAPPED packets for that socket? I'm using a different approach: I have multiple threads "listening" on the same completion port, so I have -> continue -> – Marco Pagliaricci Mar 29 '14 at 08:54
  • -> so I have multiple threads that can read OVERLAPPED packets for the same sockets: e.g. 1 thread can complete a WSARecv()-issued OVERLAPPED "A", while another thread at the *same* time may complete a WSASend()-issued OVERLAPPED packet "B". In this scenario, how can you know which one, A or B, completed last, without any counter at all? I guess you can't. When multiple threads are completing multiple OVERLAPPEDs for the same socket, the only way to acknowledge that "ok, *all* pending OVERLAPPEDs are done, you can free memory and stuff now, you won't receive any more completions" is an atomic counter. – Marco Pagliaricci Mar 29 '14 at 08:58
  • If WSASend() fails, the IOCP event simply frees that buffer. I only close the socket on WSARecv() failures. If WSARecv() fails, another WSARecv() is not queued, and if it fails while WSASend() is pending, that is OK since the socket closure will abort WSASend() and trigger an IOCP event for it. – Remy Lebeau Mar 29 '14 at 18:30

1 Answer


As Remy states in the comments: your understanding of what FILE_SKIP_COMPLETION_PORT_ON_SUCCESS does is wrong. ALL it does is allow you to process the completed operation 'in line' if the call that you made (say WSARecv()) returns 0.

So, assuming you have a 'handleCompletion()' function that you would call once you retrieve the completion from the IOCP with GQCS(), you can simply call that function immediately after your successful WSARecv().
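A rough sketch of that shape, with the overlapped call replaced by a hypothetical `start_read()` stand-in that returns true when the call completed inline (data already available) and false when it pended (the names `kernel_buffer`, `handleCompletion` and `issue_reads` are illustrative only):

```cpp
#include <deque>

// Stand-in for data the kernel already has buffered for this handle.
std::deque<int> kernel_buffer = {1, 2, 3};
int processed = 0;

// Stand-in for an overlapped WSARecv(): true = completed inline,
// false = "ERROR_IO_PENDING", completion will arrive via GQCS().
bool start_read(int* out)
{
    if (kernel_buffer.empty()) return false;
    *out = kernel_buffer.front();
    kernel_buffer.pop_front();
    return true;
}

void handleCompletion(int data) { (void)data; ++processed; }

// With the skip flag set, a successful call is handled immediately
// with the same function the GQCS() loop would have used.
void issue_reads()
{
    int data;
    while (start_read(&data))
        handleCompletion(data);
    // The call finally pended; the next completion comes via GQCS().
}
```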

If you're using a per-operation counter to track when the final operation completes on a connection (and I do this for lifetime management of the per-connection data that I associate as a completion key) then you still do this in exactly the same way and nothing changes...

You can't increment ONLY on ERROR_IO_PENDING because then you have a race condition between the operation completing and the increment occurring. You ALWAYS have to increment before the API which could cause the decrement (in the handler) because otherwise thread scheduling can screw you up. I don't really see how skipping the increment would "simplify" things at all...

Nothing else changes. Except...

  1. Of course you could now have recursive calls into your completion handler (depending on your design), and this was something which was not possible before. For example: you can now have a WSARecv() call complete with a return code of 0 (because there is data available), and your completion handling code can issue another WSARecv() which could also complete with a return code of 0, and then your completion handling code would be called again, possibly recursively.

  2. Individual busy connections can now prevent other connections from getting any processing time. Suppose you have 3 concurrent connections, all of the peers are sending data as fast as they can, this is faster than your server can process the data, and you have, for example, 2 I/O threads calling GQCS(). Then with FILE_SKIP_COMPLETION_PORT_ON_SUCCESS you may find that two of these connections monopolise the I/O threads (all WSARecv() calls return success, which results in inline processing of all inbound data). In this situation I tend to have a counter of "max consecutive I/O operations per connection", and once that counter reaches a configurable limit I post the next inline completion to the IOCP and let it be retrieved by a call to GQCS(), as this allows other connections a chance.
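A portable sketch of that throttling idea, with hypothetical stand-ins (`Conn`, `start_read`, `drain`, `kMaxConsecutive` are illustrative; real code would call PostQueuedCompletionStatus() where the comment indicates):

```cpp
#include <deque>

// Stand-in for a connection whose reads keep completing inline.
struct Conn {
    std::deque<int> ready = std::deque<int>(10);  // 10 inline reads queued
    int consecutive = 0;
};

const int kMaxConsecutive = 4;  // configurable limit (illustrative)
int handled = 0;
int reposted = 0;

bool start_read(Conn& c)        // true = completed inline
{
    if (c.ready.empty()) return false;
    c.ready.pop_front();
    return true;
}

// Process inline completions, but once the limit is exceeded, hand the
// current completion back to the port (simulated by `reposted`) instead
// of processing it, so other connections get a turn.
void drain(Conn& c)
{
    c.consecutive = 0;          // reset when retrieved via GQCS()
    while (start_read(c)) {
        if (++c.consecutive > kMaxConsecutive) {
            ++reposted;         // PostQueuedCompletionStatus() here
            return;
        }
        ++handled;
    }
}
```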

Len Holgate
  • Well, this is exactly what I was doing. Sorry if I messed up and couldn't explain it well, but my concern was this: let's say we have an IOCP loop like the one you just described, and we have `handleCompletion()`; if I call that after an inline `WSARecv()`, I won't receive OVERLAPPED I/O completions anymore for that socket! Unless I loop `WSARecv()` until it returns ERROR_IO_PENDING. That's simply because if it completes "inline", it doesn't queue the OVERLAPPED again to wait for the next I/O. Is my understanding correct? – Marco Pagliaricci Mar 31 '14 at 08:55
  • No. You only ever get ONE completion for each and every overlapped operation that you start. If you have enabled `FILE_SKIP_COMPLETION_PORT_ON_SUCCESS` then that completion is signalled by a success return from the overlapped API rather than being signalled by the completion being returned by a call to `GQCS()`. That's the ONLY difference. The only way to get more completions for a given `OVERLAPPED` structure is to initiate another overlapped I/O request. Each `OVERLAPPED` structure used in each and every I/O request MUST be unique (until the I/O completes, when it can be reused). – Len Holgate Mar 31 '14 at 09:39
  • Let me rephrase that: let's suppose I use only 1 OVERLAPPED for a SOCKET from which I only want to *read* data, so I use `WSARecv()`. No other I/O operations on this socket. Now I want to keep that OVERLAPPED, instead of destroying it and allocating a new one for each `WSARecv()` request, and this is simply done. I call `WSARecv()` and it returns ERROR_IO_PENDING. In the `GQCS()` loop I'll receive that OVERLAPPED, so I call `handleCompletion()` to make sure the `onRead()` callback is called. Ok, done. Now I want to receive *more* I/O on that socket, and the only way is to call `WSARecv()` with that same-> – Marco Pagliaricci Mar 31 '14 at 10:07
  • -> same OVERLAPPED I have just used. Now there are 2 possibilities: `WSARecv()` returns ERROR_IO_PENDING, which means I'll get that OVERLAPPED again somewhere, maybe in some other thread, OR it returns 0, meaning it completed the I/O inline, immediately. I call `handleCompletion()` again, because there is more data to pass to `onRead()` that we've just "read" inline; then what? Now I have to call `WSARecv()` again if I want that OVERLAPPED pending again, but what do I do if it returns 0 again, so it completes the read inline again? A loop until I receive ERROR_IO_PENDING? – Marco Pagliaricci Mar 31 '14 at 10:07
  • Yes. If an I/O call returns synchronously it just means (in the case of a recv) that there was already data available in the TCP stack's buffers. As I said before, there is no difference apart from where you process the completion. If the peer is sending data faster than you are processing it then you may never receive an ERROR_IO_PENDING. Be careful about the potentially recursive nature of all of this (it depends on how your code is structured); if the WSARecv() is being called from your 'handleCompletion()' and that in turn calls 'handleCompletion()' you could end up overflowing the stack. – Len Holgate Mar 31 '14 at 11:03
  • Perfect. Sorry for all the mess, but this is exactly the point I wanted to discuss from the start! I couldn't describe it very well, sorry. Now, what I want to point out is that if I'm looping `WSARecv()` (or `ReadFile()`, it depends on the nature of the I/O endpoint), I can end up in an infinite loop if I have a local named pipe that forever writes! Dunno if this happens with sockets too, but with named pipes it does happen, so, if we don't want our application to hang reading from a malicious named pipe that forever writes, we have to handle this, and my solution -> – Marco Pagliaricci Mar 31 '14 at 11:25
  • -> is to stop calling `WSARecv()` or `ReadFile()` after reading *immediately* (so inline) MAX bytes. Now, what do I do after inline-reading MAX bytes? My first thought was to set the timeout of `GQCS()` to 0, and process "unfinished" readings interspersed with other OVERLAPPED completions, but it doesn't work very well. So now I make a post to the completion port, which reads again until ERROR_IO_PENDING. If it reads > MAX bytes, it makes another post to the completion port, and so on! It works pretty well. – Marco Pagliaricci Mar 31 '14 at 11:26
  • I still don't understand what you're saying; you are reading from a pipe which always has data ready? That's no different to having a socket peer that writes faster than you can read. I've added a note about connection starvation to my answer above. – Len Holgate Mar 31 '14 at 13:14
  • Oh, great, I have just read your edits. Thank you, so this is just what I was saying from the start: having a max I/O operations counter (or a counter of max bytes per I/O operation) before we stop the loop. :) When you hit the max I/O operations per socket, do you post the next I/O "session" of operations with `PostQueuedCompletionStatus()`? – Marco Pagliaricci Mar 31 '14 at 16:40
  • When I hit my "max ops per connection" counter due to "too many" operations completing immediately for a single connection, I post the current completion (the one that took the counter over the limit) to the IOCP rather than processing it in-line; I then 'return' out of the handler and loop back around to GQCS() (though it's a little more complex than that in my framework). Whenever I process an operation for a connection that is retrieved from the IOCP via GQCS(), I reset the 'consecutive operations' counter for the connection. – Len Holgate Mar 31 '14 at 17:23