I'm looking for some way to get a signal on an I/O completion port when a socket becomes readable/writeable (i.e. the next send/recv will complete immediately). Basically I want an overlapped version of select.

(Yes, I know that for many applications, this is unnecessary, and you can just keep issuing overlapped send calls. But in other applications you want to delay generating the message to send until the last moment possible, as discussed e.g. here. In these cases it's useful to do (a) wait for socket to be writeable, (b) generate the next message, (c) send the next message.)

So far the best solution I've been able to come up with is to spawn a thread just to call select and then PostQueuedCompletionStatus, which is awful and not particularly scalable... is there any better way?

Nathaniel J. Smith
  • You don't need any of this. Once a socket is connected, it is always "writeable" and "readable". You can have multiple overlapped sends outstanding at a time, although it only makes sense to have one recv request outstanding at a time: issue a recv right after connecting, then another each time the previous one finishes, until disconnect. "Will complete immediately" has no meaning when you are using asynchronous I/O. – RbMm Jan 08 '17 at 00:10
  • I do need this and explained why in the question -- it's a way to minimize send buffering for latency-sensitive applications. (Alternatively I guess it would also be OK if there were a way to get an alert when the total send buffer size dropped below some low water mark, but I'm even less hopeful of that existing...) – Nathaniel J. Smith Jan 08 '17 at 00:18
  • When a send finishes, you are notified. So if, for example, you need to send a large amount of data, you can send just one chunk; when that chunk's send finishes, you get a notification on the IOCP, and inside that notification handler you send the next chunk, and so on. I have done this many times. – RbMm Jan 08 '17 at 00:22
  • There doesn't actually seem to *be* a function named WSASelect. Do you just mean select()? Have you looked at WSAAsyncSelect? – Harry Johnston Jan 08 '17 at 00:23
  • `select` isn't needed at all when using IOCP; our callback is simply called when any operation finishes. – RbMm Jan 08 '17 at 00:27
  • I think @RbMm is right, except that you might have to set the buffer size to zero first, see, e.g., https://support.microsoft.com/en-us/kb/214397. – Harry Johnston Jan 08 '17 at 00:29
  • Some pseudocode: `/* callback when a send has finished */ void OnSend() { if (m_cbLeft) { cb = min(maxchunk, m_cbLeft); Send(cb); m_cbLeft -= cb; } }` – RbMm Jan 08 '17 at 00:29
  • ... the important question is what exactly it means when the previous asynchronous send() call is reported to be complete. I *suspect* that you'll find that this happens at the same time that select() reports the socket to be writable, at least if you've disabled buffering. But I don't know that for a fact. One of the experts might be able to answer. – Harry Johnston Jan 08 '17 at 00:34
  • What I mean is this: we can easily track how much data is currently buffered in the driver; call it `_size`. When we call Send with `cb` bytes, `_size += cb`; when a Send of `cb` bytes finishes, `_size -= cb`. If we see that `_size` has become too large, we stop sending. Then in the `onsend` callback, when we decrement `_size` and see that it has become small enough, we call send again. The whole point is to issue the additional send from the onsend callback. – RbMm Jan 08 '17 at 00:37
  • @HarryJohnston - `what exactly it means when the previous asynchronous send() call is reported to be complete.` - until the TCP driver has processed the send, it uses our send buffer, so the buffer must remain valid and *unchanged* until the send completes (this is because the kernel uses an MDL to map our buffer directly). When the send completes, it means the data has really been sent by the TCP driver. – RbMm Jan 08 '17 at 00:41
  • @HarryJohnston - I don't know whether this is documented, but in my experience the TCP driver does not copy the data into the kernel when we call send; it maps our buffer directly into kernel space and uses it the whole time until the data is transmitted over the network. Only when the send has really finished or failed is the send operation completed and the OVERLAPPED (IO_STATUS_BLOCK) queued to the IOCP. – RbMm Jan 08 '17 at 00:46
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/132601/discussion-between-rbmm-and-harry-johnston). – RbMm Jan 08 '17 at 01:01
  • You need more than just 'writable'. You need to use the low-watermark settings described in your link. @RbMm If your claim about the TCP send buffer were true, retransmissions would be impossible and the TCP send buffer would be redundant. – user207421 Jan 08 '17 at 01:31
  • @HarryJohnston: doh, you're right, I meant `select`. Edited question to reflect this. – Nathaniel J. Smith Jan 08 '17 at 03:07
  • @HarryJohnston: AFAICT from [KB214397](https://support.microsoft.com/en-us/kb/214397), Windows roughly speaking considers a socket writeable if the data from the next-to-last call to `send` has all been transmitted, at which point the next `send` call will complete instantly. Also AFAICT IOCP considers a `send` complete when the data has been copied into a kernel buffer; this generally happens before the data is transmitted, and certainly before it's ACKed. – Nathaniel J. Smith Jan 08 '17 at 03:12
  • It says "In most cases, the send completion in the application only indicates the data buffer in an application send call is copied to the Winsock kernel buffer and does not indicate that the data has hit the network medium. **The only exception is when you disable the Winsock buffering by setting SO_SNDBUF to 0.**" (emphasis mine) which to me implies that if you disable the buffer then send completion *does* mean that the data has hit the network. But I don't see how this fits in with the need for retransmission, so YMMV. – Harry Johnston Jan 08 '17 at 20:57
  • Yeah, it's not clear to me how `SO_SNDBUF=0` works either. And it sounds like it might disable buffering entirely, which you really don't want, b/c then the link goes idle in between `send` calls... – Nathaniel J. Smith Jan 09 '17 at 01:49
  • Can I recommend you to consider RIO? If this is the latency above all case, you could have a dedicated thread to poll the TX completion queue, and upon successful completion to construct a buffer and RIOSend() it, then return to regular polling business. You will burn a CPU core per completion queue with that. Now, we enter trade-off territory: you could use a wait loop driven by a completion port, a wait loop driven by an event, or a thread pool wait loop. All depends on latency/throughput trade-off. – Sergei Vorobiev Mar 07 '17 at 06:39
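
The byte-accounting idea from the comments above (add `cb` when a send is issued, subtract it when the send completes, and resume sending from the completion callback) can be sketched in portable C. This is a minimal sketch, not code from the thread: `send_budget`, its fields, and the two water-mark thresholds are hypothetical names, and no Winsock calls are made. As other comments point out, with default buffering a completion only means the data was copied into the kernel buffer, so this bounds the bytes you have queued rather than the bytes actually on the wire.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for bytes handed to overlapped sends
 * whose completions have not been dequeued yet. */
typedef struct {
    size_t in_flight;   /* bytes in sends that are still pending */
    size_t low_water;   /* resume sending at or below this level */
    size_t high_water;  /* stop issuing sends at or above this level */
    bool   paused;      /* true while we are waiting to drain */
} send_budget;

static void budget_init(send_budget *b, size_t low_water, size_t high_water)
{
    b->in_flight = 0;
    b->low_water = low_water;
    b->high_water = high_water;
    b->paused = false;
}

/* True if the caller may generate and issue another message now. */
static bool budget_can_send(const send_budget *b)
{
    return !b->paused;
}

/* Call right after issuing an overlapped send of cb bytes. */
static void budget_on_send_issued(send_budget *b, size_t cb)
{
    b->in_flight += cb;
    if (b->in_flight >= b->high_water)
        b->paused = true;
}

/* Call from the completion handler for a send of cb bytes.
 * Returns true exactly when sending should resume, i.e. the point
 * at which the next message should be generated and sent. */
static bool budget_on_send_completed(send_budget *b, size_t cb)
{
    assert(cb <= b->in_flight);   /* completions never exceed what was issued */
    b->in_flight -= cb;
    if (b->paused && b->in_flight <= b->low_water) {
        b->paused = false;
        return true;
    }
    return false;
}
```

In an IOCP loop, `budget_on_send_issued` would be called next to each `WSASend`, and `budget_on_send_completed` from the code that handles the dequeued completion; a `true` return is the "generate the next message now" signal from the question.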

2 Answers

It turns out that this is possible!

Basically the trick is:

  • Use the WSAIoctl SIO_BASE_HANDLE to peek through any "layered service providers"
  • Use DeviceIoControl to submit an AFD_POLL request for the base handle, to the AFD driver (this is what select does internally)

There are many, many complications that are probably worth understanding, but at the end of the day the above should just work in practice. This is supposed to be a private API, but libuv uses it, and MS's compatibility policies mean that they will never break libuv, so you're fine. For details, read the thread starting from this message: https://github.com/python-trio/trio/issues/52#issuecomment-424591743
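
As a rough illustration of the first step only, the base handle can be fetched with the documented `SIO_BASE_HANDLE` ioctl. `get_base_socket` is a hypothetical helper name, and the sketch is Windows-only (guarded so that it compiles to nothing elsewhere). The `AFD_POLL` request itself relies on undocumented structures and is deliberately not shown here; the linked thread covers those details.

```c
#ifdef _WIN32  /* Windows-only sketch */
#include <winsock2.h>
#include <mswsock.h>   /* SIO_BASE_HANDLE */

/* Hypothetical helper: return the base provider's socket for `s`,
 * or INVALID_SOCKET if the ioctl fails. The returned handle refers
 * to the same underlying socket as `s`; it is not a duplicate. */
static SOCKET get_base_socket(SOCKET s)
{
    SOCKET base = INVALID_SOCKET;
    DWORD bytes = 0;

    if (WSAIoctl(s, SIO_BASE_HANDLE,
                 NULL, 0,               /* no input buffer */
                 &base, sizeof(base),   /* output: the base handle */
                 &bytes, NULL, NULL) != 0)
        return INVALID_SOCKET;

    return base;
}
#endif
```

A caller would check for `INVALID_SOCKET` and inspect `WSAGetLastError()` before deciding how to proceed, since a layered provider may not cooperate with this ioctl.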

Nathaniel J. Smith

For detecting that a socket is readable, it turns out that there is an undocumented but well-known piece of folklore: you can issue a "zero byte read", i.e., an overlapped WSARecv with a zero-byte receive buffer, and that will not complete until there is some data to be read. This has been recommended for servers that are trying to do simultaneous reads from a large number of mostly-idle sockets, in order to avoid problems with memory usage (apparently IOCP receive buffers get pinned into RAM). An example of this technique can be seen in the libuv source code. They also have an additional refinement, which is that to use this with UDP sockets, they issue a zero-byte receive with MSG_PEEK set. (This is important because without that flag, the zero-byte receive would consume a packet, truncating it to zero bytes.) MSDN claims that you can't combine MSG_PEEK with overlapped I/O, but apparently it works for them...
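
A minimal sketch of the zero-byte read, assuming the socket has already been associated with a completion port. `arm_readable_notification` is a hypothetical helper name, the sketch is Windows-only (guarded so that it compiles to nothing elsewhere), and it covers the TCP case; per the libuv refinement described above, a UDP socket would set `flags` to `MSG_PEEK` instead.

```c
#ifdef _WIN32  /* Windows-only sketch */
#include <winsock2.h>
#include <windows.h>

/* Hypothetical helper: post an overlapped WSARecv with a zero-length
 * buffer so that a completion is queued once data is available.
 * `s` must already be associated with an I/O completion port, and
 * `ov` must stay valid until its completion packet is dequeued.
 * Returns 0 if the request was accepted, else a Winsock error code. */
static int arm_readable_notification(SOCKET s, WSAOVERLAPPED *ov)
{
    static char dummy;   /* never written to: the buffer length is 0 */
    WSABUF buf;
    DWORD flags = 0;     /* assumption: TCP; use MSG_PEEK for UDP */
    int err;

    buf.len = 0;
    buf.buf = &dummy;
    ZeroMemory(ov, sizeof(*ov));

    if (WSARecv(s, &buf, 1, NULL, &flags, ov, NULL) == 0)
        return 0;        /* completed at once; by default a packet is still queued */

    err = WSAGetLastError();
    return (err == WSA_IO_PENDING) ? 0 : err;
}
#endif
```

When the completion packet for `ov` is dequeued with zero bytes transferred, the socket should have data waiting (or the peer has closed), so an ordinary non-blocking `recv` can be issued at that point and the notification re-armed afterwards.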

Of course, that's only half of an answer, because there's still the question of detecting writability.

It's possible that a similar "zero-byte send" trick would work? (Used directly for TCP, and adding the MSG_PARTIAL flag on UDP sockets, to avoid actually sending a zero-byte packet.) Experimentally I've checked that attempting to do a zero-byte send on a non-writable non-blocking TCP socket returns WSAEWOULDBLOCK, so that's a promising sign, but I haven't tried with overlapped I/O. I'll get around to it eventually and update this answer; or alternatively if someone wants to try it first and post their own consolidated answer then I'll probably accept it :-)

Nathaniel J. Smith
  • I would argue that writeability is a function of I/O size. My hunch is that as long as the tcp-send-buffer-size minus combined-size-of-incomplete-sends is above your payload, WSASend will return pending. Now, this means that you have to process completions rapidly enough so that your heuristic is up-to-date, and that you have an idea of what the send buffer size is. Enter Nagle's algorithm.... – Sergei Vorobiev Mar 07 '17 at 06:16
  • Thanks for sharing, this is a very useful trick! I too am curious to hear whether a similar trick would work for detecting writability. – tmm1 Mar 26 '18 at 22:23
  • FYI the 0-byte read reference link in your answer is no longer working, but it used to refer to Network Programming for Microsoft Windows 2nd Ed. Page 194 – tmm1 Mar 26 '18 at 23:45
  • @tmm1 Thanks for the heads up, I've replaced it with an archive.org link – Nathaniel J. Smith Mar 26 '18 at 23:50
  • I ran a test and it appears the same trick does not work for writability. I created a TCP connection, set it to non-blocking, and wrote to it with (non-overlapped) `WSASend`s until it returned `EWOULDBLOCK` (after writing ~639kb). Then I tried an overlapped `WSASend` with a zero-byte `WSABuf`: the operation returns `WSA_IO_PENDING`, but a completion event arrived immediately with `lpcbTransfer` set to 0. – tmm1 Apr 18 '18 at 20:49
  • @tmm1 really? that's a shame :-( – Nathaniel J. Smith Apr 19 '18 at 08:22