I'm looking into a performance problem in an application that receives large volumes of incoming data for further processing (multicast transport streams, to be specific).
The situation is as follows: a number of multicast streams are added. Each has its own receiver thread that reads data from the socket and copies it into a ring buffer; it does nothing beyond that.
At around 500 to 600 Mbit/s of aggregate traffic, one particular CPU core reaches 100%. In fact, as I initialize the streams and the Ethernet traffic ramps up, I can watch that core's load climb almost linearly towards saturation.
The socket code uses the WSA overlapped API. Even when I reduce the threads to doing only the receive (i.e. no copying to the ring buffer, which in turn drops the host application's load to near zero), I can still readily drive that particular core into the red. Also interesting: the load stays on that same core even if I restrict the process, via the affinity settings, to four completely different cores. That led me to conclude the time is being spent somewhere at the OS or driver level.
I've tried gathering n datagrams at once before copying (i.e. requesting more than one 1500-byte MTU worth of data per receive), but that only made matters worse. I've also checked that my socket is properly configured (non-blocking, and all return values are okay).
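For reference, the per-stream setup looks roughly like this (a boiled-down sketch rather than the actual source; JoinMulticast is just an illustrative name, the SO_RCVBUF size is a guess, and error handling plus WSAStartup are omitted):

```cpp
#include <winsock2.h>
#include <ws2tcpip.h>
#pragma comment(lib, "ws2_32.lib")

// Create an overlapped UDP socket bound to 'port' and join the given group.
SOCKET JoinMulticast(const char* group, unsigned short port)
{
    SOCKET s = WSASocket(AF_INET, SOCK_DGRAM, IPPROTO_UDP,
                         nullptr, 0, WSA_FLAG_OVERLAPPED);

    sockaddr_in local = {};
    local.sin_family      = AF_INET;
    local.sin_port        = htons(port);
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(s, reinterpret_cast<sockaddr*>(&local), sizeof(local));

    // A generous receive buffer to absorb bursts before the stack drops packets.
    int rcvbuf = 4 * 1024 * 1024;
    setsockopt(s, SOL_SOCKET, SO_RCVBUF,
               reinterpret_cast<const char*>(&rcvbuf), sizeof(rcvbuf));

    ip_mreq mreq = {};
    inet_pton(AF_INET, group, &mreq.imr_multiaddr);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP,
               reinterpret_cast<const char*>(&mreq), sizeof(mreq));
    return s;
}
```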
I'd like to know if anyone can shed some light on this, has perhaps run into the same problem, or has useful insights on how to handle these volumes of traffic efficiently on Windows.
(NIC I'm using: Intel PRO PT1000)
UPDATE
I've set up a little test application with only one goal: receive incoming UDP from an arbitrary number of multicast streams. I'm doing this with the I/O completion port strategy Len suggested. I can now easily pull around 1 Gbit/s from 28 multicasts at a fair CPU load (after all, for now I'm not doing anything with the packets), but with more, smaller-bandwidth multicasts, typically above 70 on this machine, the throughput gets worse and worse, and the worker threads seem unbalanced, spending most of their time waiting.
The NIC interrupt load is not the limiting factor right now (it was before).
I'm quite new to this kind of multithreaded networking code. The worker threads do nothing more than wait on the I/O completion port (GetQueuedCompletionStatusEx() with an INFINITE timeout); when a read completes on a stream, I immediately issue another one and loop. If I can get a few more datagrams from that same stream synchronously, I take those without generating new completion events (FILE_SKIP_COMPLETION_PORT_ON_SUCCESS), as sketched below.
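Boiled down, the loop is shaped like this (a simplified sketch of what I just described, not the exact source; ConsumeDatagram stands in for the ring-buffer hand-off, and error paths other than WSA_IO_PENDING are glossed over):

```cpp
#include <winsock2.h>
#include <windows.h>

// Per-stream state. OVERLAPPED is the first member so the completion
// entry's lpOverlapped can be cast straight back to the context.
struct StreamCtx
{
    OVERLAPPED ov;
    WSABUF     buf;
    char       data[1500];
    SOCKET     sock;
};

void ConsumeDatagram(StreamCtx* ctx, DWORD bytes); // placeholder for the real work

// Done once per socket after creation:
//   CreateIoCompletionPort((HANDLE)ctx->sock, iocp, (ULONG_PTR)ctx, 0);
//   SetFileCompletionNotificationModes((HANDLE)ctx->sock,
//                                      FILE_SKIP_COMPLETION_PORT_ON_SUCCESS);

// Re-arm the stream. With skip-on-success set, a WSARecv that completes
// synchronously returns 0 and queues no completion packet, so we consume
// the data in-line and loop until the call actually goes pending.
void RearmStream(StreamCtx* ctx)
{
    for (;;)
    {
        ZeroMemory(&ctx->ov, sizeof(ctx->ov));
        ctx->buf.buf = ctx->data;
        ctx->buf.len = sizeof(ctx->data);
        DWORD bytes = 0, flags = 0;
        if (WSARecv(ctx->sock, &ctx->buf, 1, &bytes, &flags, &ctx->ov, nullptr) == 0)
        {
            ConsumeDatagram(ctx, bytes);   // completed synchronously, go again
            continue;
        }
        break;                             // WSA_IO_PENDING: the port signals us later
    }
}

// Worker thread body: block on the port, then process and re-arm each stream.
void IocpWorker(HANDLE iocp)
{
    OVERLAPPED_ENTRY entries[64];
    for (;;)
    {
        ULONG n = 0;
        if (!GetQueuedCompletionStatusEx(iocp, entries, 64, &n, INFINITE, FALSE))
            continue;
        for (ULONG i = 0; i < n; ++i)
        {
            StreamCtx* ctx = reinterpret_cast<StreamCtx*>(entries[i].lpOverlapped);
            ConsumeDatagram(ctx, entries[i].dwNumberOfBytesTransferred);
            RearmStream(ctx);
        }
    }
}
```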
I've got as many worker threads as I have CPU cores (anything noticeably above that makes things worse).
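The pool setup itself is trivial (again a sketch, assuming the IocpWorker loop from above):

```cpp
#include <windows.h>
#include <thread>
#include <vector>

void StartWorkers()
{
    unsigned cores = std::thread::hardware_concurrency();
    // The port's concurrency limit and the thread count both match the core count.
    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, cores);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < cores; ++i)
        workers.emplace_back(IocpWorker, iocp);
    for (auto& t : workers)
        t.join();   // in the test app they simply run until the process exits
}
```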
I didn't think this warranted a new question, but as before, any help is much appreciated!
Here's the source of my test app (C++), which should be readable :-) http://pastebin.com/xWEPPbi6