
I'm looking into a problem with an application that ingests large volumes of incoming data (multicast transport streams, to be specific) for further processing.

The situation is as follows: a number of multicast streams are added. Each has its own receiver thread that receives data from the socket and copies it into a ring buffer. It does nothing more than that.
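
The receive-then-copy stage described above can be sketched as a minimal single-producer/single-consumer ring buffer. This is portable C++ for illustration only (the class name and slot size are mine, not from the linked source); the real app would feed it from the WSA receive path:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal single-producer/single-consumer ring of fixed-size datagram
// slots. One receiver thread pushes, one consumer pops; the acquire/
// release pairs on the indices make the handoff safe without locks.
class DatagramRing {
public:
    static constexpr size_t kSlot = 1500;   // one MTU-sized slot per datagram

    explicit DatagramRing(size_t slots)     // `slots` must be a power of two
        : mask_(slots - 1), buf_(slots * kSlot), len_(slots) {}

    // Called by the receiver thread; `len` must be <= kSlot.
    bool push(const uint8_t* data, size_t len) {
        const size_t head = head_.load(std::memory_order_relaxed);
        if (head - tail_.load(std::memory_order_acquire) > mask_)
            return false;                    // ring is full, drop or retry
        std::copy(data, data + len, &buf_[(head & mask_) * kSlot]);
        len_[head & mask_] = len;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Called by the consumer thread; copies the oldest datagram out.
    bool pop(std::vector<uint8_t>& out) {
        const size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                    // ring is empty
        const uint8_t* slot = &buf_[(tail & mask_) * kSlot];
        out.assign(slot, slot + len_[tail & mask_]);
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    std::atomic<size_t> head_{0}, tail_{0};
    const size_t mask_;
    std::vector<uint8_t> buf_;
    std::vector<size_t> len_;
};
```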

At around 500 to 600 Mbit/s, one particular CPU core reaches 100%. In fact, as I initialize the streams and the Ethernet traffic increases, I can watch that load climb almost linearly.

The socket code uses the WSA overlapped API. Even when I reduce the threads to doing only the receive (i.e. not copying to the ring buffer, which in turn reduces the host application's load to near zero), I can readily sink that particular core into the red. Also interesting: the load stays on that particular core even if I restrict the process, via the affinity settings, to 4 completely different cores. That led me to conclude the time is being spent somewhere at the OS or driver level.

I've tried gathering n datagrams at once before copying (i.e. reading more than the 1500-byte MTU per call), but that only made matters worse. I've also checked that my socket is properly configured (non-blocking; return values are all okay).

I'd like to know if anyone can tell me something about this. Perhaps you've had this problem yourself, or have some useful insights on how to handle these volumes of traffic efficiently on Windows.

(NIC I'm using: Intel PRO PT1000)

UPDATE

I've set up a little test application with only one goal: fetch incoming UDP from an arbitrary number of multicasts. I'm doing this with the I/O completion port strategy, as Len suggested. I can now easily pull around 1 Gbit from 28 multicasts with a fair CPU load (after all, for now I'm not doing anything with the packets). But when using more (smaller-bandwidth) multicasts, typically above 70 on this machine, the throughput gets worse and worse, and the worker threads seem unbalanced and spend most of their time waiting.

The NIC interrupt load is not the limiting factor right now (it was before).

I'm quite new to this material (multithreaded networking). The worker threads do nothing more than wait on the I/O completion port (GetQueuedCompletionStatusEx() with an INFINITE timeout); when a stream read completes, I immediately issue another one and loop (and if I can grab a few more completions on that same stream synchronously, I take those without issuing new I/O events, via FILE_SKIP_COMPLETION_PORT_ON_SUCCESS).

I've got as many worker threads as I have CPU cores (anything (far) over that makes things worse).

Didn't think this warranted a new question - but again, any help much appreciated!

Here's the source of my test app. (C++) - should be readable :-) http://pastebin.com/xWEPPbi6

nielsj
  • Did you expect your cpu to stay idle when creating new threads? – Kevin Jan 08 '14 at 15:39
  • Not exactly (in fact I'd like to further reduce it to a fully async. system) but fact of the matter is that the CPU, nay 1-core, load is caused by the WSA receive call(s). Not by 40 threads. – nielsj Jan 08 '14 at 15:43
  • That said, I can't find much conclusive information on it, but I do suspect there's a penalty to abusing Winsock from a "myriad" of threads instead of the approach I finally want anyway: a decent number of workers asynchronously processing the IO for n streams. But that's easier said than done (legacy, the application's design..) so I'll be trying that in a small testbed first. And also tomorrow :) – nielsj Jan 08 '14 at 15:52

2 Answers

  1. Take a look at your system with the SysInternals Process Explorer tool and see where this CPU time is going. It may well be allocated to "Interrupts", in which case it's the CPU that deals with the NIC interrupts. If so, look at your NIC driver and see if you can enable or adjust interrupt coalescing, so that the NIC generates fewer interrupts for the same number of datagrams.

  2. See if you can offload datagram checksum calculation to the NIC; if it isn't already offloaded, CPU time on your computer is being spent on it. Note that there may be potential issues with non-paged pool usage if the NIC can't keep up and the driver never throws any datagrams away (see this blog posting of mine).

  3. Switch to using GetQueuedCompletionStatusEx(). You say you're using the "WSA overlapped API"; hopefully you mean the I/O completion port method. If so, GetQueuedCompletionStatusEx() will allow you to retrieve more datagrams with fewer system calls.

  4. Switch to using the RIO API (see here for an introduction to the Windows Registered I/O Network Extensions). This continues the theme of point 3 and provides more performance for getting the datagrams into your code.
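
Points 1 and 2 can be checked from PowerShell on Windows 8 / Server 2012 and later. This is only a sketch: the adapter name "Ethernet" is a placeholder, and the exact DisplayName/DisplayValue strings for interrupt moderation vary by driver:

```shell
# List driver-level tunables (interrupt moderation, offloads, RSS, ...)
Get-NetAdapterAdvancedProperty -Name "Ethernet"

# Check whether checksum calculation is already offloaded to the NIC
Get-NetAdapterChecksumOffload -Name "Ethernet"

# Enable interrupt moderation so the NIC coalesces interrupts
# (DisplayName/DisplayValue strings here are driver-specific)
Set-NetAdapterAdvancedProperty -Name "Ethernet" `
    -DisplayName "Interrupt Moderation" -DisplayValue "Enabled"
```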

Updated to reflect question update:

  1. Issue multiple reads to start with to get a good backlog of pending reads. So, for example, have 100 pending reads and only then start issuing new ones (a little more complex if you are using "skip completion port" processing, but the idea is to build up a backlog).

  2. Retrieve multiple completions per call from GetQueuedCompletionStatusEx(), or there's no point in using it.

  3. Avoid recursion when you get "inline" completions; prefer to loop. Otherwise you're chewing up stack.
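
The three points above can be sketched together. The following is a portable mock (a plain queue stands in for the completion port; the names are illustrative, not from the poster's code), but the control flow mirrors the real GetQueuedCompletionStatusEx() worker: completions come out in batches, each handled completion is matched by a re-issued read so the pending backlog never shrinks, and further completions are handled by looping rather than recursing:

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Mock completion source standing in for an I/O completion port; each
// int represents one completed datagram read.
struct MockPort {
    std::deque<int> completions;

    // Dequeue up to `max_entries` completions in one call, the way
    // GetQueuedCompletionStatusEx() returns an array of entries.
    size_t dequeue_batch(std::vector<int>& out, size_t max_entries) {
        out.clear();
        while (!completions.empty() && out.size() < max_entries) {
            out.push_back(completions.front());
            completions.pop_front();
        }
        return out.size();
    }
};

// One worker pass: drain completions in batches and handle each one in a
// flat loop. In the real worker, each handled completion is immediately
// followed by a new WSARecvFrom() so the backlog of pending reads (seeded
// to, say, 100 per stream before the workers start) never shrinks; inline
// completions from FILE_SKIP_COMPLETION_PORT_ON_SUCCESS are fed back into
// this same loop instead of being handled recursively.
size_t worker_pass(MockPort& port, size_t batch_max) {
    std::vector<int> batch;
    size_t handled = 0;
    while (port.dequeue_batch(batch, batch_max) > 0) {  // loop, never recurse
        for (int datagram : batch) {
            (void)datagram;  // real code: hand the payload to the ring buffer
            ++handled;       // real code: re-issue the read here
        }
    }
    return handled;
}
```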

Len Holgate
  • Thanks, good pointers! I verified that indeed that core was being used by interrupts. Checksum calc. was already offloaded to the NIC, disabling it (to CPU) gave a more distributed load, yet of course became problematic at about the same critical data rate. Can't use RIO (have to support older Windows versions), but will look into the completion ports plus a more sane threading model. – nielsj Jan 09 '14 at 12:52
  • You may also get some mileage out of using NIC teaming, a Smart Switch that can load balance across the multiple NICs, and then setting the NIC drivers to tie their interrupts to different processors (not tried this, should be possible). – Len Holgate Jan 09 '14 at 14:10
  • I've updated, or expanded if you will, the question. Any further pointers would be much appreciated. In the meanwhile I'll look into NIC teaming. – nielsj Jan 15 '14 at 15:09
  • Thanks! I teamed 2 PT1000s in a shuttle, but the interrupt load balancing doesn't seem to take nor is very configurable. Put a proper HP server in place for further testing tomorrow. I'll implement the code tips. Couldn't be happier with this kind of help. (oh and BTW, I had more events coming in via the Ex function but somehow it didn't seem to "help", that's why it is 1 right now, will tweak that back -- the problem is obviously elsewhere) – nielsj Jan 15 '14 at 19:45
  • Can't say I've had much success with NIC teaming and interrupt binding either, but then I haven't tried very hard (see this: http://blogs.technet.com/b/winserverperformance/archive/2008/03/18/networking-adapter-performance-guidelines.aspx for details of what should be possible). Make sure you always have more reads pending to try and keep more of your IOCP threads busy... – Len Holgate Jan 15 '14 at 20:44

There are several things to check for high UDP multicast receive rates.

  1. See if your NIC supports RSS (receive side scaling) by checking the NIC properties in the network control panel, or with PowerShell's Get-NetAdapterAdvancedProperty. If not, get a NIC that does, and configure some number of RSS queues, where the number is > 1 and <= the number of physical cores (not hyperthreaded cores). This will distribute the network processing in the kernel among multiple cores. You wrote that one core is pinned; if it is pinned in DPC time (check perfmon's Processor Information "% DPC Time" counter), then you need to use RSS.
  2. Make sure you have enabled the maximum number of receive descriptors the NIC provides. The default value is too low for high speed multicast.
  3. Make sure your receive socket buffer size is large enough. If it isn't, you'll lose datagrams under heavy load due to insufficient buffering. Depending on the receive volume and your program's ability to handle it, tens of MB may be needed.
  4. If you are running Windows Server 2012 or Windows 8, look at the new performance counters, Microsoft BSP/Datagrams Dropped and Datagrams Dropped per second. If these are increasing under heavy receive load, you either need more socket buffering, or your program needs to run faster. Either way, you're dropping receive traffic.
  5. If you are using a thread pool (either your own or one the O/S provides) be aware that this can re-order arrivals (bad for financial markets data), and you have to construct your program and thread model to avoid that.
  6. Using GetQueuedCompletionStatusEx really doesn't help much for this kind of application. If you want to get multiple completions in a single fast call, that is what RIO was designed to do.

This information should allow you to process lots of multicast traffic. We have many customers who receive full market data, day in/day out, without difficulty using regular sockets and an arrangement like this, on Windows Server 2008 R2 and later. Windows Server 2012 can handle multicast traffic significantly faster, and if you add RIO into your program, faster still. If you are running very old server versions (e.g. Windows Server 2003), you're not going to get as good performance. Part of that has to do with the evolution of the underlying hardware platform: older platforms lacked things like message-signalled interrupts, which are what allowed us to get large multicast receive scaling gains using RSS.

I hope this helps.

Ed Briggs Microsoft Corp
