Long version of the question:
When there are more blocking threads than CPU cores, where is the balance between thread count and per-thread blocking time that maximizes CPU efficiency by keeping context-switch overhead low?
I have a wide variety of I/O devices that I need to control on Windows 7, on an x64 multi-core processor: PCI devices, network devices, data being saved to hard drives, big chunks of data being copied, and so on. The most common policy is: "Put a thread on it!". Several dozen threads later, this is starting to feel like a bad idea.
None of my cores is at 100%, and several cores are still idle, yet I'm seeing delays in the range of 10 to 100 ms that cannot be explained by I/O blocking or CPU-intensive work. Other processes don't seem to be competing for resources either. I suspect context-switch overhead.
Here are some possible solutions I'm considering:
- Reduce the thread count by bundling work for the same I/O device: this mainly applies to the hard drive, but maybe to the network as well. If I'm saving 20 MB to the hard drive in one thread and 10 MB in another, wouldn't it be better to post both writes to the same thread? How would this work in the case of multiple hard drives?
- Reduce the thread count by bundling similar I/O devices and raising the priority of the combined thread: dozens of threads with increased priority would probably make my user-interface thread stutter, but I can bundle all that functionality into one or a few threads and raise their priority.
Any case studies tackling similar problems are much appreciated.