7

Using: C++ (MinGW), Qt 4.7.4, Windows Vista, Intel Core 2 vPro

I need to process two huge files in exactly the same way, so I would like to call the processing routine from two separate threads, one per file. The GUI thread does nothing heavy; it just displays a label and runs an event loop to check for the threads' termination signals, quitting the main application accordingly. I expected this to utilize the two cores (Intel Core 2) roughly equally, but Task Manager shows one core highly utilized and the other mostly idle (though not on every run). Also, processing the two files takes much more time than processing one file. I thought it would take about the same time, or a little more, but it is almost as slow as processing the two files one after the other in a non-threaded application. Can I somehow force the threads to use the cores that I specify?

QThread* ptrThread1=new QThread;
QThread* ptrThread2=new QThread;
ProcessTimeConsuming* ptrPTC1=new ProcessTimeConsuming();
ProcessTimeConsuming* ptrPTC2=new ProcessTimeConsuming();

ptrPTC1->moveToThread(ptrThread1);
ptrPTC2->moveToThread(ptrThread2);

//make connections: start processing on QThread::started(), and specify what to do
//when processing ends, threads terminate, etc.
//display a label to show that the code is executing

ptrThread1->start();
ptrThread2->start(); //i want this thread to be executed on a core other than the one used above

ptrQApplication->exec(); //GUI event loop for label display and signal-slot monitoring
ustulation
    Are the files on separate physical hard drives? If you're trying to spin rust to read two files at once then you have to seek between them each time a different thread gets scheduled, and that part will swamp anything you might gain from the CPU. – Pete Kirkham Mar 26 '12 at 13:08
  • Are the files of roughly equal size? – Tudor Mar 26 '12 at 13:09
  • @PeteKirkham: just have 1 HDD – ustulation Mar 26 '12 at 13:10

3 Answers

17

Reading in parallel from a single mechanical disk often (and probably in your case) will not yield any performance gain, since the disk head has to physically seek to the next read location each time, effectively making your reads sequential anyway. Worse, if many threads are trying to read, performance may even degrade with respect to the sequential version, because the disk head is bounced between different locations on the disk and thus has to seek back to where it left off every time.

Generally, you cannot do better than reading the files in sequence and then processing them in parallel, perhaps using a producer-consumer model.
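The producer-consumer split can be sketched with plain standard-library threads (the same idea maps onto QThread). All names here are illustrative, and the "processing" is just a byte sum standing in for the real per-chunk work:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// One reader produces chunks sequentially (kind to the disk head);
// several workers consume them in parallel (uses both cores).
struct ChunkQueue {
    std::queue<std::string> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void push(std::string chunk) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(chunk)); }
        cv.notify_one();
    }
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return !q.empty() || done; });
        if (q.empty()) return false;          // drained and no more coming
        out = std::move(q.front()); q.pop();
        return true;
    }
    void finish() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
};

long long processAll(const std::vector<std::string>& chunks, unsigned nWorkers) {
    ChunkQueue queue;
    std::mutex sumMutex;
    long long total = 0;

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < nWorkers; ++i)
        workers.emplace_back([&]{
            std::string chunk;
            while (queue.pop(chunk)) {
                long long s = 0;              // stand-in for heavy per-chunk work
                for (unsigned char c : chunk) s += c;
                std::lock_guard<std::mutex> lk(sumMutex);
                total += s;
            }
        });

    for (const auto& c : chunks) queue.push(c);  // the sequential "disk read"
    queue.finish();
    for (auto& w : workers) w.join();
    return total;
}
```

The single producer keeps the read pattern sequential; only the CPU-bound part runs on both cores.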

Tudor
  • I see. Can you however tell me how to force a thread onto a core of my choice? – ustulation Mar 26 '12 at 13:15
  • @ustulation: `qthread` does not provide such an affinity API. Furthermore, you almost never need to set the affinity to separate cores because the scheduler will set the thread-to-CPU mapping as best as possible. Anyway, if you really need to, see this post about using calls to the pthread library to achieve this: http://qt-project.org/faq/answer/setting_thread_processor_affinity_in_linux – Tudor Mar 26 '12 at 13:22
  • Thanks. I guess I need to stick with sequential processing, because the file size is prohibitive for reading into RAM and then processing. On a separate note, I have never used `boost threads`; could you give me a hint on whether it has this feature (a core-affinity API), so that I can delve into it when I need one? – ustulation Mar 26 '12 at 13:34
  • @ustulation: I think you didn't understand the answer. **You don't need threads.** Therefore, you don't need thread affinity either. If you need more performance, buy an SSD instead. – MSalters Mar 26 '12 at 14:17
  • @ustulation: As far as I know boost does not offer this functionality either. In fact, higher-level threading libraries (boost, tbb, qthread) tend to avoid this kind of low level mechanisms because it creates portability problems. If you need explicit affinity on linux just use classic pthreads. – Tudor Mar 26 '12 at 14:21
  • @MSalters: you didn't understand me either...i said **on a separate note** – ustulation Mar 26 '12 at 14:27
  • -1: You can't say for sure. See http://en.wikipedia.org/wiki/I/O_scheduling and e.g. http://en.wikipedia.org/wiki/Elevator_algorithm . See also http://en.wikipedia.org/wiki/Fragmentation_%28computer%29#Data_fragmentation . So having multiple threads to read "huge files" (sic) _can_ actually _increase_ performance, never mind if you are on a multicore CPU or not. – Sebastian Mach Mar 26 '12 at 14:44
  • @phresnel: wow..interesting read..but am really not experiencing this in terms of speed gains (wonder if my OS actually does this) – ustulation Mar 26 '12 at 14:58
  • @phresnel: I wouldn't be so fast to downvote. The reading you linked does not suggest that it is possible to do better than the sequential time. What the Elevator algorithm does is group the reads in the direction of the head movement, which gets you "close" to the sequential read pattern, and thus to the speed of a sequential read. Essentially it tries to reduce the degradation from concurrent reads, but cannot do better than a single sequential read. – Tudor Mar 26 '12 at 15:27
  • @phresnel: Assume two threads emit reads, each for half the data that a single thread would emit. These reads would be scheduled according to the Elevator algorithm in order to optimize the disk seek time, but you can easily see that they cannot beat a single sequential read, that is by definition optimal. – Tudor Mar 26 '12 at 15:31
  • In fact, if each thread would read from a separate section of the disk then if the disk starts seeking to service one of them, it will not return on the other side, since other incoming reads will only be serviced in the "forward" direction until the head reaches the edge. This basically leads to both threads having their reads serviced one after the other. – Tudor Mar 26 '12 at 15:38
  • @Tudor: You are not wrong per se, and I am not right per se (tho I wrote 'can't say for sure'). Imagine you have huge files A and B. These are spread all along the hard disk. Process 1 read A, process 2 reads B. Now, process 1 reads a chunk of A that resides not far away off a B-chunk next to be read by process 2. The next A-chunk is far away. Strictly singe-threaded would mean that process 1 now has to wait a few moments until the disk is ready for the next chunk. And after the end of this, process 2 can read B, with the same disk-waits ... – Sebastian Mach Mar 26 '12 at 15:42
  • @phresnel: It probably depends on the disk fragmentation level. If the files are mostly continuous on disk, the a sequential read is best, but if the file is very fragmented, I agree that things are not so easy to evaluate. – Tudor Mar 26 '12 at 15:44
  • @Tudor: ... but multi-threaded, while process 1 has to wait anyways, process 2 can read the nearby chunk now, at the same time advancing the read position further. Of course you are perfectly right that if the disk _can physically be read sequentially_, performance _will_ suffer, but often, huge files are fragmented. While you might not get full utilisation of multiple cores, you might get even less utilisation with a single thread. But however, this is all very vague, and depends on stars and magnetic fields and other lucky circumstances. – Sebastian Mach Mar 26 '12 at 15:45
  • @Tudor: As a final note: I am not really downvoting your explanation, but rather the blanketness of the very first phrase ;) And of course, if there's heavy CPU-side crunching between reading the chunks, multithreading wins again some meters. (Related to the latter: It is valuable to use more compiler processes for building than you have physical cores; for some programs this is `make -j8`, even tho I employ a quadcore) – Sebastian Mach Mar 26 '12 at 15:47
  • @phresnel: In fact, I was pondering while writing it that probably I've been a bit too harsh in saying that it definitely won't bring any gain. Probably I should relax it to something like "probably won't", to accommodate for exceptions like what you mentioned. In the absence of the OP's method code, this is probably what everyone assumes right away. :) – Tudor Mar 26 '12 at 15:48
  • @phresnel: Thanks. It's good that you brought this point up though. It's always worth to consider multiple possibilities. – Tudor Mar 26 '12 at 16:08
  • @Tudor: Upon discussion with you, I had some enlightenments too; and even tho (in retrospect) I sometimes sound a bit harsh, I never mean it that way :) – Sebastian Mach Mar 27 '12 at 05:43
  • @phresnel: It's ok, I didn't take it hard. I learned some things too. :) – Tudor Mar 27 '12 at 06:41
  • Of course you can "do better" in the sense of using less memory than reading everything sequentially if the files are big. You can and should explicitly control the overhead of seeking between files - implement your own round-robin reader. It's not magic; the physical constraints are rather simple. – Kuba hasn't forgotten Monica Aug 22 '13 at 00:18
2

With mechanical hard drives, you need to explicitly control the ratio of time spent doing sequential reads to time spent seeking. The canonical way of doing it is with n+m objects running on m+min(n, QThread::idealThreadCount()) threads, where m is the number of hard drives the files are on and n is the number of files.

  • Each of the m objects reads files from a given hard drive in round-robin fashion. Each read must be sufficiently large. On a modern hard drive, let's budget 70 MB/s of bandwidth (you can benchmark the real value) and 5 ms per seek. To waste at most 10% of the bandwidth on seeking, you can spend only 100 ms per second seeking, i.e. 100ms/(5ms per seek) = 20 seeks per second. Thus you must read at least 70MB/(20+1) ≈ 3.3 MB from each file before reading from the next one. This thread fills a buffer with file data, and the buffer then signals the relevant computation object attached to its other side. When a buffer is busy, you simply skip reading from the given file until the buffer becomes available again.

  • The other n objects are computation objects, they perform a computation upon a signal from a buffer that indicates the buffer is full. As soon as the buffer data is not needed anymore, the buffer is "reset" so that the file reader can refill it.
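The chunk-size arithmetic above can be written out as a small helper. The 70 MB/s and 5 ms figures are example assumptions; benchmark the actual drive:

```cpp
// Minimum bytes to read per file before seeking to the next one, given the
// drive's sequential bandwidth, its seek time, and the fraction of each
// second you are willing to lose to seeking.
double minChunkBytes(double bandwidthBytesPerSec, double seekSec, double overheadBudget) {
    // Seeking may consume at most `overheadBudget` of each second:
    double seeksPerSec = overheadBudget / seekSec;      // e.g. 0.10 / 0.005 = 20
    // The bandwidth is split across (seeks + 1) sequential runs per second:
    return bandwidthBytesPerSec / (seeksPerSec + 1.0);  // e.g. 70e6 / 21 ≈ 3.3 MB
}
```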

All reader objects need their own threads. The computation objects can be distributed among their own threads in a round-robin fashion, so that the threads all have within +1, -0 objects of each other.
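A minimal, non-Qt sketch of the round-robin reading part, assuming one drive and in-memory buffers; the `FileSlot` name and the collect-into-a-string stand-in for the computation objects are illustrative:

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read a large chunk from each file in turn, so the head does long
// sequential runs instead of thrashing between files on every small read.
struct FileSlot {
    std::ifstream in;
    std::string data;        // stand-in for the buffer/computation side
    bool exhausted = false;
};

void readRoundRobin(std::vector<FileSlot>& slots, std::size_t chunkBytes) {
    std::vector<char> buf(chunkBytes);
    bool anyLeft = true;
    while (anyLeft) {
        anyLeft = false;
        for (auto& s : slots) {
            if (s.exhausted) continue;
            s.in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
            std::streamsize got = s.in.gcount();
            if (got > 0)
                s.data.append(buf.data(), static_cast<std::size_t>(got));
            if (got < static_cast<std::streamsize>(buf.size()))
                s.exhausted = true;   // short read: end of this file
            else
                anyLeft = true;
        }
    }
}
```

In the full design, each appended chunk would instead fill a buffer that signals its computation object, and a busy buffer's file would be skipped on that round.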

Kuba hasn't forgotten Monica
1

I thought my empirical data might be of some use to this discussion. I have a directory with 980 text files that I would like to read. In the Qt/C++ framework, running on an Intel i5 quad core, I created a GUI application and added a worker class that reads a file given its path. I pushed a worker onto a thread, then repeated the run, adding one more thread each time. I timed roughly 13 minutes with 1 thread, 9 minutes with 2, and 8 minutes with 3. So in my case there was some benefit, but the returns diminished quickly.

Kevin White
  • Any system that lets a single thread exhaust its read/write capacity will be unstable - or at the least unresponsive to a user. Unless you go out of your way, threads effectively have their IOs throttled. What you did was demand that your program have more priority by launching two threads. – Mikhail May 16 '13 at 23:14
  • Everything depends on the size of the files. As a rule of thumb on mechanical hard drives, if you want the overheads to be under 10%, you must read a couple megabytes at a time. So if the files are smaller than, say, 2Mbytes, you read them in the entirety. If they are bigger, then you can round-robin between files to keep more computation threads busy. – Kuba hasn't forgotten Monica Aug 22 '13 at 00:20