
Is increased CPU time (as reported by the `time` CLI command) indicative of inefficiency when hyperthreading is used (e.g. time spent in spinlocks or on cache misses), or is it possible that the CPU time is inflated by the odd nature of HT (e.g. real cores being busy, so HT can't kick in)?

I have a quad-core i7, and I'm testing a trivially parallelizable part (image-to-palette remapping) of an OpenMP program — with no locks, no critical sections. All threads access a bit of read-only shared memory (a look-up table), but write only to their own memory.
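The hot loop is roughly shaped like this (a minimal sketch, not the actual code; the real remapping lives in libimagequant, and the `lookup_table` layout and `lookup()` below are just stand-ins):

```c
/* Minimal sketch of the parallel section, not the real libimagequant code.
 * `lut` is a small read-only table shared by all threads; each thread
 * writes only to its own rows of `dest`. No locks, no critical sections. */
#include <stdint.h>
#include <stddef.h>

typedef struct { uint8_t entries[20 * 1024]; } lookup_table;   /* stand-in, read-only */

static inline uint8_t lookup(const lookup_table *lut, uint32_t px)
{
    return lut->entries[px % sizeof(lut->entries)];   /* stand-in for the real mapping */
}

void remap(const uint32_t *src, uint8_t *dest, int width, int height,
           const lookup_table *lut)
{
    #pragma omp parallel for
    for (int row = 0; row < height; row++) {
        const uint32_t *s = src  + (size_t)row * width;
        uint8_t        *d = dest + (size_t)row * width;
        for (int x = 0; x < width; x++)
            *d++ = lookup(lut, *s++);   /* effectively *dest++ = lookup(*src++) */
    }
}
```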

 cores real CPU
  1:   5.8  5.8
  2:   3.7  5.9
  3:   3.1  6.1
  4:   2.9  6.8
  5:   2.8  7.6
  6:   2.7  8.2
  7:   2.6  9.0
  8:   2.5  9.7

I'm concerned that the amount of CPU time used increases rapidly as the number of cores exceeds 1 or 2.

I imagine that in an ideal scenario CPU time wouldn't increase much (the same amount of work would just get distributed over multiple cores).

Does this mean that 40% of the CPU time is overhead spent on parallelizing the program?
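(To put a number on it: going from the single-threaded run to 8 threads, (9.7 − 5.8) / 9.7 ≈ 0.40, so roughly 40% of the CPU seconds in the 8-thread run are extra.)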

Kornel
  • Don't forget that Hyperthreaded cores aren't "real" cores. So it's expected for CPU time to go up. – Mysticial Mar 12 '13 at 00:56
  • I would be more concerned that the amount of CPU time **does not** increase rapidly, indicating cores being idle. In the ideal case CPU time would be `cores` times `real` (with `cores` up to `4`, before HT kicks in). Your efficiency drops to 65% with 3 cores already. See [Amdahl's law](http://en.wikipedia.org/wiki/Amdahl%27s_law) on how the non-parallel parts of a program affect its scalability (see the worked example below these comments). Also probe whether the problem is memory bound. – Hristo Iliev Mar 12 '13 at 07:04
  • It's also possible you've coded the "trivially parallelizable" section badly. It's really easy to get parallel computing wrong and end up wasting performance, resulting in suboptimal scaling. But 40% sounds absurd - I'm easily getting 95% efficiency with less scalable algorithms, so there's definitely something at play here. – Thomas Mar 14 '13 at 06:55
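For reference, a quick worked instance of Amdahl's law with the timings above (an illustration, not a fit): speedup with `n` cores is `1 / ((1 - p) + p/n)`, where `p` is the parallel fraction. The observed 4-core speedup is 5.8 / 2.9 = 2.0, which solves to `p ≈ 0.67`, i.e. the program scales as if only about two-thirds of the work were parallel, whether due to serial sections or to memory contention.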

2 Answers


Quick question - are you running the genuine time program `/usr/bin/time`, or the built-in bash command of the same name? I'm not sure that matters; they look very similar.

Looking at your table of numbers I sense that the processed data set (i.e. input plus all the output data) is reasonably large overall (bigger than L2 cache), and that the processing per data item is not that lengthy.

The numbers show a nearly linear improvement from 1 to 2 cores, but that is tailing off significantly by the time you're using 4 cores. The hyperthreaded cores are adding virtually nothing. This means that something shared is being contended for. Your program has free-running threads, so that thing can only be memory (L3 cache and main memory on the i7).

This sounds like a typical example of being I/O bound rather than compute bound, the I/O in this case being to/from L3 cache and main memory. L2 cache is 256 KB per core, so I'm guessing that the size of your input data plus one set of results and all intermediate arrays is bigger than 256 KB.

Am I near the mark?

Generally speaking, when considering how many threads to use you have to take shared cache and memory speeds and data set sizes into account. That can be a right bugger because you have to work it out at run time, which is a lot of programming effort (unless your hardware config is fixed).
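For what it's worth, a rough sketch of the run-time approach (Linux/glibc specific, since `_SC_LEVEL3_CACHE_SIZE` is a glibc extension, and the halving policy is purely illustrative):

```c
/* Illustrative only: cap the OpenMP thread count when the working set
 * clearly won't fit in the shared L3 cache. The halving policy is a
 * made-up heuristic, not a recommendation. */
#include <omp.h>
#include <unistd.h>
#include <stddef.h>

void choose_thread_count(size_t working_set_bytes)
{
    long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);   /* glibc extension; may return -1 or 0 */
    int max_threads = omp_get_max_threads();

    if (l3 > 0 && working_set_bytes > (size_t)l3) {
        /* Memory-bound case: extra threads mostly add contention. */
        omp_set_num_threads(max_threads > 1 ? max_threads / 2 : 1);
    } else {
        omp_set_num_threads(max_threads);
    }
}
```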

bazza
  • `/usr/bin/time` gives the same numbers. I like the concept, but I don't know where/if I'm hitting the limit. The [remapping](https://github.com/pornel/improved-pngquant/blob/lib/lib/libimagequant.c#L660) uses a lookup table (a jagged array, to be precise) that's ~20KB in size. The whole image may be MBs in size, but it's accessed linearly (`*dest++ = lookup(*src++)`). Can you recommend a technique/tool that could detect contention? – Kornel Mar 14 '13 at 15:58
  • Intel's VTune I think can tell you a lot about what's going on inside your program (costs money though). However I think you've answered it with "Whole image may be MB in size". That's too big for the L2 caches. You're operating on it a small chunk at a time? Well, even if each chunk fits inside L2, that cache will still have to flush the results out to L3 to make room for new result chunks. With all 4 L2 caches trying to do that all at once, the L3 becomes a bottleneck. And if you overflow L3 then it's back out to main memory. – bazza Mar 14 '13 at 17:36
  • You can improve the situation if you can think of more things to do on each chunk whilst it is in L2 / L1 cache. That is, if after the remapping you do some other computation on the image, then perhaps elements of that can be done on each chunk immediately after it has been remapped (see the sketch after these comments). This would bring the balance back from being I/O bound (your current situation) to compute bound. – bazza Mar 14 '13 at 17:50
  • If there is no more processing to be done on the image after the remap then you're kind of stuck. Adding a second CPU could help because then you'd have two L3 caches (and two memory systems, important if the image is > the L3 cache size of 8MB). Weirdly this could mean that two 2-core i5s could be faster in this situation than one i7. – bazza Mar 14 '13 at 18:00
  • In my view this is a good example of where Intel's philosophy is at odds with what you (and I in my job) are trying to do. Intel design their chips around the assumption that, generally speaking, people don't write multi-threaded applications but do run multiple single-threaded applications. It's a fair assumption on their part, and gives good performance. However, that costs people like us! You could perhaps look at CUDA / OpenCL and get the GPU to do the work for you; that'd be good fun :) They're very good at chomping through large lumps of data, doing the same thing to every part of it. – bazza Mar 14 '13 at 18:07
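A sketch of the chunk-fusion idea from the comments above (entirely hypothetical; `remap_chunk` and `next_stage_chunk` are stand-ins for the remap and for whatever processing might follow it):

```c
/* Hypothetical illustration: run a second processing stage on each chunk
 * immediately after remapping it, while the chunk is still hot in L1/L2,
 * instead of streaming the whole image through the caches twice. */
#include <stdint.h>
#include <stddef.h>

#define CHUNK 4096   /* pixels per chunk; chosen so a chunk fits comfortably in L2 */

/* Stand-ins for the real per-chunk routines. */
void remap_chunk(const uint32_t *src, uint8_t *dest, size_t n, const void *lut);
void next_stage_chunk(uint8_t *dest, size_t n);

void process_image(const uint32_t *src, uint8_t *dest, size_t n_pixels,
                   const void *lut)
{
    #pragma omp parallel for
    for (size_t start = 0; start < n_pixels; start += CHUNK) {
        size_t len = (n_pixels - start < CHUNK) ? (n_pixels - start) : CHUNK;
        remap_chunk(src + start, dest + start, len, lut);   /* stage 1 */
        next_stage_chunk(dest + start, len);                /* stage 2, data still in cache */
    }
}
```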

It's quite possibly an artefact of how CPU time is measured. A trivial example: if you run a 100 MHz CPU and a 3 GHz CPU for one second each, each will report that it ran for one second. The second CPU might do 30 times more work, but it still reports one second.

With hyperthreading, a reasonable (not quite accurate) model would be that one core can run either one task at, let's say, 2000 MHz, or two tasks at, let's say, 1200 MHz each. Running two tasks it does only 60% of the work per thread, but 120% of the work for both threads together, a 20% improvement. But if the OS asks how many seconds of CPU time were used, the first will report "1 second" after each second of real time, while the second will report "2 seconds".
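To put rough numbers on that model (using the same illustrative clock speeds): a job of W cycles takes W/2000 units of wall time, and the same in CPU time, when run single-threaded. Split across two hyperthreads at 1200 MHz each, it finishes in about W/2400 units of wall time (the 20% speedup), but both threads are billed for that interval, so the reported CPU time is 2 × W/2400 = W/1200, roughly 1.67× the single-thread figure.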

So the reported CPU time goes up. If it goes up by less than a factor of two, overall performance has improved.

gnasher729