6

Background

I have an EP (Embarrassingly Parallel) C application running four threads on my laptop, which has an Intel i5 M 480 running at 2.67 GHz. This CPU has two hyperthreaded cores.

The four threads execute the same code on different subsets of data. The code and data fit in a few cache lines (entirely in L1 with room to spare). The code contains no divisions, is essentially CPU-bound, uses all available registers, and does a few memory accesses (outside L1) to write results on completion of each sequence.

The compiler is mingw64 4.8.1, i.e. fairly recent. The best basic optimization level appears to be -O1, which results in four threads completing faster than two. -O2 and higher run slower (two threads then complete faster than four, but still slower than with -O1), as does -Os. Every thread on average completes 3.37 million sequences per second, which comes out to about 780 clock cycles each. On average every sequence performs 25.5 sub-operations, or one per 30.6 cycles.

So what two hyperthreads do in parallel in 30.6 cycles, one thread will do sequentially in 35-40 cycles, or 17.5-20 cycles each.
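Purely as a sanity check on the arithmetic, here is a throwaway C snippet that redoes the back-of-the-envelope numbers from the figures quoted above (the constants come from the question, not from new measurements):

    #include <stdio.h>

    int main(void)
    {
        /* figures quoted in the question (approximate) */
        const double clock_hz     = 2.67e9;  /* i5 M 480 nominal clock       */
        const double seqs_per_sec = 3.37e6;  /* sequences per thread per sec */
        const double sub_ops      = 25.5;    /* sub-operations per sequence  */

        double cycles_per_seq = clock_hz / seqs_per_sec;  /* roughly 790 */
        double cycles_per_op  = cycles_per_seq / sub_ops; /* roughly 31  */

        printf("cycles per sequence: %.0f\n", cycles_per_seq);
        printf("cycles per sub-op:   %.1f\n", cycles_per_op);
        return 0;
    }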

Where I am

I think what I need is generated code which isn't so dense/efficient that the two hyperthreads constantly collide over the local CPU's resources.

These switches work fairly well (when compiling module by module)

-O1 -m64 -mthreads -g -Wall -c -fschedule-insns

as do these when compiling one module which #includes all the others

-O1 -m64 -mthreads -fschedule-insns -march=native -g -Wall -c -fwhole-program

There is no discernible performance difference between the two.
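For reference, the "one module which #includes all the others" build looks roughly like this (the file and module names are hypothetical, the flags are the ones above):

    /* whole.c - single translation unit so -fwhole-program can see everything */
    #include "module_a.c"
    #include "module_b.c"
    #include "main.c"

compiled with

    gcc -O1 -m64 -mthreads -fschedule-insns -march=native -g -Wall -c -fwhole-program whole.c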

Question

Has anyone experimented with this and achieved good results?

Olof Forshell
  • 3,169
  • 22
  • 28
  • There is no single answer to this. As you yourself noticed, the higher optimization (`-O2`) gives worse performance *in your case* than the lower optimization (`-O1`). (Don't forget to check e.g. `-O3` as well.) It depends very much on your code and your use-cases. You simply have to experiment and benchmark. – Some programmer dude Apr 09 '14 at 09:46
  • I've experimented and benchmarked A LOT :-) It's more that I'm wondering whether I've overlooked anything obvious or obscure. – Olof Forshell Apr 09 '14 at 10:44
  • Can you post some code and show how you're profiling your code? – Z boson Apr 09 '14 at 14:53
  • Are you using OpenMP? – Z boson Apr 09 '14 at 14:56
  • @Z boson: when I profiled on my other laptop I used AMD's CodeAnalyst. My program runs on Windows 7 and uses the thread functions there without wrappers/frameworks. – Olof Forshell Apr 11 '14 at 07:23

4 Answers

1

You say "I think what I need is generated code which isn't so dense/efficient that the two hyperthreads constantly collide over the local CPU's resources.". That's rather misguided.

Your CPU has a certain amount of resources. Code will be able to use some of the resources, but usually not all. Hyperthreading means you have two threads capable of using the resources, so a higher percentage of these resources will be used.

What you want is to maximise the percentage of resources that are used. Efficient code will use these resources more efficiently in the first place, and adding hyper threading can only help. You won't get that much of a speedup through hyper threading, but that is because you already got the speedup in the single-threaded code by making it more efficient. If you want bragging rights that hyper threading gave you a big speedup, sure, start with inefficient code. If you want maximum speed, start with efficient code.

Now if your code was limited by latencies, it means it could perform quite a few useless instructions without penalty. With hyper threading, these useless instructions actually cost. So for hyper threading, you want to minimise the number of instructions, especially those that were hidden by latencies and had no visible cost in single threaded code.

gnasher729
  • 51,477
  • 5
  • 75
  • 98
  • I added some additional statistics. What I meant when saying "what I need" is generated code so that two identical code sequences eventually achieve some sort of execution interleaving so that they collide to a lesser extent when accessing the CPU's resources. – Olof Forshell Apr 09 '14 at 09:54
  • Hyperthreading _is_ a mechanism to optimize resource usage when your compiler doesn't do a good job regarding pipeline utilization with in-order execution, i.e. if one thread needs to wait (resulting in a pipeline stall) because of a cache miss. That is, it will be most efficient with optimization levels that don't do aggressive instruction reordering (which tries to prevent this). If your compiler already does a good job of avoiding pipeline stalls (which also depends on the nature of the processing job), hyperthreading will probably not help much. – mfro Apr 09 '14 at 10:05
  • *What you want is to maximise the percentage of resources that are used* I beg to differ. What OP wants, surely, is to maximise the speed of computation. If speed can be boosted at the expense of efficiency, so what ? – High Performance Mark Apr 09 '14 at 10:08
  • If a single thread uses _all_ available resources, then hyper threading doesn't speed it up, but that's because the CPU already does all the work it can, so that's not a bad thing. – gnasher729 Apr 09 '14 at 10:11
  • @HighPerformanceMark: If you execute n instructions, then the percentage of resources used will directly improve the speed. Instruction scheduling is meant to improve speed by increasing resource usage. The OP asked for less efficient code (low resource usage) so that hyper threading can improve resource usage, but that would just get a higher speedup from hyper threading because the single threaded code runs slower than needed. – gnasher729 Apr 09 '14 at 10:15
  • @mfro: Code usually has unavoidable stalls. For example, if you read a 100 MB array, that will take some time no matter what you do with the data; if you add floating-point numbers in a loop, that loop cannot run faster than the latency of a floating-point add. Badly optimised code will have many unnecessary instructions that don't hurt as long as you don't exceed these latencies. But with hyper threaded code, you execute twice as much code because you have two threads, so suddenly these unnecessary instructions hurt. Scheduling is mostly done by the processor. – gnasher729 Apr 09 '14 at 10:19
  • @gnasher729: it may seem a pointless exercise to optimize for hyper-threading but that machine is what I have to work with. I have a slower, 2-core AMD laptop which is of little use in this context. I'm trying to find a development kit with a many-core CPU to spread the work further but no luck yet. – Olof Forshell Apr 09 '14 at 10:28
1

You could try locking each thread to a core using processor affinity. I've heard this can give a 15%-50% efficiency improvement with some code. The saving is that when a context switch happens, less changes in the caches etc. This will work better on a machine that is only running your app.
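On Windows (which the question targets), per-thread affinity can be set with SetThreadAffinityMask. A minimal sketch, assuming four worker threads; worker() is a hypothetical stand-in for the real per-thread sequence code:

    #include <windows.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    /* hypothetical stand-in for the real per-thread sequence code */
    static DWORD WINAPI worker(LPVOID arg)
    {
        int id = (int)(INT_PTR)arg;
        /* ... run the sequences for this thread's subset of the data ... */
        printf("thread %d done\n", id);
        return 0;
    }

    int main(void)
    {
        HANDLE threads[NUM_THREADS];
        int i;

        for (i = 0; i < NUM_THREADS; i++) {
            threads[i] = CreateThread(NULL, 0, worker,
                                      (LPVOID)(INT_PTR)i,
                                      CREATE_SUSPENDED, NULL);
            /* pin thread i to logical processor i (bit i of the mask) */
            SetThreadAffinityMask(threads[i], (DWORD_PTR)1 << i);
            ResumeThread(threads[i]);
        }

        WaitForMultipleObjects(NUM_THREADS, threads, TRUE, INFINITE);
        for (i = 0; i < NUM_THREADS; i++)
            CloseHandle(threads[i]);
        return 0;
    }

Which pairs of affinity bits land on the same physical core depends on how Windows enumerates the logical processors, so that part of the mask choice is an assumption worth verifying on the target machine.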

AnthonyLambert
  • 8,768
  • 4
  • 37
  • 72
  • I do use this. Because my application is so CPU-bound the execution performed by the OS is negligible. – Olof Forshell Apr 09 '14 at 10:25
  • 1
    Without processor affinity my app runs at 50% usage on every core. With processor affinity it runs at 100 and requires half the time to complete. – Olof Forshell Apr 09 '14 at 10:37
  • I also run it at "below normal" priority so as not to starve other programs of CPU. It essentially takes everything that's available when the other applications have taken what they need. Over a run of a minute or so the app uses more than 99.8% of CPU available. – Olof Forshell Apr 09 '14 at 10:41
  • After a certain point you're better off spending $'s on buying a faster machine... an i7 which will plug into your box and will double your speed is $300-$400... – AnthonyLambert Apr 09 '14 at 11:27
  • I suppose I could buy a new machine. Even if I did I'd still want my application to take on bigger and bigger tasks. So I guess you could say that one does not rule out the other. But upgrading the i5 in my laptop to an i7 is interesting. Where do I find out more? – Olof Forshell Apr 09 '14 at 12:07
  • http://en.wikipedia.org/wiki/Intel_Core look up your existing i5 chip. Then find the i7 equivalent that uses the same socket etc. I don't recommend going for a newer generation chip as your bios won't support it (typically). You can usually buy i7's cheaper on eBay. – AnthonyLambert Apr 09 '14 at 13:24
  • 1
    @OlofForshell, you just tried processor affinity and it made your result twice as fast? You should add that information to your question. That's useful information! – Z boson Apr 09 '14 at 14:56
  • @Z boson: affinity was implemented very early so I forgot to mention it. – Olof Forshell Apr 11 '14 at 07:37
0

It's possible that hyperthreading is counterproductive; it often is with computationally intensive loads.

I would try the following:

  • disable it at the BIOS level and run two threads
  • try to optimize for and use the SSE/AVX vector extensions, possibly even by hand (a rough intrinsics sketch follows this list)
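A rough illustration only, since the actual sequence code isn't shown in the question: with SSE intrinsics one instruction operates on four floats at a time. add4() below is a made-up example, not the OP's workload, and assumes n is a multiple of 4 and the pointers are 16-byte aligned:

    #include <xmmintrin.h>   /* SSE intrinsics, always available on x86-64 */

    /* made-up example: add two float arrays four elements at a time */
    static void add4(const float *a, const float *b, float *out, int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(out + i, _mm_add_ps(va, vb));
        }
    }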

Explanation: HT is useful because hardware threads get scheduled more efficiently than software threads. However, there is overhead in both. Scheduling 2 threads is more lightweight than scheduling 4, and if your code is already "dense", I'd try to go for "denser" execution, optimizing the execution on the 2 pipelines as much as possible.

It's clear that if you optimize less, it scales better, but it will hardly be faster. So if you are looking for more scalability, this answer is not for you... but if you are looking for more speed, give it a try.

As others have already stated, there is no general solution when optimizing; otherwise that solution would already be built into the compilers.

Sigi
  • 4,826
  • 1
  • 19
  • 23
  • My BIOS does not allow disabling of hyper-threading. I will look into the AVX extensions. – Olof Forshell Apr 09 '14 at 10:43
  • That's a pity... apart from desoldering some processor pins or installing a modded bios (the first is a joke, but I did install a modded AMI BIOS on my core2 laptop), you can try to just use 1 thread locked on each core. – Sigi Apr 09 '14 at 10:57
  • As I said previously I can get four hyperthreads to run a bit faster than one thread on each core. Not by a whole lot which is the reason for my posting here. – Olof Forshell Apr 09 '14 at 12:09
0

You could download an OpenCL or CUDA toolkit and implement a version for your graphics card... you may be able to speed it up 100-fold with little effort.
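As a very rough sketch only: an embarrassingly parallel problem maps naturally onto an OpenCL kernel with one work-item per sequence. The host-side boilerplate (platform, context, queue, buffers, enqueueing the kernel) is omitted, and the kernel body below is a placeholder, not the OP's computation:

    /* hypothetical OpenCL C kernel: one work-item per sequence */
    __kernel void run_sequences(__global const uint *input,
                                __global uint *results)
    {
        size_t i = get_global_id(0);
        /* ... the per-sequence computation would go here ... */
        results[i] = input[i];   /* placeholder */
    }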

AnthonyLambert
  • 8,768
  • 4
  • 37
  • 72