
There are a couple of places in my code base where the same operation is repeated a very large number of times over a large data set. In some cases it takes a considerable time to process them.

I believe that using SSE to implement these loops should improve their performance significantly, especially where many operations are carried out on the same set of data, so once the data is read into the cache initially there shouldn't be any cache misses to stall it. However, I'm not sure how to go about this.

  • Is there a compiler- and OS-independent way of writing code to take advantage of SSE instructions? I like the VC++ intrinsics, which include SSE operations, but I haven't found any cross-compiler solutions.

  • I still need to support some CPUs that have either no or limited SSE support (e.g. Intel Celeron). Is there some way to avoid having to make different versions of the program, like having some kind of "run-time linker" that links in either the basic or the SSE-optimised code based on the CPU running it when the process is started?

  • What about other CPU extensions? Looking at the instruction sets of various Intel and AMD CPUs shows that there are quite a few of them.

Drew Dormann
Fire Lancer
  • @Matt - If you undelete your question at [Targetting Linux platforms with avx2 and non avx with one binary](https://stackoverflow.com/q/52086951/608639) we can probably provide a better answer for you. – jww Aug 29 '18 at 22:58

5 Answers


Good reading on the subject: Stop the instruction set war

Short overview: sorry, it is not possible to solve your problem in a way that is both simple and maximally compatible (Intel vs. AMD).

Juraj
  • I'm a long-time fan of the site. Thanks for pointing out the blog. – Pascal Cuoq Dec 12 '09 at 19:44
  • 1
    Nice article, but I'm not really sure how it is relevant to the OP. He's not asking for binary compatibility between the latest versions of SSE. SSE1, 2 and 3 are 100% compatible as far as I know -- and of course, he's not proposing to write the binary code himself, so even if there were differences in AMD's and Intels implementations, it wouldn't really make a difference to him. – jalf Dec 13 '09 at 12:15
  • Of course he is not going to write assembler code by hand. He wants the compiler to generate code for him that would automatically use the best available processor features. And in the middle of the above article it is clearly stated that there is NO such thing. Support for SSE and non-SSE CPUs in one binary must be done manually. – Juraj Dec 13 '09 at 15:27
  • No, he didn't say "automatically use the best processor features". He asked for the best way to take advantage of *specific* ISA extensions. That's a different situation. He knows which instructions he'd like to use, and both Intel and AMD agree on *those* instructions. The problem here is not an "instruction set war", but that Intel has built certain cheap processors that don't support these instructions. This is a problem that he wants advice on how to solve. Telling him that "1) the problem exists and 2) it's all due to incompatibility between AMD and Intel" is 1) not helpful, and 2) wrong – jalf Dec 13 '09 at 15:36
  • This answer is mostly irrelevant and should be a comment. It does not matter whether Intel or AMD is defining the instructions. It has nothing to do with the question. – jww Aug 29 '18 at 22:53

For your second point there are several solutions as long as you can separate out the differences into different functions:

  • plain old C function pointers
  • dynamic linking (which generally relies on C function pointers)
  • if you're using C++, having different classes that represent the support for different architectures and using virtual functions can help immensely with this.

Note that because you'd be relying on indirect function calls, the functions that abstract the different operations generally need to represent somewhat higher level functionality or you may lose whatever gains you get from the optimized instruction in the call overhead (in other words don't abstract the individual SSE operations - abstract the work you're doing).

Here's an example using function pointers:

typedef int (*scale_func_ptr)( int scalar, int* pData, int count);


int non_sse_scale( int scalar, int* pData, int count)
{
    // do whatever work needs done, without SSE so it'll work on older CPUs

    return 0;
}

int sse_scale( int scalar, int* pData, int count)
{
    // equivalent code, but uses SSE

    return 0;
}


// at initialization

scale_func_ptr scale_func = non_sse_scale;

if (useSSE) {
    scale_func = sse_scale;
}


// now, when you want to do the work:

scale_func( 12, theData_ptr, 512);  // this calls the routine tailored to SSE
                                    // if the CPU supports it, otherwise it calls
                                    // the non-SSE version of the function
BenMorel
Michael Burr
  • So effectively, as long as I don't try to actually execute code containing unsupported instructions I'm fine, and I could get away with an "if(sse2Supported){...}else{...}" type switch? – Fire Lancer Dec 12 '09 at 23:37
  • The things I'm thinking of rewriting with SSE, given the effort required to do so, are obviously expected to reap enough benefit to make the overhead insignificant (think of a loop doing the same thing on every 32 bits of a buffer at least a few thousand elements long). – Fire Lancer Dec 12 '09 at 23:39
  • You might be able to get away with a simple `if` branch, but I'd think you'll probably have to do something like compile to separate modules to make the compiler happy. But my thinking with function pointers is that you'd set them up to an appropriate routine at initialization and just call through them like regular functions - there would be no `if` conditionals at that point. – Michael Burr Dec 13 '09 at 03:01
  • 2016 update: auto-vectorization support has come a long way. For some code that doesn't require too much in the way of complicated shuffling, you can make the SSE2, SSE4, and AVX versions all from the same scalar source, with `-O3 -ffast-math -mtune=nehalem -msse4.1` / `... -mtune=haswell -mavx2` or similar. (If you want to let the compiler use other things, like popcnt or BMI2, you can just use `-march=haswell`, but then make sure to check those CPU feature flags, too, when setting up the function pointers.) – Peter Cordes Feb 08 '16 at 04:53
  • The code above is close but won't work reliably because `non_sse_scale` and `sse_scale` are in the same file. Compiling the code will require `-msse2` and SSE2 could cross-pollinate into non-SSE code. I just had it happen to me on POWER8. GCC was generating function prologues that used POWER8 instructions for non-Altivec code paths. – jww Aug 29 '18 at 22:50

The SSE intrinsics work with Visual C++, GCC and the Intel compiler. There is no problem using them these days.

Note that you should always keep a version of your code that does not use SSE and constantly check it against your SSE implementation.

This helps not only with debugging; it is also useful if you want to support CPUs or architectures that don't support your required SSE versions.

Nils Pipenbrinck
  • So does this mean they all use the same intrinsic names for SSE with a common include or will I need to make my own include with a ton of defines/inline functions to map them? Also what about my 2nd point? – Fire Lancer Dec 12 '09 at 20:03
  • The SSE intrinsics are identical between these compilers as far as I know, so you shouldn't need any defines or other trickery. At worst (can't remember if this is the case), the name of the header file will be different. – jalf Dec 13 '09 at 10:56
  • 2
    They all use , AFAIK. – Tom Dec 13 '09 at 16:10
  • For a while the intrinsics were split into different headers for different SSE versions (with a letter you can only remember if you remember which Intel microarchitecture codename introduced that SSE version). The currently-recommended way is `<immintrin.h>` for all SSE/AVX. – Peter Cordes Feb 08 '16 at 04:43

In answer to your comment:

So effectively, as long as I don't try to actually execute code containing unsupported instructions I'm fine, and I could get away with an "if(sse2Supported){...}else{...}" type switch?

Depends. It's fine for SSE instructions to exist in the binary as long as they're not executed. The CPU has no problem with that.

However, if you enable SSE support in the compiler, it will most likely swap a number of "normal" instructions for their SSE equivalents (scalar floating-point ops, for example), so even chunks of your regular non-SSE code will blow up on a CPU that doesn't support it.

So what you'll most likely have to do is compile one or two files separately, with SSE enabled, and let them contain all your SSE routines. Then link that with the rest of the app, which is compiled without SSE support.

jalf
  • I was planning, for the few cases where I think it really matters, to use SSE intrinsics, which emit SSE instructions even if /arch:SSE2 (for VC++) is not set, right? Or will the compiler just "emulate" them if /arch:SSE2 isn't set? – Fire Lancer Dec 13 '09 at 13:05
  • Nah, it won't "emulate" them. But it might give you a compile error if you try to use those intrinsics without /arch:SSE2. I'm not sure though, I haven't tried it. But if it compiles, it'll do as you want. – jalf Dec 13 '09 at 13:18

Rather than hand-coding an alternative SSE implementation to your scalar code, I strongly suggest you have a look at OpenCL. It is a vendor-neutral, portable, cross-platform system for computationally intensive applications (and is highly buzzword-compliant!). You can write your algorithm in a subset of C99 designed for vectorised operations, which is much easier than hand-coding SSE. And best of all, OpenCL will generate the best implementation at runtime, to execute either on the GPU or on the CPU. So basically you get the SSE code written for you.

There are a couple of places in my code base where the same operation is repeated a very large number of times over a large data set. In some cases it takes a considerable time to process them.

Your application sounds like just the kind of problem that OpenCL is designed to address. Writing alternative functions in SSE would certainly improve the execution speed, but it is a great deal of work to write and debug.

Is there a compiler- and OS-independent way of writing code to take advantage of SSE instructions? I like the VC++ intrinsics, which include SSE operations, but I haven't found any cross-compiler solutions.

Yes. The SSE intrinsics have been essentially standardised by Intel, so the same functions work the same between Windows, Linux and Mac (specifically with Visual C++ and GNU g++).

I still need to support some CPUs that have either no or limited SSE support (e.g. Intel Celeron). Is there some way to avoid having to make different versions of the program, like having some kind of "run-time linker" that links in either the basic or the SSE-optimised code based on the CPU running it when the process is started?

You could do that (e.g. using dlopen()), but it is a very complex solution. Much simpler would be (in C) to define a function interface and call the appropriate version of the optimised function via a function pointer, or (in C++) to use different implementation classes, depending on the CPU detected.

With OpenCL it is not necessary to do this, as the code is generated at runtime for the given architecture.

What about other CPU extensions? Looking at the instruction sets of various Intel and AMD CPUs shows that there are quite a few of them.

Within the SSE instruction set, there are many flavours. It can be quite difficult to code the same algorithm in different subsets of SSE when certain instructions are not present. I suggest (at least to begin with) that you choose a minimum supported level, such as SSE2, and fall back to the scalar implementation on older machines.

This is also an ideal situation for unit/regression testing, which is very important to ensure your different implementations produce the same results. Have a test suite of input data and known good output data, and run the same data through both versions of the processing function. You may need a precision test for passing (i.e. the difference between the result and the correct answer is below some epsilon, e.g. 1e-6). This will greatly aid in debugging, and if you build high-resolution timing into your testing framework, you can compare the performance improvements at the same time.

gavinb
  • 1
    Just pointing out that calling via function pointer and different C++ implementation classes are actually the *same* solution. C++ virtual functions are implemented through use of function pointers. – Zan Lynx Sep 02 '11 at 17:22