I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (e.g. HADDPD, horizontal add packed double, in SSE3). These require a certain register layout that needs to be either deliberately set up or produced by the series of instructions before it. How often do general-purpose compilers like gcc actually use these instructions (or a subset thereof), or are they mainly meant for hand-coded assembler? How does the compiler detect where it is appropriate to use SIMD instructions?
5 Answers
Generally, few compilers use them. Neither GCC nor Visual Studio is usually able to use the vectorized SIMD instructions. If you enable SSE with a compiler flag, they will use the scalar SSE instructions for regular floating-point operations, but generally, don't expect the vectorized ones to be used automatically. Recent versions of GCC might be able to use them in some cases, but it didn't work the last time I tried. Intel's C++ compiler is the only major compiler I know of that can auto-vectorize some loops.
In general though, you'll have to use them yourself, either in raw assembler or through compiler intrinsics. Intrinsics are usually the better approach, since they let the compiler understand the code better, and so schedule and optimize it; in practice, though, I know MSVC at least doesn't always generate very efficient code from intrinsics, so plain asm may be the best solution there. Experiment and see what works. But don't expect the compiler to use these instructions for you unless you 1) use the right compiler, and 2) write fairly simple loops that can be trivially vectorized.
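For illustration, here's roughly what the intrinsics route looks like in C (a minimal sketch; the function name and the alignment/size assumptions are mine, not from any particular codebase):

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Add two float arrays four elements at a time.
   Assumes n is a multiple of 4 and all pointers are 16-byte aligned. */
void add_arrays(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);             /* load 4 floats */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(dst + i, _mm_add_ps(va, vb));  /* 4 adds at once */
    }
}
```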
Update 2012
Ok, so three years have passed since I wrote this answer. GCC has been able to auto-vectorize (simple) code for a couple of years now, and with VS2012, MSVC finally gained the same capability. Of course, the main part of my answer still applies: compilers can still only vectorize fairly trivial code. For anything more complex, you're stuck fiddling with intrinsics or inline asm.

- Have intrinsics become (a lot) better in the last few years? Last time I checked, both MSVC and ICC had quite lousy register allocation, and even I was easily able to beat the compiler-intrinsic version with hand-coded assembly. – snemarch Sep 11 '09 at 09:03
- I believe recent versions of MSVC have made *some* improvements to intrinsics-generated code. But I don't know how much difference that has made. – jalf Sep 11 '09 at 12:23
- MSVC's output for scalar SSE is still just terrible, especially if you use an intrinsic anywhere. – Crashworks Oct 23 '09 at 00:27
Mono can use SIMD extensions as long as you use its classes for vectors. You can read about it here: http://tirania.org/blog/archive/2008/Nov-03.html
GCC should do some automatic vectorisation as long as you're using -O3, or -ftree-vectorize specifically. They have an info page here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
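A loop as simple as this is the kind of thing GCC's vectorizer handles (a sketch; the flag for printing the vectorization report varies by GCC version):

```c
/* Compile with: gcc -O3 -c scale.c
   Older GCC prints a vectorization report with -ftree-vectorizer-verbose=2,
   newer GCC with -fopt-info-vec. */
void scale(float *a, float s, int n)
{
    for (int i = 0; i < n; i++)
        a[i] *= s;  /* trivially vectorizable: iterations are independent */
}
```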

The question of how to exploit SSE and other small vector units automatically (without direction from the programmer in the form of special language constructs or specially blessed compiler "intrinsics") has been a topic of compiler research for some time. Most results seem to be specialized to a particular problem domain, such as digital signal processing. I have not kept up with the literature on this topic, but what I have read suggests that exploiting the vector (SSE) unit is still a topic for research, and that one should have low expectations of general-purpose compilers commonly used in the field.
Suggested search term: vectorizing compiler

- 2020 update: the major ahead-of-time compilers (not JITs) can fairly reliably vectorize simple "vertical" operations where the loop body accesses some arrays all with the same index, like `A[i] = B[i] * x + C[i]`, for integer or FP. Not gathers or scatters like `A[idx[i]]`. With arrays of different type-widths, or any shuffling or structs, or anything more complicated like a serial dependency (e.g. prefix sum), you still often need manual vectorization for best results; see the sketch after these comments. – Peter Cordes Feb 19 '20 at 02:05
- Some compilers can even vectorize math library functions like `log` or `exp`, but fast SIMD approximations can be a big win, e.g. if you know you don't care about handling NaN or Inf inputs, and you can accept lower precision output. – Peter Cordes Feb 19 '20 at 02:06
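To make that distinction concrete, here is a small sketch in C (function names are mine) contrasting a loop compilers vectorize reliably with one they usually can't:

```c
/* Vertical operation: every array is indexed with the same i and there is
   no cross-iteration dependency. gcc/clang -O3 vectorize this reliably. */
void axpy(float *restrict A, const float *restrict B,
          const float *restrict C, float x, int n)
{
    for (int i = 0; i < n; i++)
        A[i] = B[i] * x + C[i];
}

/* Serial dependency: each element depends on the previous one, so
   auto-vectorizers usually give up here. A prefix sum needs a manual
   SIMD formulation to go fast. */
void prefix_sum(float *A, int n)
{
    for (int i = 1; i < n; i++)
        A[i] += A[i - 1];
}
```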
I have seen gcc use SSE to zero out a default std::string object. Not a particularly powerful use of SSE, but it exists. In most cases, though, you will have to write your own.
I know this because I had allowed the stack to become unaligned and it crashed, otherwise I probably wouldn't have noticed!
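The same behavior is easy to reproduce with any small, fixed-size zeroing job; a minimal sketch (the struct is hypothetical, and the exact instructions depend on compiler version and flags):

```c
#include <string.h>

struct Widget { double a, b, c, d; };  /* 32 bytes */

void reset(struct Widget *w)
{
    /* With gcc -O2 on x86-64, this memset is typically emitted as two
       16-byte SSE stores (e.g. pxor + movups) rather than a libc call. */
    memset(w, 0, sizeof *w);
}
```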

- Yup, modern compilers targeting x86-64 freely and liberally use 16-byte loads/stores to copy around structs and to zero things. – Peter Cordes Feb 19 '20 at 02:07
If you use the Vector Pascal compiler you will get efficient SIMD code for types for which SIMD gives an advantage. Basically this is anything of length less than 64 bits (for 64-bit reals it is actually slower to do SIMD). Latest versions of the compiler will also automatically parallelise across cores.
- 64-bit real, aka `double`, benefits from SIMD on any CPU with SSE2, except maybe Pentium-M / Core Solo where 128b vector ops were split into two 64-bit halves, and multi-uop instructions cause decode bottlenecks. On anything after Core2 or AMD K10, SIMD is a clear win for `double` as well. – Peter Cordes Dec 15 '17 at 03:25