Questions tagged [sse]

SSE (Streaming SIMD Extensions) was the first of many similarly-named vector extensions to the x86 instruction set. At this point, "SSE" is more often a catch-all term for x86 vector instructions in general than a reference to SSE without SSE2, SSE3, etc. (For Server-Sent Events, use the [server-sent-events] tag instead.)

See the tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions.

SIMD / SSE basics: "What are the 128-bit to 512-bit registers used for?", with links to many examples.


SSE/SIMD vector programming guides, focused on the SIMD aspect rather than general x86:

  • Agner Fog's Optimizing Assembly guide has a chapter on vectors, including tables of data movement instructions: broadcasts within a vector, combining data between two vectors, different kinds of shuffles, etc. It's great for finding the right instruction (or intrinsic) for the data movement you need.

  • Crunching Numbers with AVX and AVX2: An intro with examples of using C++ intrinsics

  • Slides + text: SIMD at Insomniac Games (GDC 2015): intro to SIMD, and some specific examples: checking all doors against all characters in a level. Advanced tricks: Filtering an array into a smaller array (using Left-packing based on a compare mask), with an SSSE3 pshufb solution and an SSE2 move-distance solution. Also: generating N-bit masks for variable-per-element N. Including a clever float-exponent based SSE2 version.


Instruction-set / intrinsics reference guides (see the x86 tag wiki for more links)


Miscellaneous specific things:


Streaming SIMD Extensions (SSE) basics

Together, the various SSE extensions allow working with 128b vectors of float, double, or integer (from 8b to 64b) elements. There are instructions for arithmetic, bitwise operations, shuffles, blends (conditional moves), compares, and some more-specialized operations (e.g. SAD for multimedia, carryless multiply for crypto/finite-field math, string instructions for strstr() and so on). FP sqrt is provided, but unlike the x87 FPU, math library functions like sin must be implemented in software. SSE has replaced x87 floating point for scalar FP math, now that hardware support is near-universal.
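As a minimal sketch of what the intrinsics API looks like (the helper name `add4` is illustrative, not a standard function), here's a packed single-precision add, the kind of operation original SSE was built around:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Add two arrays of 4 floats with one 128-bit vector operation. */
void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);              /* load 4 floats (unaligned OK) */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));   /* 4 adds in one instruction */
}
```

The same pattern applies to `_mm_mul_ps`, `_mm_sub_ps`, the double-precision `_pd` variants (SSE2), and the integer `_epi8`/`_epi16`/`_epi32` forms.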

Efficient use usually requires programs to store their data in contiguous chunks, so it can be loaded in chunks of 16B and used without too much shuffling. SSE doesn't offer loads / stores with a stride, only packed. (SoA vs. AoS: structs-of-arrays vs. arrays-of-structs). Alignment requirements on memory operands can also be a hurdle, even though modern hardware has fast unaligned loads/stores.
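A sketch of why SoA beats AoS for this (the struct and function names here are illustrative): with separate per-component arrays, four consecutive `x` values load straight into one register, with no gathering or shuffling:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* AoS: x,y,z interleaved in memory -- loading four x values
 * would need strided access or shuffling. */
struct PointAoS { float x, y, z; };

/* SoA: each component contiguous -- four consecutive x values
 * are one 16B load. */
struct PointsSoA { float x[8], y[8], z[8]; };

/* Scale all x coordinates; n is assumed to be a multiple of 4. */
void scale_x(struct PointsSoA *p, float s, int n) {
    __m128 vs = _mm_set1_ps(s);               /* broadcast s to all 4 lanes */
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(&p->x[i]);
        _mm_storeu_ps(&p->x[i], _mm_mul_ps(vx, vs));
    }
}
```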

While there are many instructions available, the instruction set is not very orthogonal. It's not uncommon to find that the operation you need exists, but only for elements of a different size than the one you're working with. Another good example is that floating-point shuffles (SHUFPS) have different semantics than 32b-integer shuffles (PSHUFD).
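The SHUFPS/PSHUFD difference shows up directly in the intrinsics (the wrapper names below are illustrative): `_mm_shuffle_epi32` selects all four lanes from one source, while `_mm_shuffle_ps` always takes two sources, drawing the low two result lanes from the first operand and the high two from the second:

```c
#include <emmintrin.h>  /* SSE2 (for the integer shuffle) */

/* PSHUFD: all four 32-bit result lanes come from ONE source. */
__m128i reverse_epi32(__m128i v) {
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
}

/* SHUFPS: two sources -- result lanes 0,1 from the first operand,
 * lanes 2,3 from the second. A one-vector reverse must pass the
 * same vector twice. */
__m128 reverse_ps(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}
```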

Details

SSE added new architectural registers (xmm0-xmm7, 128b each (xmm0-xmm15 in 64bit mode)), requiring OS support to save/restore them on context switches. The previous MMX extensions (for integer SIMD) reused the x87 FP registers.

Intel introduced MMX, original SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2. AMD's XOP (a revision of their SSE5 plans) was never picked up by Intel, and will be dropped even by future AMD designs. The instruction-set war between Intel and AMD has produced many sub-optimal results, making instruction decoders in CPUs cost more power and transistors (and limiting the opportunity for further extensions).

"SSE" commonly refers to the whole family of extensions. Programs must make sure to use only instructions supported by the machine they run on; this is necessary, implied throughout, and not worth cluttering our language with. (Setting function pointers once at startup, after detecting CPU features, is a good way to avoid re-checking support with a branch every time a function is needed.)
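A minimal sketch of that dispatch pattern (function names are illustrative; `__builtin_cpu_supports` is a GCC/Clang builtin; SSE2 is chosen here only because it compiles without extra target flags on x86-64):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

static int sum_scalar(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

/* SSE2 version: four lanes at a time; n assumed a multiple of 4. */
static int sum_sse2(const int *a, int n) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)&a[i]));
    /* Horizontal sum of the 4 lanes via two shuffle+add steps. */
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(acc);
}

/* Set once at startup; callers go through the pointer with no branch. */
static int (*sum_impl)(const int *, int) = sum_scalar;

void init_dispatch(void) {
    if (__builtin_cpu_supports("sse2"))   /* GCC/Clang runtime CPU check */
        sum_impl = sum_sse2;
}
```

Real code would test for the highest extension it has a version for (e.g. "avx2", then "sse4.1") and fall through to the scalar version.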

Further SSE extensions are not expected: AVX introduced a new 3-operand version of all SSE instructions, as well as some new features (including relaxed alignment requirements for memory operands, except for explicitly-aligned moves like vmovdqa). Further vector extensions will be called AVX-something, until Intel comes up with something different enough to change the name again.

History

SSE, first introduced with the Pentium III in 1999, was Intel's reply to AMD's 3DNow! extension of 1998.

The original SSE added vector single-precision floating-point math. Integer instructions operating on xmm registers (instead of the 64-bit mmx registers) didn't appear until SSE2.

Original SSE can be considered somewhat half-hearted: it covered only the most basic operations and suffered from severe limitations in both functionality and performance, making it useful mostly for a few select applications, such as audio or raster image processing.

Most of SSE's limitations were ameliorated by the SSE2 instruction set. The only notable limitation remaining to date is the lack of a horizontal-add or dot-product operation that is both efficient and widely available: while SSE3 and SSE4.1 added horizontal-add and dot-product instructions, they're usually slower than a manual shuffle+add, so only use them at the end of a loop.
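The usual shuffle+add horizontal sum looks like this (a well-known SSE1-only pattern; the helper name is illustrative). It typically beats two HADDPS instructions, which each decode to multiple shuffle+add uops on most CPUs:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Horizontal sum of the 4 floats in v, using only SSE1 instructions. */
float hsum_ps(__m128 v) {
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* swap pairs */
    __m128 sums = _mm_add_ps(v, shuf);     /* [a+b, a+b, c+d, c+d]   */
    shuf = _mm_movehl_ps(shuf, sums);      /* high pair -> low lanes */
    sums = _mm_add_ss(sums, shuf);         /* lane 0 = a+b+c+d       */
    return _mm_cvtss_f32(sums);
}
```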

The lack of cross-manufacturer support made software development with SSE a challenge during the initial years. With AMD's adoption of SSE2 into its 64bit processors during 2003/2004, this problem gradually disappeared. As of today, there exist virtually no processors without SSE/SSE2 support. SSE2 is part of x86-64 baseline, with twice as many vector registers available in 64bit mode.

2314 questions
24
votes
2 answers

inlining failed in call to always_inline ‘_mm_mullo_epi32’: target specific option mismatch

I am trying to compile a C program using cmake which uses SIMD intrinsics. When I try to compile it, I get two errors /usr/lib/gcc/x86_64-linux-gnu/5/include/smmintrin.h:326:1: error: inlining failed in call to always_inline ‘_mm_mullo_epi32’:…
Lawan subba
  • 610
  • 3
  • 7
  • 19
24
votes
1 answer

What's the difference among cflgs sse options of -msse, -msse2, -mssse3, -msse4 rtc..? and how to determine?

For the GCC CFLAGS options: -msse, -msse2, -mssse3, -msse4, -msse4.1, -msse4.2. Are they exclusive in their use or can they be used together? My understanding is that the choosing which to set depends on whether the target arch, which the program…
yaya
  • 351
  • 1
  • 2
  • 3
23
votes
3 answers

Fastest way to do horizontal vector sum with AVX instructions

I have a packed vector of four 64-bit floating-point values. I would like to get the sum of the vector's elements. With SSE (and using 32-bit floats) I could just do the following: v_sum = _mm_hadd_ps(v_sum, v_sum); v_sum = _mm_hadd_ps(v_sum,…
Luigi Castelli
  • 676
  • 2
  • 6
  • 13
23
votes
1 answer

SSE vector wrapper type performance compared to bare __m128

I found an interesting Gamasutra article about SIMD pitfalls, which states that it is not possible to reach the performance of the "pure" __m128 type with wrapper types. Well I was skeptical, so I downloaded the project files and fabricated a…
plasmacel
  • 8,183
  • 7
  • 53
  • 101
23
votes
2 answers

AVX 256-bit code performing slightly worse than equivalent 128-bit SSSE3 code

I am trying to write very efficient Hamming-distance code. Inspired by Wojciech Muła's extremely clever SSE3 popcount implementation, I coded an AVX2 equivalent solution, this time using 256 bit registers. l was expecting at least a 30%-40%…
BlueStrat
  • 2,202
  • 17
  • 27
23
votes
5 answers

Efficient 4x4 matrix multiplication (C vs assembly)

I'm looking for a faster and trickier way to multiply two 4x4 matrices in C. My current research is focused on x86-64 assembly with SIMD extensions. So far, I've created a function witch is about 6x faster than a naive C implementation, which has…
Krzysztof Abramowicz
  • 1,556
  • 1
  • 12
  • 30
23
votes
3 answers

How to control whether C math uses SSE2?

I stepped into the assembly of the transcendental math functions of the C library with MSVC in fp:strict mode. They all seem to follow the same pattern, here's what happens for sin. First there is a dispatch routine from a file called…
Asik
  • 21,506
  • 6
  • 72
  • 131
22
votes
5 answers

Optimizing Array Compaction

Let's say I have an array k = [1 2 0 0 5 4 0] I can compute a mask as follows m = k > 0 = [1 1 0 0 1 1 0] Using only the mask m and the following operations Shift left / right And/Or Add/Subtract/Multiply I can compact k into the following [1 2 5…
jameszhao00
  • 7,213
  • 15
  • 62
  • 112
22
votes
8 answers

c++ SSE SIMD framework

Does anyone know an open-source C++ x86 SIMD intrinsics library? Intel supplies exactly what I need in their integrated performance primitives library, but I can't use that because of the copyrights all over the place. EDIT I already know the…
user283145
22
votes
1 answer

Fastest way to compute absolute value using SSE

I am aware of 3 methods, but as far as I know, only the first 2 are generally used: Mask off the sign bit using andps or andnotps. Pros: One fast instruction if the mask is already in a register, which makes it perfect for doing this many times in…
Kumputer
  • 588
  • 1
  • 6
  • 22
22
votes
4 answers

SSE integer division?

There is _mm_div_ps for floating-point values division, there is _mm_mullo_epi16 for integer multiplication. But is there something for integer division (16 bits value)? How can i conduct such division?
fogbit
  • 1,961
  • 6
  • 27
  • 41
22
votes
5 answers

How to combine two __m128 values to __m256?

I would like to combine two __m128 values to one __m256. Something like this: __m128 a = _mm_set_ps(1, 2, 3, 4); __m128 b = _mm_set_ps(5, 6, 7, 8); to something like: __m256 c = { 1, 2, 3, 4, 5, 6, 7, 8 }; are there any intrinsics that I can…
user1468756
  • 331
  • 2
  • 8
22
votes
5 answers

SIMD prefix sum on Intel cpu

I need to implement a prefix sum algorithm and would need it to be as fast as possible. Ex: [3, 1, 7, 0, 4, 1, 6, 3] should give: [3, 4, 11, 11, 15, 16, 22, 25] Is there a way to do this using SSE SIMD CPU instruction? My first idea is to…
skyde
  • 2,816
  • 4
  • 34
  • 53
21
votes
5 answers

How do modern compilers use mmx/3dnow/sse instructions?

I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (eg HADDPD - (Horizontal-Add-Packed-Double) in SSE3). These require a certain register layout that needs to be either…
thecoop
  • 45,220
  • 19
  • 132
  • 189
21
votes
2 answers

Choice between aligned vs. unaligned x86 SIMD instructions

There are generally two types of SIMD instructions: A. Ones that work with aligned memory addresses, that will raise general-protection (#GP) exception if the address is not aligned on the operand size boundary: movaps xmm0, xmmword ptr…
MikeF
  • 1,021
  • 9
  • 29