Questions tagged [sse]

SSE (Streaming SIMD Extensions) was the first of many similarly-named vector extensions to the x86 instruction set. At this point, "SSE" is more often a catch-all for x86 vector instructions in general, not a reference to SSE to the exclusion of SSE2, SSE3, etc. (For Server-Sent Events, use the [server-sent-events] tag instead.)

See the tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions.

SIMD / SSE basics: What are the 128-bit to 512-bit registers used for? with links to many examples.


SSE/SIMD vector programming guides, focused on the SIMD aspect rather than general x86:

  • Agner Fog's Optimizing Assembly guide has a chapter on vectors, including tables of data-movement instructions: broadcasts within a vector, combining data between two vectors, different kinds of shuffles, etc. It's great for finding the right instruction (or intrinsic) for the data movement you need.

  • Crunching Numbers with AVX and AVX2: An intro with examples of using C++ intrinsics

  • Slides + text: SIMD at Insomniac Games (GDC 2015): an intro to SIMD, plus some specific examples, e.g. checking all doors against all characters in a level. Advanced tricks: filtering an array into a smaller array (left-packing based on a compare mask), with an SSSE3 pshufb solution and an SSE2 move-distance solution. Also: generating N-bit masks for variable-per-element N, including a clever float-exponent-based SSE2 version.


Instruction-set / intrinsics reference guides (see the x86 tag wiki for more links)


Miscellaneous specific things:


Streaming SIMD Extensions (SSE) basics

Together, the various SSE extensions allow working with 128b vectors of float, double, or integer (8b to 64b) elements. There are instructions for arithmetic, bitwise operations, shuffles, blends (conditional moves), compares, and some more-specialized operations (e.g. SAD for multimedia, carryless multiply for crypto/finite-field math, string instructions for strstr() and so on). FP sqrt is provided, but unlike the x87 FPU, math library functions like sin must be implemented in software. SSE has replaced x87 floating point for scalar FP math, now that hardware support is near-universal.

Efficient use usually requires programs to store their data in contiguous chunks, so it can be loaded 16B at a time and used without too much shuffling. SSE offers only packed loads/stores of contiguous elements, not loads/stores with a stride (see SoA vs. AoS: structs-of-arrays vs. arrays-of-structs). Alignment requirements on memory operands can also be a hurdle, even though modern hardware has fast unaligned loads/stores.

While there are many instructions available, the instruction set is not very orthogonal. It's not uncommon to find the operation you need, but only for elements of a different size than the one you're working with. Another good example: floating-point shuffles (SHUFPS) have different semantics than 32b-integer shuffles (PSHUFD).

Details

SSE added new architectural registers (xmm0-xmm7, 128b each (xmm0-xmm15 in 64bit mode)), requiring OS support to save/restore them on context switches. The previous MMX extensions (for integer SIMD) reused the x87 FP registers.

Intel introduced MMX, original SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2. AMD's XOP (a revision of their SSE5 plans) was never picked up by Intel, and has been dropped even from AMD's own later designs. The instruction-set war between Intel and AMD has produced many sub-optimal results, requiring more power and transistors in CPU instruction decoders (and limiting the opportunity for further extensions).

"SSE" commonly refers to the whole family of extensions. Writing programs that make sure to only use instructions supported by the machine they run on is necessary, and implied, and not worth cluttering our language with. (Setting function pointers is a good way to detect what's supported once at startup, avoiding a branch to select an appropriate function every time one is needed.)

Further SSE extensions are not expected: AVX introduced a new 3-operand version of all SSE instructions, as well as some new features (including relaxed alignment requirements on memory operands, except for explicitly-aligned moves like vmovdqa). Further vector extensions will be called AVX-something, until Intel comes up with something different enough to change the name again.

History

SSE, first introduced with the Pentium III in 1999, was Intel's reply to AMD's 3DNow! extension, which had appeared in 1998.

Original SSE added vector single-precision floating-point math. Integer instructions operating on xmm registers (instead of the 64bit mmx registers) didn't appear until SSE2.

Original SSE can be considered somewhat half-hearted: it covered only the most basic operations and suffered from severe limitations in both functionality and performance, making it mostly useful for a few select applications such as audio and raster image processing.

Most of SSE's limitations were ameliorated by the SSE2 instruction set; the only notable limitation remaining to date is the lack of a horizontal-add or dot-product operation that is both efficient and widely available. SSE3 and SSE4.1 did add horizontal-add and dot-product instructions, but they're usually slower than a manual shuffle+add, so use them only at the end of a loop, not inside one.

The lack of cross-manufacturer support made software development with SSE a challenge during the initial years. With AMD's adoption of SSE2 into its 64bit processors during 2003/2004, this problem gradually disappeared. As of today, there exist virtually no processors without SSE/SSE2 support. SSE2 is part of x86-64 baseline, with twice as many vector registers available in 64bit mode.

2314 questions
21
votes
3 answers

How to solve the 32-byte-alignment issue for AVX load/store operations?

I am having alignment issue while using ymm registers, with some snippets of code that seems fine to me. Here is a minimal working example: #include #include inline void ones(float *a) { __m256 out_aligned =…
romeric
  • 2,325
  • 3
  • 19
  • 35
21
votes
4 answers

SSE2 integer overflow checking

When using SSE2 instructions such as PADDD (i.e., the _mm_add_epi32 intrinsic), is there a way to check whether any of the operations overflowed? I thought that maybe a flag on the MXCSR control register may get set after an overflow, but I don't…
Igor ostrovsky
  • 7,282
  • 2
  • 29
  • 28
20
votes
3 answers

What's the difference between logical SSE intrinsics?

Is there any difference between logical SSE intrinsics for different types? For example if we take OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128 all of which do the same thing: compute bitwise OR of their operands.…
user283145
20
votes
3 answers

Auto-vectorizing: Convincing the compiler that alias check is not necessary

I am doing some image processing, for which I benefit from vectorization. I have a function that vectorizes ok, but for which I am not able to convince the compiler that the input and output buffer have no overlap, and so no alias checking is…
Antonio
  • 19,451
  • 13
  • 99
  • 197
20
votes
4 answers

Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x64_64 Intel CPUs?

I'm considering changing some code high performance code that currently requires 16 byte aligned arrays and uses _mm_load_ps to relax the alignment constraint and use _mm_loadu_ps. There are a lot of myths about the performance implications of…
Robert T. McGibbon
  • 5,075
  • 3
  • 37
  • 45
20
votes
5 answers

SSE-copy, AVX-copy and std::copy performance

I'm tried to improve performance of copy operation via SSE and AVX: #include const int sz = 1024; float *mas = (float *)_mm_malloc(sz*sizeof(float), 16); float *tar = (float *)_mm_malloc(sz*sizeof(float), 16); …
gorill
  • 1,623
  • 3
  • 20
  • 29
20
votes
2 answers

How to sum __m256 horizontally?

I would like to horizontally sum the components of a __m256 vector using AVX instructions. In SSE I could use _mm_hadd_ps(xmm,xmm); _mm_hadd_ps(xmm,xmm); to get the result at the first component of the vector, but this does not scale with the 256…
Yoav
  • 5,962
  • 5
  • 39
  • 61
19
votes
1 answer

SSE 4 instructions generated by Visual Studio 2013 Update 2 and Update 3

If I compile this code in VS 2013 Update 2 or Update 3: (below comes from Update 3) #include "stdafx.h" #include #include struct Buffer { long* data; int count; }; #ifndef max #define max(a,b) (((a) > (b)) ?…
Yakk - Adam Nevraumont
  • 262,606
  • 27
  • 330
  • 524
19
votes
1 answer

How to implement "_mm_storeu_epi64" without aliasing problems?

(Note: Although this question is about "store", the "load" case has the same issues and is perfectly symmetric.) The SSE intrinsics provide an _mm_storeu_pd function with the following signature: void _mm_storeu_pd (double *p, __m128d a); So if I…
Nemo
  • 70,042
  • 10
  • 116
  • 153
19
votes
8 answers

Fast dot product of a bit vector and a floating point vector

I'm trying to compute the dot product between a floating and a bit vector in the most efficient manner on a i7. In reality, I'm doing this operation on either 128 or 256-dimensional vectors, but for illustration, let me write the code for…
John Smith
  • 980
  • 1
  • 8
  • 15
19
votes
1 answer

loop tiling/blocking for large dense matrix multiplication

I was wondering if someone could show me how to use loop tiling/loop blocking for large dense matrix multiplication effectively. I am doing C = AB with 1000x1000 matrices. I have followed the example on Wikipedia for loop tiling but I get worse…
user2088790
19
votes
2 answers

How to rotate an SSE/AVX vector

I need to perform a rotate operation with as little clock cycles as possible. In the first case let's assume __m128i as source and dest type: source: || A0 || A1 || A2 || A3 || dest: || A1 || A2 || A3 || A0 || dest =…
user1584773
  • 699
  • 7
  • 19
19
votes
3 answers

SSE: Difference between _mm_load/store vs. using direct pointer access

Suppose I want to add two buffers and store the result. Both buffers are already allocated 16byte aligned. I found two examples how to do that. The first one is using _mm_load to read the data from the buffer into an SSE register, does the add…
Peter
  • 785
  • 2
  • 7
  • 18
18
votes
2 answers

Reference manual/tutorial for x86 SIMD intrinsics?

I'm looking into using these to improve the performance of some code but good documentation seems hard to find for the functions defined in the *mmintrin.h headers, can anybody provide me with pointers to good info on these? EDIT: particularly…
BD at Rivenhill
  • 12,395
  • 10
  • 46
  • 49
18
votes
3 answers

Get sum of values stored in __m256d with SSE/AVX

Is there a way to get sum of values stored in __m256d variable? I have this code. acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec)); //acc in this point contains {2.0, 8.0, 18.0, 32.0} acc = _mm256_hadd_pd(acc, acc); result[i] = ((double*)&acc)[0] +…
Peter
  • 435
  • 1
  • 3
  • 11