Questions tagged [sse]

SSE (Streaming SIMD Extensions) was the first of many similarly-named vector extensions to the x86 instruction set. At this point, SSE more often a catch-all for x86 vector instructions in general, and not a reference to SSE without SSE2, SSE3, etc. (For Server-Sent Events use [server-sent-events] tag instead)

See the tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions.

SIMD / SSE basics: What are the 128-bit to 512-bit registers used for? with links to many examples.


SSE/SIMD vector programming guides, focused on the SIMD aspect rather than general x86:

  • Agner Fog's Optimizing Assembly guide has a chapter on vectors, including tables of data movement instructions: broadcasts within a vector, combine data between two vectors, different kinds of shuffles, etc. It's great for finding the right instruction (on intrinsic) for the data movement you need.

  • Crunching Numbers with AVX and AVX2: An intro with examples of using C++ intrinsics

  • Slides + text: SIMD at Insomniac Games (GDC 2015): intro to SIMD, and some specific examples: checking all doors against all characters in a level. Advanced tricks: Filtering an array into a smaller array (using Left-packing based on a compare mask), with an SSSE3 pshufb solution and an SSE2 move-distance solution. Also: generating N-bit masks for variable-per-element N. Including a clever float-exponent based SSE2 version.


Instruction-set / intrinsics reference guides (see the x86 tag wiki for more links)


Miscellaneous specific things:


Streaming SIMD Extensions (SSE) basics

Together, the various SSE extensions allow working with 128b vectors of float, double, or integer (from 8b to 64b) elements. There are instructions for arithmetic, bitwise operations, shuffles, blends (conditional moves), compares, and some more-specialized operations (e.g. SAD for multimedia, carryless-multiply for crypto/finite-field math, strings (for strstr() and so on)). FP sqrt is provided, but unlike the x87 FPU, math library functions like sin must be implemented by software. SSE for scalar FP math has replaced x87 floating point, now that hardware support is near-universal.

Efficient use usually requires programs to store their data in contiguous chunks, so it can be loaded in chunks of 16B and used without too much shuffling. SSE doesn't offer loads / stores with a stride, only packed. (SoA vs. AoS: structs-of-arrays vs. arrays-of-structs). Alignment requirements on memory operands can also be a hurdle, even though modern hardware has fast unaligned loads/stores.

While there are many instructions available, the instruction set is not very orthogonal. It's not uncommon to find the operation you need, but only available for elements of a different size than you're working with. Another good example is that floating point shuffles (SHUFPS) have different semantics than 32b-integer shuffles (PSHUFD).

Details

SSE added new architectural registers (xmm0-xmm7, 128b each (xmm0-xmm15 in 64bit mode)), requiring OS support to save/restore them on context switches. The previous MMX extensions (for integer SIMD) reused the x87 FP registers.

Intel introduced MMX, original-SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2. AMD's XOP (a revision of their SSE5 plans) was never picked up by Intel, and will be dropped even by future AMD designs. The instruction-set war between Intel and AMD has led to many sub-optimal results, which makes instruction decoders in CPUs require more power and transistors. (And limits opportunity for further extensions).

"SSE" commonly refers to the whole family of extensions. Writing programs that make sure to only use instructions supported by the machine they run on is necessary, and implied, and not worth cluttering our language with. (Setting function pointers is a good way to detect what's supported once at startup, avoiding a branch to select an appropriate function every time one is needed.)

Further SSE extensions are not expected: AVX introduced a new 3-operand version of all SSE instructions, as well as some new features (including dropping alignment requirements, except for explicitly-aligned moves like vmovdqa). Further vector extensions will be called AVX-something, until Intel comes up with something different enough to change the name again.

History

SSE, first introduced with the Pentium III in 1999, was Intel's reply to AMD's 3DNow extension in 1998.

The original-SSE added vector single-precision floating point math. Integer instructions to operate on xmm registers (instead of 64bit mmx regs) didn't appear until SSE2.

Original-SSE can be considered somewhat half-hearted insofar as it only covered the most basic operations and suffered from severe limitations both in functionality and performance, making it mostly useful for a few select applications, such as audio or raster image processing.

Most of SSE's limitations have been ameliorated with the SSE2 instruction set, the only notable limitation remaining to date is the lack of horizontal addition or a dot product operation in both an efficient way and widely available. While SSE3 and SSE4.1 added horizontal add and dot product instructions, they're usually slower than manual shuffle+add. Only use them at the end of a loop.

The lack of cross-manufacturer support made software development with SSE a challenge during the initial years. With AMD's adoption of SSE2 into its 64bit processors during 2003/2004, this problem gradually disappeared. As of today, there exist virtually no processors without SSE/SSE2 support. SSE2 is part of x86-64 baseline, with twice as many vector registers available in 64bit mode.

2314 questions
52
votes
5 answers

Getting started with Intel x86 SSE SIMD instructions

I want to learn more about using the SSE. What ways are there to learn, besides the obvious reading the Intel® 64 and IA-32 Architectures Software Developer's Manuals? Mainly I'm interested to work with the GCC X86 Built-in Functions.
Liran Orevi
  • 4,755
  • 7
  • 47
  • 64
50
votes
8 answers

How to determine if memory is aligned?

I am new to optimizing code with SSE/SSE2 instructions and until now I have not gotten very far. To my knowledge a common SSE-optimized function would look like this: void sse_func(const float* const ptr, int len){ if( ptr is aligned ) { …
user229898
  • 3,057
  • 3
  • 19
  • 9
48
votes
6 answers

AVX2 what is the most efficient way to pack left based on a mask?

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2? I've seen in SSE where it was done like…
Froglegs
  • 1,095
  • 1
  • 11
  • 21
48
votes
2 answers

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

I have learned that some Intel/AMD CPUs can do simultanous multiply and add with SSE/AVX: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I like to know how to do this best in code and I also want to know how it's done internally in the…
user2088790
45
votes
4 answers

Using AVX intrinsics instead of SSE does not improve speed -- why?

I've been using Intel's SSE intrinsics for quite some time with good performance gains. Hence, I expected the AVX intrinsics to further speed-up my programs. This, unfortunately, was not the case until now. Probably I am doing a stupid mistake, so I…
user1158218
  • 451
  • 1
  • 4
  • 4
45
votes
5 answers

What does ordered / unordered comparison mean?

Looking at the SSE operators CMPORDPS - ordered compare packed singles CMPUNORDPS - unordered compare packed singles What do ordered and unordered mean? I looked for equivalent instructions in the x86 instruction set, and it only seems to have…
Dan
  • 12,409
  • 3
  • 50
  • 87
43
votes
1 answer

Difference between MOVDQA and MOVAPS x86 instructions?

I'm looking Intel datasheet: Intel® 64 and IA-32 Architectures Software Developer’s Manual and I can't find the difference between MOVDQA: Move Aligned Double Quadword MOVAPS: Move Aligned Packed Single-Precision In Intel datasheet I can find…
GJ.
  • 10,810
  • 2
  • 45
  • 62
42
votes
4 answers

Intel SSE and AVX Examples and Tutorials

Is there any good C/C++ tutorials or examples for learning Intel SSE and AVX instructions? I found few on Microsoft MSDN and Intel sites, but it would be great to understand it from the basics..
veda
  • 6,416
  • 15
  • 58
  • 78
41
votes
2 answers

Are different mmx, sse and avx versions complementary or supersets of each other?

I'm thinking I should familiarize myself with x86 SIMD extensions. But before I even began I ran into trouble. I can't find a good overview on which of them are still relevant. The x86 architecture has accumulated a lot of math/multimedia extensions…
snoukkis
  • 513
  • 1
  • 4
  • 6
41
votes
8 answers

Why is strcmp not SIMD optimized?

I've tried to compile this program on an x64 computer: #include int main(int argc, char* argv[]) { return ::std::strcmp(argv[0], "really really really really really really really really really" "really really really really…
user1095108
  • 14,119
  • 9
  • 58
  • 116
36
votes
1 answer

difference between MMX and XMM register?

I'm currently learning assembly programming on Intel x86 processor. Could someone please explain to me, what is the difference between MMX and XMM register? I'm very confused in terms of what functions they serve and the difference and similarities…
Thor
  • 9,638
  • 15
  • 62
  • 137
34
votes
7 answers

SSE instructions: which CPUs can do atomic 16B memory operations?

Consider a single memory access (a single read or a single write, not read+write) SSE instruction on an x86 CPU. The instruction is accessing 16 bytes (128 bits) of memory and the accessed memory location is aligned to 16 bytes. The document "Intel®…
user811773
34
votes
0 answers

Per-element atomicity of vector load/store and gather/scatter?

Consider an array like atomic shared_array[]. What if you want to SIMD vectorize for(...) sum += shared_array[i].load(memory_order_relaxed)?. Or to search an array for the first non-zero element, or zero a range of it? It's probably…
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
33
votes
5 answers

Can one construct a "good" hash function using CRC32C as a base?

Given that SSE 4.2 (Intel Core i7 & i5 parts) includes a CRC32 instruction, it seems reasonable to investigate whether one could build a faster general-purpose hash function. According to this only 16 bits of a CRC32 are evenly distributed. So what…
DavidD
  • 361
  • 1
  • 4
  • 5
32
votes
1 answer

What are the best instruction sequences to generate vector constants on the fly?

"Best" means fewest instructions (or fewest uops, if any instructions decode to more than one uop). Machine-code size in bytes is a tie-breaker for equal insn count. Constant-generation is by its very nature the start of a fresh dependency chain,…
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847