Questions tagged [sse]

SSE (Streaming SIMD Extensions) was the first of many similarly-named vector extensions to the x86 instruction set. At this point, SSE is more often used as a catch-all for x86 vector instructions in general, not a reference to SSE without SSE2, SSE3, etc. (For Server-Sent Events, use the [server-sent-events] tag instead.)

See the tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions.

SIMD / SSE basics: What are the 128-bit to 512-bit registers used for? with links to many examples.


SSE/SIMD vector programming guides, focused on the SIMD aspect rather than general x86:

  • Agner Fog's Optimizing Assembly guide has a chapter on vectors, including tables of data-movement instructions: broadcasts within a vector, combining data between two vectors, different kinds of shuffles, etc. It's great for finding the right instruction (or intrinsic) for the data movement you need.

  • Crunching Numbers with AVX and AVX2: An intro with examples of using C++ intrinsics

  • Slides + text: SIMD at Insomniac Games (GDC 2015): an intro to SIMD, plus some specific examples, e.g. checking all doors against all characters in a level. Advanced tricks: filtering an array into a smaller array (left-packing based on a compare mask), with an SSSE3 pshufb solution and an SSE2 move-distance solution (a minimal left-packing sketch follows this list). Also: generating N-bit masks for variable-per-element N, including a clever float-exponent-based SSE2 version.
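
For readers who want to see the left-packing idea in code rather than slides, here is a minimal sketch (not taken from the talk): it keeps the floats greater than a threshold using _mm_movemask_ps and an SSSE3 _mm_shuffle_epi8 control chosen by the 4-bit mask. The function name filter_gt and the in-function table construction are invented for illustration; it assumes GCC/Clang (__builtin_popcount) and SSSE3 (compile with -mssse3 or better).

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Sketch: copy the elements of src[0..n) that are > threshold to dst,
// packed to the front; returns how many were kept. Assumes n % 4 == 0 and
// that dst has room for a full 16B store at each step (an array as large
// as src is enough).
static size_t filter_gt(const float* src, float* dst, size_t n, float threshold) {
    // 16-entry PSHUFB control table: entry m moves the 32-bit lanes whose bit
    // is set in m down to the low end; 0x80 bytes zero the unused tail.
    // Built here for brevity; real code would precompute it once.
    uint8_t table[16][16];
    for (int m = 0; m < 16; ++m) {
        int out = 0;
        for (int lane = 0; lane < 4; ++lane) {
            if (!(m & (1 << lane))) continue;
            for (int b = 0; b < 4; ++b) table[m][4 * out + b] = uint8_t(4 * lane + b);
            ++out;
        }
        for (; out < 4; ++out)
            for (int b = 0; b < 4; ++b) table[m][4 * out + b] = 0x80;
    }

    const __m128 vthresh = _mm_set1_ps(threshold);
    size_t written = 0;
    for (size_t i = 0; i < n; i += 4) {
        __m128  v    = _mm_loadu_ps(src + i);
        int     m    = _mm_movemask_ps(_mm_cmpgt_ps(v, vthresh));   // 4-bit keep mask
        __m128i ctl  = _mm_loadu_si128((const __m128i*)table[m]);
        __m128i pack = _mm_shuffle_epi8(_mm_castps_si128(v), ctl);  // SSSE3 left-pack
        _mm_storeu_si128((__m128i*)(dst + written), pack);
        written += (size_t)__builtin_popcount((unsigned)m);         // GCC/Clang builtin
    }
    return written;
}
```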


Instruction-set / intrinsics reference guides (see the x86 tag wiki for more links)


Miscellaneous specific things:


Streaming SIMD Extensions (SSE) basics

Together, the various SSE extensions allow working with 128b vectors of float, double, or integer (from 8b to 64b) elements. There are instructions for arithmetic, bitwise operations, shuffles, blends (conditional moves), compares, and some more-specialized operations (e.g. SAD for multimedia, carryless multiply for crypto/finite-field math, string instructions for strstr() and so on). FP sqrt is provided, but unlike the x87 FPU, math-library functions like sin must be implemented in software. SSE has replaced x87 for scalar FP math, now that hardware support is near-universal.
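
To give a flavour of the intrinsics, here is a small illustrative snippet (not from any of the guides above; the function names are made up, and it assumes a C++ compiler with SSE4.1 enabled, e.g. -msse4.1, because _mm_blendv_ps is SSE4.1):

```cpp
#include <immintrin.h>

// Clamp four floats to a per-lane maximum: out[i] = min(a[i], limit[i]).
// Written with a compare + blend to illustrate those operations;
// _mm_min_ps would of course do this in one instruction.
__m128 clamp_to_limit(__m128 a, __m128 limit) {
    __m128 too_big = _mm_cmpgt_ps(a, limit);   // all-ones in lanes where a > limit
    return _mm_blendv_ps(a, limit, too_big);   // take 'limit' in those lanes (SSE4.1)
}

// Scalar FP math also lives in XMM registers nowadays: the _ss intrinsics
// operate on the low lane only, replacing x87 for ordinary float code.
float scalar_sqrt(float x) {
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}
```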

Efficient use usually requires programs to store their data in contiguous chunks, so it can be loaded in chunks of 16B and used without too much shuffling. SSE doesn't offer strided loads/stores, only contiguous (packed) ones. (SoA vs. AoS: structs-of-arrays vs. arrays-of-structs.) Alignment requirements on memory operands can also be a hurdle, even though modern hardware has fast unaligned loads/stores.
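
A short illustration of the SoA point (struct and function names invented for the example; n is assumed to be a multiple of 4):

```cpp
#include <immintrin.h>

// AoS: one 16B load here grabs x,y,z,w of a single point, not four x's,
// so per-field math needs shuffling or scalar gathering.
struct PointAoS { float x, y, z, w; };

// SoA: four consecutive x's are contiguous and 16B-aligned, so a whole
// vector's worth loads in one instruction.
struct PointsSoA {
    alignas(16) float x[1024];
    alignas(16) float y[1024];
};

// Add 1.0f to every x, four at a time.
void bump_x(PointsSoA& p, int n) {
    const __m128 one = _mm_set1_ps(1.0f);
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(&p.x[i]);             // aligned load (alignas(16))
        _mm_store_ps(&p.x[i], _mm_add_ps(v, one));
    }
}
```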

While there are many instructions available, the instruction set is not very orthogonal. It's not uncommon to find the operation you need, but only for elements of a different size than the one you're working with. Another good example is that the floating-point shuffle (SHUFPS) has different semantics from the 32b-integer shuffle (PSHUFD).
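
For instance, reversing the four 32b lanes of a vector looks slightly different in the integer and FP domains (illustrative snippet, not from the wiki):

```cpp
#include <immintrin.h>

// PSHUFD: one source operand; every output lane can come from any input lane.
__m128i reverse_epi32(__m128i v) {
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
}

// SHUFPS: two source operands, but the low two output lanes must come from the
// first operand and the high two from the second, so reversing a single vector
// means passing it twice.
__m128 reverse_ps(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}
```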

Details

SSE added new architectural registers (xmm0-xmm7, 128b each (xmm0-xmm15 in 64bit mode)), requiring OS support to save/restore them on context switches. The previous MMX extensions (for integer SIMD) reused the x87 FP registers.

Intel introduced MMX, original-SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2. AMD's XOP (a revision of their SSE5 plans) was never picked up by Intel, and will be dropped even by future AMD designs. The instruction-set war between Intel and AMD has led to many sub-optimal results, which makes instruction decoders in CPUs require more power and transistors. (And limits opportunity for further extensions).

"SSE" commonly refers to the whole family of extensions. Programs must make sure to use only instructions supported by the machine they run on; this is taken as implied and not worth cluttering our language with. (Setting function pointers once at startup is a good way to detect what's supported, avoiding a branch to select an appropriate function every time one is needed; a dispatch sketch follows below.)

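A minimal GCC/Clang-flavoured sketch of that function-pointer dispatch (function names invented; MSVC would use __cpuid instead of __builtin_cpu_supports):

```cpp
#include <immintrin.h>

// SSE2-baseline fallback: kept scalar here for brevity
// (32-bit packed multiply, PMULLD, only arrived with SSE4.1).
static void mul_sse2(const int* a, const int* b, int* out, int n) {
    for (int i = 0; i < n; ++i) out[i] = a[i] * b[i];
}

// GCC/Clang: allow SSE4.1 instructions in this one function even if the
// translation unit is compiled for plain SSE2.
__attribute__((target("sse4.1")))
static void mul_sse41(const int* a, const int* b, int* out, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(out + i), _mm_mullo_epi32(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] * b[i];   // leftover elements
}

// Resolved once at startup; afterwards every call is just an indirect call,
// with no per-call feature test.
static void (*mul_i32)(const int*, const int*, int*, int) = mul_sse2;

void init_dispatch() {
    if (__builtin_cpu_supports("sse4.1")) mul_i32 = mul_sse41;
}
```
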
Further SSE extensions are not expected: AVX introduced a new 3-operand version of all SSE instructions, as well as some new features (including dropping alignment requirements, except for explicitly-aligned moves like vmovdqa). Further vector extensions will be called AVX-something, until Intel comes up with something different enough to change the name again.

History

SSE, first introduced with the Pentium III in 1999, was Intel's reply to AMD's 3DNow! extension, introduced in 1998.

The original-SSE added vector single-precision floating point math. Integer instructions to operate on xmm registers (instead of 64bit mmx regs) didn't appear until SSE2.

Original-SSE can be considered somewhat half-hearted insofar as it only covered the most basic operations and suffered from severe limitations both in functionality and performance, making it mostly useful for a few select applications, such as audio or raster image processing.

Most of SSE's limitations were addressed by SSE2. The most notable limitation remaining to date is the lack of a horizontal add or dot product that is both efficient and widely available: while SSE3 and SSE4.1 added horizontal-add and dot-product instructions, they're usually slower than a manual shuffle+add, so use them only at the end of a loop (a shuffle-based horizontal sum is sketched below).
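
For reference, a typical shuffle-based horizontal sum of a __m128 looks like this (a sketch; it needs SSE3 for _mm_movehdup_ps, and an SSE1-only version would use _mm_shuffle_ps instead):

```cpp
#include <immintrin.h>

// Sum the four lanes of v using shuffles + adds; usually cheaper than
// two rounds of HADDPS on most CPUs.
float hsum_ps(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);         // {v1, v1, v3, v3}  (SSE3)
    __m128 sums = _mm_add_ps(v, shuf);        // {v0+v1, ..., v2+v3, ...}
    shuf        = _mm_movehl_ps(shuf, sums);  // bring v2+v3 down to the low lane
    sums        = _mm_add_ss(sums, shuf);     // low lane = v0+v1+v2+v3
    return _mm_cvtss_f32(sums);
}
```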

The lack of cross-manufacturer support made software development with SSE a challenge during the initial years. With AMD's adoption of SSE2 into its 64bit processors during 2003/2004, this problem gradually disappeared. As of today, there exist virtually no processors without SSE/SSE2 support. SSE2 is part of x86-64 baseline, with twice as many vector registers available in 64bit mode.

2314 questions
1 vote · 1 answer

How to load 16 bytes of memory into a Rust __m128i?

I am trying to load 16 bytes of memory into an __m128i type from the std::arch module: #[cfg(all(target_arch = "x86_64", target_feature = "sse2"))] use std::arch::x86_64::__m128i; fn foo() { #[cfg(all(target_arch = "x86_64", target_feature =…
Josh Weinstein
1 vote · 0 answers

Get uint64_t out of __m128 intrinsic?

I can use _mm_set_epi64 to store two uint64_ts into a __m128 intrinsic. But hunting around, I see various ways to get the values back out: There's reinterpret_cast (and it's evil twin C-style casts), it's sibling union { __m128; uint64[2]; }; and…
Ben
1 vote · 1 answer

"Instruction operands must be the same size" for MOVDQU from .data array

I have an .asm file with 2 arrays: .DATA compara byte 16 dup (?) subtrai byte 16 dup (128) Then I tried to use movdqu on the arrays (to xmm1 and xmm2), but I'm having a problem. Even though they are the same size, each array stores 16 bytes of…
user11420703
1 vote · 0 answers

How would you port this "unsigned int" scalar code to "signed int" vector?

I need to port a Xorshift algorithm from scalar to vector code (SSE/SIMD version built with -march=nocona). I'm using the uint32_t version of the algorithm (taken directly from wiki): #include struct xorshift32_state { uint32_t…
markzzz
1 vote · 3 answers

Convert 16 bits mask to 16 bytes mask

Is there any way to convert the following code: int mask16 = 0b1010101010101010; // int or short, signed or unsigned, it does not matter to __uint128_t mask128 = ((__uint128_t)0x0100010001000100 << 64) | 0x0100010001000100; So to be extra clear…
Antonin GAVREL
1 vote · 2 answers

how clang decide alignment and use aligned load/store instruction

In my recent C++ code; I found Clang generated asm code use instruction movaps to memset the object to 0. because of this movaps instruction need memory alignment of 16; and when i use a self allocated buffer to initialize this object, the program…
Chinaxing
1 vote · 1 answer

Load or shuffle a pair of floats with SIMD intrinsics for doubles?

I write some optimizations for processing single precision floating-point calculation SIMD intrinsics. Sometimes a pd double-precision instruction does what I want more easily than any ps single precision one. Example 1: I have pointer float prt*…
Yuriy
1 vote · 1 answer

How to read optimally from an array (in memory) having array position from a vector?

I've such a code: const rack::simd::float_4 pos = phase * waveTable.mLength; const rack::simd::int32_4 pos0 = pos; const rack::simd::float_4 frac = pos - (rack::simd::float_4)pos0; rack::simd::float_4 v0; rack::simd::float_4 v1; for (int v = 0; v <…
markzzz
1 vote · 1 answer

Mixing OpenMP and xmmintrin SSE Intrinsics - not getting speedup over the non-parallel version

I've implemented a version of the Travelling Salesman with xmmintrin.h SSE instructions, received a decent speedup. But now I'm also trying to implement OpenMP threading on top of it, and I'm seeing a pretty drastic slow down. I'm getting the…
NukPan
1 vote · 1 answer

what is difference between *(__m128*)(&A) and (__m128)A

what is difference between *(B*)(&A) and (B)A I'm using simd codes. but I confront problem. I couldn't cast my own vector4 type to __m128 So I did like this this works well #define XMM128Float(VECTOR4FLOAT) *(__m128*)(&VECTOR4FLOAT) Vector4
SungJinKang
1 vote · 2 answers

Bit manipulations with SSE on subbytes?

Is it possible to use SSE for bit manipulations on data that is not byte-aligned? For example, I would like to do implement this using SSE: const char buf[8]; assert(n <= 8); long rv = 0; for (int i = 0; i < n; i++) rv = (rv << 6) | (buf[i] &…
hrr
1 vote · 0 answers

Interleave two vectors

I'm trying my first steps with SIMD and I was wondering what the right approach is to the following problem. Consider two vectors: +---+---+---+---+ +---+---+---+---+ | 0 | 1 | 2 | 3 | | 4 | 5 | 6 | 7 | +---+---+---+---+ …
Ecir Hana
1 vote · 1 answer

Gcc misoptimises sse function

I'm converting a project to compile with gcc from clang and I've ran into a issue with a function that uses sse functions: void dodgy_function( const short* lows, const short* highs, short* mins, short* maxs, int its ) { …
Biggy Smith
1 vote · 1 answer

Optimizing find_first_not_of with SSE4.2 or earlier

I am writing a textual packet analyzer for a protocol and in optimizing it I found that a great bottleneck is the find_first_not_of call. In essence, I need to find if a packet is valid if it contains only valid characters, faster than the default…
senseiwa
1 vote · 1 answer

Shift values in AVX2 register, grabbing last one from another register

I have two AVX2 registers, for instance with following values: m0 = {0,1,2,3,4,5,6,7} m1 = {8,9,a,b,c,d,e,f} I need to shift m0 grabbing last value from m1: m0 = {1,2,3,4,5,6,7,8} Then perform some arithmetic with m0, and shift again: m0 =…
user2052436