Questions tagged [sse]

SSE (Streaming SIMD Extensions) was the first of many similarly-named vector extensions to the x86 instruction set. At this point, SSE is more often used as a catch-all for x86 vector instructions in general than as a reference to SSE without SSE2, SSE3, etc. (For Server-Sent Events, use the [server-sent-events] tag instead.)

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions.

SIMD / SSE basics: What are the 128-bit to 512-bit registers used for? with links to many examples.

SSE/SIMD vector programming guides, focused on the SIMD aspect rather than general x86:

Agner Fog's Optimizing Assembly guide has a chapter on vectors, including tables of data movement instructions: broadcasts within a vector, combining data between two vectors, different kinds of shuffles, etc. It's great for finding the right instruction (or intrinsic) for the data movement you need.
Crunching Numbers with AVX and AVX2: An intro with examples of using C++ intrinsics
Slides + text: SIMD at Insomniac Games (GDC 2015): intro to SIMD, and some specific examples: checking all doors against all characters in a level. Advanced tricks: Filtering an array into a smaller array (using Left-packing based on a compare mask), with an SSSE3 pshufb solution and an SSE2 move-distance solution. Also: generating N-bit masks for variable-per-element N. Including a clever float-exponent based SSE2 version.

Instruction-set / intrinsics reference guides (see the x86 tag wiki for more links)

Intel's vector intrinsics finder/search (very good): search by asm mnemonic or C intrinsic name. Filter by type and/or by instruction-set extension family (e.g. exclude AVX512 and later). Occasionally buggy, esp. the performance info. (Look at Agner Fog's tables for performance info, although they have occasional errors or typos, too.)
Intel's manuals, including instruction set reference manual. Very detailed description of what every instruction does, using pseudo-code. These manuals are accurate much more often than the intrinsics guide.
x86/x64 SIMD Instruction List (SSE to AVX512) Beta: A nice compact table listing instruction mnemonics and their intrinsics, broken down by type and element-size. Detailed pages with graphical data-movement diagrams for each instruction.

Miscellaneous specific things:

Shuffling by mask with Intel AVX explains how shuffle-control vectors and _MM_SHUFFLE work, including in-lane vs. lane-crossing for AVX.
SSE interleave/merge/combine 2 vectors using a mask, per-element conditional move? Blends, especially variable blends (blendvps)
What are the best instruction sequences to generate vector constants on the fly?. In C/C++, almost always prefer _mm_set or _mm_set1 to initialize local variables (not globals), rather than defining arrays and loading from them.
print a __m128i variable: How to safely and portably access the elements of a vector, and how to debug-print them.
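As a rough illustration of the last item, here is a minimal sketch of one safe way to read vector elements: store the vector to a plain array and index that, rather than pointer-casting the vector itself. The helper names `extract_epi32` and `print_epi32` are made up for this example and do not come from the linked answers.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>
#include <stdio.h>

/* Copy the four 32-bit elements of v into out[0..3]; out[0] is the lowest element.
   Going through memory with an unaligned store sidesteps aliasing problems. */
void extract_epi32(__m128i v, int32_t out[4])
{
    _mm_storeu_si128((__m128i *)out, v);
}

/* Debug-print a vector as four signed 32-bit integers, lowest element first. */
void print_epi32(const char *label, __m128i v)
{
    int32_t lanes[4];
    extract_epi32(v, lanes);
    printf("%s: %d %d %d %d\n", label,
           (int)lanes[0], (int)lanes[1], (int)lanes[2], (int)lanes[3]);
}
```

Note that `_mm_set_epi32` takes its arguments from the highest element down, so `print_epi32("v", _mm_set_epi32(3, 2, 1, 0))` prints `v: 0 1 2 3`.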

Streaming SIMD Extensions (SSE) basics

Together, the various SSE extensions allow working with 128b vectors of float, double, or integer (from 8b to 64b) elements. There are instructions for arithmetic, bitwise operations, shuffles, blends (conditional moves), compares, and some more-specialized operations (e.g. SAD for multimedia, carryless-multiply for crypto/finite-field math, strings (for strstr() and so on)). FP sqrt is provided, but unlike the x87 FPU, math library functions like sin must be implemented by software. SSE for scalar FP math has replaced x87 floating point, now that hardware support is near-universal.
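To make "compares" and "blends (conditional moves)" concrete, here is a small sketch using only original-SSE intrinsics. A packed compare yields an all-ones or all-zeros mask per element, and AND / ANDNOT / OR then select between two inputs without branching. The function name `select_lt4` is hypothetical; SSE4.1's `blendvps` does the same select in one instruction.

```c
#include <xmmintrin.h>  /* original SSE: packed single-precision float */

/* For each of 4 elements: out[i] = (a[i] < b[i]) ? c[i] : d[i] */
void select_lt4(const float *a, const float *b,
                const float *c, const float *d, float *out)
{
    /* mask element = all-ones where a < b, all-zeros otherwise */
    __m128 mask = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 vc = _mm_loadu_ps(c);
    __m128 vd = _mm_loadu_ps(d);
    /* (mask & c) | (~mask & d): keep c where the compare was true, d elsewhere */
    __m128 r = _mm_or_ps(_mm_and_ps(mask, vc), _mm_andnot_ps(mask, vd));
    _mm_storeu_ps(out, r);
}
```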

Efficient use usually requires programs to store their data in contiguous chunks, so it can be loaded in chunks of 16B and used without too much shuffling. SSE doesn't offer loads / stores with a stride, only packed. (SoA vs. AoS: structs-of-arrays vs. arrays-of-structs). Alignment requirements on memory operands can also be a hurdle, even though modern hardware has fast unaligned loads/stores.
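A sketch of why the structs-of-arrays layout helps, assuming a hypothetical `struct points_soa` that keeps all x values in one array and all y values in another. Each 16B load then brings in four values of the same field, and no shuffling is needed before the arithmetic.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical SoA layout: each field is stored contiguously. */
struct points_soa {
    float *x;
    float *y;
    size_t n;
};

/* out[i] = x[i]*x[i] + y[i]*y[i] for every point */
void squared_lengths(const struct points_soa *p, float *out)
{
    size_t i = 0;
    for (; i + 4 <= p->n; i += 4) {
        __m128 x = _mm_loadu_ps(p->x + i);  /* four consecutive x values */
        __m128 y = _mm_loadu_ps(p->y + i);  /* four consecutive y values */
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)));
    }
    for (; i < p->n; i++)                   /* scalar tail for the leftovers */
        out[i] = p->x[i] * p->x[i] + p->y[i] * p->y[i];
}
```

With an array of `{x, y}` structs instead, the same loop would first have to de-interleave the loaded data with shuffles. The unaligned loads used here avoid the alignment requirement mentioned above.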

While there are many instructions available, the instruction set is not very orthogonal. It's not uncommon to find the operation you need, but only available for elements of a different size than you're working with. Another good example is that floating point shuffles (SHUFPS) have different semantics than 32b-integer shuffles (PSHUFD).
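The SHUFPS / PSHUFD difference can be seen from the intrinsics: `_mm_shuffle_epi32` takes one source, while `_mm_shuffle_ps` takes two and picks the low two result elements from the first and the high two from the second. The sketch below uses made-up function names for illustration.

```c
#include <emmintrin.h>  /* SSE2; also pulls in xmmintrin.h for _MM_SHUFFLE */

/* PSHUFD: one source. Reverse the four 32-bit elements of v. */
__m128i reverse_epi32(__m128i v)
{
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
}

/* SHUFPS: two sources. Result is { a[0], a[1], b[2], b[3] }:
   the two low selectors index into a, the two high selectors index into b. */
__m128 low_a_high_b(__m128 a, __m128 b)
{
    return _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0));
}

/* Passing the same vector twice makes SHUFPS behave like a one-source shuffle. */
__m128 reverse_ps(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}
```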

Details

SSE added new architectural registers (xmm0-xmm7, 128b each (xmm0-xmm15 in 64bit mode)), requiring OS support to save/restore them on context switches. The previous MMX extensions (for integer SIMD) reused the x87 FP registers.

Intel introduced MMX, original-SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2. AMD's XOP (a revision of their SSE5 plans) was never picked up by Intel, and AMD itself dropped it starting with its Zen designs. The instruction-set war between Intel and AMD has led to many sub-optimal results, which make instruction decoders in CPUs require more power and transistors (and limit the opportunity for further extensions).

"SSE" commonly refers to the whole family of extensions. Writing programs that make sure to only use instructions supported by the machine they run on is necessary, and implied, and not worth cluttering our language with. (A good approach is to detect what's supported once at startup and set function pointers accordingly, which avoids a branch to select an appropriate function every time one is needed.)
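A minimal sketch of that function-pointer dispatch, assuming GCC or Clang on x86: `__builtin_cpu_supports` and `__attribute__((target(...)))` are compiler-specific, and other compilers need a CPUID-based check instead. All function names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1 intrinsics (pmaxsd) */

/* Portable fallback. Requires n >= 1. */
static int32_t max_scalar(const int32_t *p, size_t n)
{
    int32_t m = p[0];
    for (size_t i = 1; i < n; i++)
        if (p[i] > m) m = p[i];
    return m;
}

/* SSE4.1 version; the target attribute lets it compile without -msse4.1. */
__attribute__((target("sse4.1")))
static int32_t max_sse41(const int32_t *p, size_t n)
{
    size_t i = 0;
    int32_t m = p[0];
    if (n >= 4) {
        __m128i vmax = _mm_loadu_si128((const __m128i *)p);
        for (i = 4; i + 4 <= n; i += 4)
            vmax = _mm_max_epi32(vmax, _mm_loadu_si128((const __m128i *)(p + i)));
        int32_t lanes[4];
        _mm_storeu_si128((__m128i *)lanes, vmax);
        m = max_scalar(lanes, 4);           /* reduce the four lanes */
    }
    for (; i < n; i++)                      /* leftover elements */
        if (p[i] > m) m = p[i];
    return m;
}

typedef int32_t (*max_fn)(const int32_t *, size_t);
static max_fn max_impl = max_scalar;        /* safe default until init runs */

/* Call once at startup: picks the best implementation for this CPU. */
void init_dispatch(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.1"))
        max_impl = max_sse41;
}

/* Public entry point: one indirect call, no per-call feature check. */
int32_t max_i32(const int32_t *p, size_t n)
{
    return max_impl(p, n);
}
```

Callers run `init_dispatch()` once (for example at the top of `main`) and then use `max_i32` everywhere; the result is the same whichever implementation gets selected.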

Further SSE extensions are not expected: AVX introduced a new 3-operand version of all SSE instructions, as well as some new features (including dropping alignment requirements, except for explicitly-aligned moves like vmovdqa). Further vector extensions will be called AVX-something, until Intel comes up with something different enough to change the name again.

History

SSE, first introduced with the Pentium III in 1999, was Intel's reply to AMD's 3DNow extension in 1998.

The original-SSE added vector single-precision floating point math. Integer instructions to operate on xmm registers (instead of 64bit mmx regs) didn't appear until SSE2.

Original-SSE can be considered somewhat half-hearted insofar as it only covered the most basic operations and suffered from severe limitations both in functionality and performance, making it mostly useful for a few select applications, such as audio or raster image processing.

Most of SSE's limitations were ameliorated by the SSE2 instruction set. The only notable limitation remaining to date is the lack of a horizontal addition or dot product operation that is both efficient and widely available. While SSE3 and SSE4.1 added horizontal add and dot product instructions, they're usually slower than a manual shuffle+add. Only use horizontal operations at the end of a loop, after accumulating with ordinary vertical adds.
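A sketch of the manual shuffle+add approach, using only original-SSE intrinsics. The loop accumulates with vertical adds and performs the horizontal reduction once at the end. The function names are illustrative, and this is one common sequence rather than the only one.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Horizontal sum of the four floats in v: two shuffles and two adds. */
float hsum_ps(__m128 v)
{
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* {v1, v0, v3, v2} */
    __m128 sums = _mm_add_ps(v, shuf);      /* {v0+v1, v0+v1, v2+v3, v2+v3} */
    shuf = _mm_movehl_ps(shuf, sums);       /* low element = v2+v3 */
    sums = _mm_add_ss(sums, shuf);          /* low element = v0+v1+v2+v3 */
    return _mm_cvtss_f32(sums);
}

/* Sum an array: vertical adds inside the loop, one horizontal sum after it. */
float sum_array(const float *p, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
    float total = hsum_ps(acc);
    for (; i < n; i++)                      /* scalar tail */
        total += p[i];
    return total;
}
```

Because the elements are added in a different order than in a plain scalar loop, floating-point rounding can differ slightly from the scalar result.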

The lack of cross-manufacturer support made software development with SSE a challenge during the initial years. With AMD's adoption of SSE2 into its 64bit processors during 2003/2004, this problem gradually disappeared. As of today, there exist virtually no processors without SSE/SSE2 support. SSE2 is part of x86-64 baseline, with twice as many vector registers available in 64bit mode.

2314 questions

1 answer

Can someone walk me through the x64 assembly code for this GCC auto-vectorized C loop that sums an array

I compiled the following C code into assembly with -O3 and I am confused why we shift right to %xmm1 and add it back to %xmm0. Can someone walk me through what the assembly code does and why it makes everything a factor of 16 rather than 4? The code in C…

assembly x86-64 simd sse auto-vectorization

asked Feb 22 '22 at 03:35

confused student

2 answers

How can I search for intel intrinsic functions in timing tables?

I've looked through the sse wiki and x86 wiki, and there appear to be several great references for looking up either specific intel intrinsic functions or the latencies of assembly instructions on various processor architectures. Intel's intrinsics…

simd sse intrinsics

asked Feb 16 '22 at 16:47

drakon101

1 answer

How does strncmp using SSE 4.2 avoid reading beyond the page boundaries when loading 16 bytes?

glibc now uses SSE 4.2 to optimize strncmp: https://github.com/lattera/glibc/blob/master/sysdeps/x86_64/multiarch/strcmp-sse42.S https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 This can be seen in a debugger: 0xf7f20218…

memory x86 valgrind sse glibc

asked Feb 09 '22 at 05:24

SRobertJames

1 answer

Why doesn't gcc zero the upper values of an XMM register when only using the lower value with SS/SD instructions?

For example with such function, int fb(char a, char b, char c, char d) { return (a + b) - (c + d); } gcc's assembly output is, fb: movsx esi, sil movsx edi, dil movsx ecx, cl movsx edx, dl add …

c assembly x86 sse calling-convention

asked Jan 18 '22 at 17:42

xiver77

2 answers

Multiplying and adding float numbers

I have a task to convert some c++ code to asm and I wonder if what I am thinking makes any sense. First I would convert integers to floats. I would like to get array data to sse register, but here is problem, because I want only 3 not 4 integers, is…

c++ assembly x86-64 masm sse

asked Jan 08 '22 at 22:31

thomas113412

0 answers

Why does rounding-to-nearest-with-ties-away-from-zero require more instructions and what is their purpose?

Consider this example, in which various rounding operations (round-up, round-down, round-toward-zero and round-to-nearest-with-ties-to-even) can all be expressed with a single roundsd instruction: use_floor(double): roundsd xmm0, xmm0, 9 …

assembly x86 rounding sse

asked Jan 06 '22 at 13:01

soc

1 answer

How to horizontally sum signed bytes in XMM

I am writing some code in x64 assembly and using SIMD. I have 9 bytes packed in the xmm15 register. For simplicity, let's look at the following code: .data Masks BYTE 0, -1, 0, -1, 5, -1, 0, -1, 0 .code GetSumOfMasks proc movdqu xmm15, xmmword ptr…

assembly x86-64 masm sse masm64

asked Dec 15 '21 at 21:20

nooblet2

1 answer

Compact storage of shuffle vectors: unpacking 4 bytes to shuffle uint32_t elements with a byte-shuffle

I have a cross architecture code that looks up a shuffle by index, for moving uint32_t elements within a vector. A whole vector constant is needed for each shuffle, but there are only 4 bytes of non-redundant information. (Or really 4x 2 bits of…

c sse intrinsics neon

asked Sep 28 '21 at 20:53

Denis Yaroshevskiy

0 answers

Why do bitwise operation (and, or, xor) on floating-point data types exist in SSE/AVX

SSE has _mm_xor_ps, _mm_xor_pd, _mm_and_ps, _mm_and_pd, _mm_or_ps, _mm_or_pd. As floating-point type consist of mantissa, exponent, and sign, the result of treating them as sequence of bits does not look meaningful (except xoring with self to have…

sse

asked Sep 10 '21 at 12:49

Alex Guteniev

3 answers

C - How can I define SIMD variable(s) outside of a function?

const __m128i ___n = _mm_set_epi32( 0x80808080, 0x80808080, 0x80808080, 0x80808080 ); const __m128i w___ = _mm_set_epi32( 0x80808080, 0x80808080, 0x80808080, 0x0f0e0d0c ); const __m128i z___ = _mm_set_epi32( 0x80808080, 0x80808080,…

c simd sse intrinsics

asked Aug 29 '21 at 13:59

Timothy s

1 answer

can I assign the result of intrinsic that returns __m128i to variable of the type __m128i_u?

as in the title - I want to do as below: __m128i_u* avxVar = (__m128i_u*)Var; // Var allocated with alloc *avxVar = _mm_set_epi64(...); // is that ok to assign __m128i to __m128i_u ?

simd sse intrinsics sse2

asked Aug 26 '21 at 13:35

vela18

1 answer

Can't convert value to a vector with Intel Intrinsics

I am using Intel Intrinsics and getting this odd error. src/header/header.c:18:3: error: can’t convert value to a vector 18 | int has_value = (int)_mm_cmpestrc(buffer, 4, u_str.vec, 4, | ^~~ I have tried the below without the…

c gcc x86 sse intrinsics

asked Aug 22 '21 at 19:44

Christopher Clark

1 answer

how to debug a _mm_mul_ps function?

I've this code: inline __m128 process(const __m128 *buffer) { __m128 crashTest; for (int i = 0; i < mFactor; i++) { crashTest = _mm_mul_ps(buffer[i], _mm_set1_ps((float)(((int32_t)1) << 16))); } return crashTest; } when I…

c++ segmentation-fault sse simd intrinsics

asked Aug 03 '21 at 17:39

markzzz

1 answer

Mult plus shift left ops using MMX assembler instructions

I am looking for doing shl(mult(var1,var2),1) operation, where mult multiplies var1 and var2 (both are 16-bit signed integers) and shl shifts left arithmetically the multiplication result. Result must be saturated, i.e., int32 max or int32 min if…

assembly x86 sse mmx saturation-arithmetic

asked Jul 27 '11 at 19:15

LooPer

0 answers

A simple WebAssembly and Javascript Benchmark Scenario

I built a simple javascript vs. WebAssembly/SIMD benchmark as follows: var sum = 0; for (var c=0; c

benchmarking sse simd webassembly

asked Jul 18 '21 at 21:01

user2566142
