Questions tagged [sse]

SSE (Streaming SIMD Extensions) was the first of many similarly-named vector extensions to the x86 instruction set. At this point, SSE is more often a catch-all for x86 vector instructions in general, and not a reference to SSE without SSE2, SSE3, etc. (For Server-Sent Events, use the [server-sent-events] tag instead.)

See the tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions.

SIMD / SSE basics: What are the 128-bit to 512-bit registers used for? with links to many examples.


SSE/SIMD vector programming guides, focused on the SIMD aspect rather than general x86:

  • Agner Fog's Optimizing Assembly guide has a chapter on vectors, including tables of data movement instructions: broadcasts within a vector, combining data between two vectors, different kinds of shuffles, etc. It's great for finding the right instruction (or intrinsic) for the data movement you need.

  • Crunching Numbers with AVX and AVX2: An intro with examples of using C++ intrinsics

  • Slides + text: SIMD at Insomniac Games (GDC 2015): intro to SIMD, and some specific examples: checking all doors against all characters in a level. Advanced tricks: filtering an array into a smaller array (using left-packing based on a compare mask), with an SSSE3 pshufb solution and an SSE2 move-distance solution. Also: generating N-bit masks for variable-per-element N, including a clever float-exponent-based SSE2 version.


Instruction-set / intrinsics reference guides (see the x86 tag wiki for more links)


Miscellaneous specific things:


Streaming SIMD Extensions (SSE) basics

Together, the various SSE extensions allow working with 128b vectors of float, double, or integer (from 8b to 64b) elements. There are instructions for arithmetic, bitwise operations, shuffles, blends (conditional moves), compares, and some more-specialized operations (e.g. SAD for multimedia, carryless-multiply for crypto/finite-field math, and string instructions for strstr() and so on). FP sqrt is provided, but unlike with the x87 FPU, math library functions like sin must be implemented in software. SSE for scalar FP math has replaced x87 floating point, now that hardware support is near-universal.

Efficient use usually requires programs to store their data in contiguous chunks, so it can be loaded in chunks of 16B and used without too much shuffling. SSE doesn't offer loads / stores with a stride, only packed ones, which is why a structs-of-arrays (SoA) layout usually vectorizes better than an arrays-of-structs (AoS) layout. Alignment requirements on memory operands can also be a hurdle, even though modern hardware has fast unaligned loads/stores.
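As a minimal sketch of the SoA idea (the function name `translate_soa` and the two-array layout are assumptions made up for illustration, not from any particular codebase): with x and y coordinates in separate contiguous arrays, each 16B load brings in four useful floats and no shuffling is needed. The sketch assumes the element count is a multiple of 4 and uses unaligned loads/stores so it has no alignment requirement.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, ... */

/* Hypothetical SoA layout: xs and ys are separate contiguous arrays.
   Adds (dx, dy) to every point, four points per iteration. */
static void translate_soa(float *xs, float *ys, size_t n, float dx, float dy)
{
    assert(n % 4 == 0);              /* sketch only: no scalar tail loop */
    __m128 vdx = _mm_set1_ps(dx);    /* broadcast dx to all four lanes */
    __m128 vdy = _mm_set1_ps(dy);
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(xs + i);   /* one packed 16B load */
        __m128 vy = _mm_loadu_ps(ys + i);
        _mm_storeu_ps(xs + i, _mm_add_ps(vx, vdx));
        _mm_storeu_ps(ys + i, _mm_add_ps(vy, vdy));
    }
}
```

With an AoS layout (x, y interleaved in memory), the same loop would need shuffles to separate the fields before the adds.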

While there are many instructions available, the instruction set is not very orthogonal. It's not uncommon to find the operation you need, but only available for elements of a different size than you're working with. Another good example is that floating point shuffles (SHUFPS) have different semantics than 32b-integer shuffles (PSHUFD).
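To make the SHUFPS / PSHUFD difference concrete, here is a small sketch using the corresponding intrinsics with the same immediate (the helper names `shufps_reverse` and `pshufd_reverse` are made up for this example). `_mm_shuffle_ps` takes its two low result elements from the first operand and its two high result elements from the second, while `_mm_shuffle_epi32` selects all four from a single source.

```c
#include <emmintrin.h>  /* SSE2 (also provides the SSE intrinsics) */

/* SHUFPS: out = { a[3], a[2], b[1], b[0] } for this immediate. */
static void shufps_reverse(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_shuffle_ps(va, vb, _MM_SHUFFLE(0, 1, 2, 3)));
}

/* PSHUFD: out = { a[3], a[2], a[1], a[0] } for the same immediate. */
static void pshufd_reverse(const int a[4], int out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    _mm_storeu_si128((__m128i *)out,
                     _mm_shuffle_epi32(va, _MM_SHUFFLE(0, 1, 2, 3)));
}
```

Passing the same vector as both operands of `_mm_shuffle_ps` gives the single-source behaviour for floats.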

Details

SSE added new architectural registers (xmm0-xmm7, 128b each (xmm0-xmm15 in 64bit mode)), requiring OS support to save/restore them on context switches. The previous MMX extensions (for integer SIMD) reused the x87 FP registers.

Intel introduced MMX, original-SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2. AMD's XOP (a revision of their SSE5 plans) was never picked up by Intel, and will be dropped even by future AMD designs. The instruction-set war between Intel and AMD has led to many sub-optimal results, which makes instruction decoders in CPUs require more power and transistors, and limits the opportunity for further extensions.

"SSE" commonly refers to the whole family of extensions. Writing programs that make sure to only use instructions supported by the machine they run on is necessary, and implied, and not worth cluttering our language with. (Setting function pointers is a good way to detect what's supported once at startup, avoiding a branch to select an appropriate function every time one is needed.)

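The function-pointer approach to runtime dispatch could look roughly like the sketch below. It assumes GCC or Clang on x86: `__builtin_cpu_init`, `__builtin_cpu_supports`, and `__attribute__((target(...)))` are compiler extensions, not standard C, and the names `max_scalar`, `max_sse41`, `max_i32`, and `init_dispatch` are hypothetical. The pointer starts at a baseline version and is upgraded once at startup if SSE4.1 is available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_max_epi32 */

/* Baseline version: runs on any x86 CPU. */
static int32_t max_scalar(const int32_t *v, size_t n)
{
    assert(n > 0);
    int32_t m = v[0];
    for (size_t i = 1; i < n; i++)
        if (v[i] > m) m = v[i];
    return m;
}

/* SSE4.1 version; the target attribute lets it compile without -msse4.1. */
__attribute__((target("sse4.1")))
static int32_t max_sse41(const int32_t *v, size_t n)
{
    assert(n > 0);
    size_t i = 0;
    int32_t m = v[0];
    if (n >= 4) {
        __m128i vmax = _mm_loadu_si128((const __m128i *)v);
        for (i = 4; i + 4 <= n; i += 4)
            vmax = _mm_max_epi32(vmax,
                                 _mm_loadu_si128((const __m128i *)(v + i)));
        /* Reduce the four lanes to one: swap halves, then swap pairs. */
        vmax = _mm_max_epi32(vmax, _mm_shuffle_epi32(vmax, _MM_SHUFFLE(1, 0, 3, 2)));
        vmax = _mm_max_epi32(vmax, _mm_shuffle_epi32(vmax, _MM_SHUFFLE(2, 3, 0, 1)));
        m = _mm_cvtsi128_si32(vmax);
    }
    for (; i < n; i++)               /* scalar tail */
        if (v[i] > m) m = v[i];
    return m;
}

typedef int32_t (*max_fn)(const int32_t *, size_t);
static max_fn max_i32 = max_scalar;  /* safe default until init runs */

/* Call once at startup; later calls go through max_i32 with no feature branch. */
static void init_dispatch(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.1"))
        max_i32 = max_sse41;
}
```

After `init_dispatch()` has run, callers just write `max_i32(data, n)` and get the best version the CPU supports.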
Further SSE extensions are not expected: AVX introduced a new 3-operand version of all SSE instructions, as well as some new features (including dropping alignment requirements, except for explicitly-aligned moves like vmovdqa). Further vector extensions will be called AVX-something, until Intel comes up with something different enough to change the name again.

History

SSE, first introduced with the Pentium III in 1999, was Intel's reply to AMD's 3DNow! extension of 1998.

The original-SSE added vector single-precision floating point math. Integer instructions to operate on xmm registers (instead of 64bit mmx regs) didn't appear until SSE2.

Original-SSE can be considered somewhat half-hearted insofar as it only covered the most basic operations and suffered from severe limitations both in functionality and performance, making it mostly useful for a few select applications, such as audio or raster image processing.

Most of SSE's limitations were ameliorated by the SSE2 instruction set. The only notable limitation remaining to date is the lack of a horizontal addition or dot product operation that is both efficient and widely available. While SSE3 and SSE4.1 added horizontal add and dot product instructions, they're usually slower than manual shuffle+add. Only use horizontal operations at the end of a loop, not inside it.
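A manual shuffle+add horizontal sum, used once after a loop of vertical operations, might be sketched like this. It uses only original-SSE intrinsics; the names `hsum_ps` and `dot` are chosen for this example, and the dot product assumes the length is a multiple of 4 to stay short.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* original SSE only */

/* Sum the four floats of v with two shuffle+add steps. */
static float hsum_ps(__m128 v)
{
    __m128 hi  = _mm_movehl_ps(v, v);   /* { v2, v3, v2, v3 } */
    __m128 sum = _mm_add_ps(v, hi);     /* { v0+v2, v1+v3, ... } */
    __m128 odd = _mm_shuffle_ps(sum, sum, _MM_SHUFFLE(1, 1, 1, 1)); /* lane 1 everywhere */
    return _mm_cvtss_f32(_mm_add_ss(sum, odd));  /* (v0+v2) + (v1+v3) */
}

/* Dot product: vertical multiply/add inside the loop, one horizontal sum at the end. */
static float dot(const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);                 /* sketch only: no scalar tail loop */
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i),
                                         _mm_loadu_ps(b + i)));
    return hsum_ps(acc);
}
```

Keeping the accumulator as a vector and reducing only once means the shuffle cost is paid per call rather than per iteration.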

The lack of cross-manufacturer support made software development with SSE a challenge during the initial years. With AMD's adoption of SSE2 into its 64bit processors during 2003/2004, this problem gradually disappeared. As of today, there exist virtually no processors without SSE/SSE2 support. SSE2 is part of x86-64 baseline, with twice as many vector registers available in 64bit mode.

2314 questions
1
vote
1 answer

Find index of unaligned int or long in byte array using SIMD

I have a byte sequence that I want to scan to find index of an integer (or long) value. It can be at any byte offset, not necessarily a multiple of the size. Specifically I am interested in first occurrence but an example for all indexes will also…
ömer hayyam
  • 173
  • 6
1
vote
0 answers

Data movement speed between GPR-XMM and Memory-XMM

Suppose: General Purpose Register (GPR) like r8 is holding value 3.14. r9 is holding value address of 2.71 in memory. Which one faster: This movq xmm0, r8 //reading 3.14 from r8 movq r8, xmm0 //writing 3.14 to r8 Or this movsd xmm1, [r9]…
Citra Dewi
  • 213
  • 3
  • 12
1
vote
0 answers

SSE code using intrinsics runs about as fast as regular code

I wrote some code to test/play around comparing regular C++ code with SSE intrinsics. What I noticed is that both sections of the code shown below run at similar times, usually with a difference of 5-10%. Naïvely, I'd expect something more…
1
vote
2 answers

SSE interleave/merge/combine 2 vectors using a mask, per-element conditional move?

Essentially i am trying to implement a ternary-like operation on 2 SSE (__m128) vectors. The mask is another __m128 vector obtained from _mm_cmplt_ps. What i want to achieve is to select element of vector a when the corresponding element of the mask…
JustClaire
  • 451
  • 3
  • 11
1
vote
0 answers

SSE code to find maximum of array of integers

I am working on optimizing my c++ code to write in to SSE instructions. I am working on the loop where we are. finding the max of vector. void findMax(vector & index) auto size = index.size(); Uint64_t max_val = index[0]; …
1
vote
2 answers

Is there a better way to any detect bits that are set in a 16-byte array of flags?

ALIGNTO(16) uint8_t noise_frame_flags[16] = { 0 }; // Code detects noise and sets noise_frame_flags omitted __m128i xmm0 = _mm_load_si128((__m128i*)noise_frame_flags); bool isNoiseToCancel = _mm_extract_epi64(xmm0, 0)…
the kamilz
  • 1,860
  • 1
  • 15
  • 19
1
vote
0 answers

movdqa segfault in custom asm script

I have the following code snippet (https://godbolt.org/z/cE1qE9fvv) which contains a naive & vectorized version of a dot product. I decided to make the vectorized version compile in standalone asm file as following: extern exit section…
Ferdinand Mom
  • 59
  • 1
  • 5
1
vote
0 answers

Is it worth zeroing an XMM register for scalar one-input one-output instructions?

Some SSE instructions take one scalar input for one scalar output, such as, sqrtss, rsqrtss, rcpss, ... These instructions don't change the upper bits of the output register, so I believe it has a dependency on the output register. Is it worth…
xiver77
  • 2,162
  • 1
  • 2
  • 12
1
vote
0 answers

Using the mask returned by _mm_cmplt_epi16() to conditionally _mm_set_epi16 using SSE 1 .. SSE4.2

I'm adding offsets to x- and y-coordinates to then get the color values at the new (x;y), but I have to make sure the coordinates are not out of bounds. So I check if the values are greater than -1 using _mm_cmplt_epi16(lane, minus_one). and I get…
BETSCH
  • 104
  • 2
  • 3
  • 8
1
vote
1 answer

Missing strlen_sse4.S results in Segmentation Fault

i'm writing a small tool written in c and met on a segmentation fault which i don't know currently how to resolve. Running in GDB gives me the following hint: Program received signal SIGSEGV, Segmentation fault. __strlen_sse42 () at…
Ruun
  • 521
  • 1
  • 7
  • 12
1
vote
0 answers

How to receive multiple statements and print them out in x86 -64 Assembly (Intel Syntax)

I'm fairly new to X86-64 assembly, and was writing a hybrid program (c++ and assembly) to get the user's name, two sides of a triangle, and one angle. my following code : Here's the prompts : segment .data align 16 NamePrompt db "--Please enter…
1
vote
1 answer

AVX divide __m256i packed 32-bit integers by two (no AVX2)

I'm looking for the fastest way to divide an __m256i of packed 32-bit integers by two (aka shift right by one) using AVX. I don't have access to AVX2. As far as I know, my options are: Drop down to SSE2 Something like AVX __m256i integer division…
GlassBeaver
  • 196
  • 4
  • 15
1
vote
1 answer

What is the 4-way SIMD version of float selection on OSX Accelerate framework?

Using the Accelerate framework from OSX, you get access to 4-way SIMD functionality where you can operate on vector floats, vector ints and vector bools. It gives you 4-way divisions e.g. and also 4-way sin,cos,tan etc. For a vector float of 4…
Bram
  • 7,440
  • 3
  • 52
  • 94
1
vote
2 answers

count number of unique values in a 128bit avx vector, or detecting if all elements are equal?

I'm optimizing a hot path in my codebase and i have turned to vectorization. Keep in mind, I'm still quite new to all of this SIMD stuff. Here is the problem I'm trying to solve, implemented using non-SIMD inline int count_unique(int c1, int c2, int…
Cloud11665
  • 94
  • 1
  • 6
1
vote
1 answer

Replace `movss xmm0, cs:dword_5B27420` with `movss xmm0, immediate`

I have a linux .so file in Ida Pro and I have the following instruction: movss xmm0, cs:dword_5B27420 Is it possible to move a fixed value into xmm0 using the same or less number of bytes than that instruction? The instruction bytes are: F3 0F 10…