Questions tagged [sse]

SSE (Streaming SIMD Extensions) was the first of many similarly-named vector extensions to the x86 instruction set. At this point, SSE is more often a catch-all for x86 vector instructions in general than a reference to SSE without SSE2, SSE3, etc. (For Server-Sent Events, use the [server-sent-events] tag instead.)

See the tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions.

SIMD / SSE basics: What are the 128-bit to 512-bit registers used for? with links to many examples.


SSE/SIMD vector programming guides, focused on the SIMD aspect rather than general x86:

  • Agner Fog's Optimizing Assembly guide has a chapter on vectors, including tables of data movement instructions: broadcasts within a vector, combining data between two vectors, different kinds of shuffles, etc. It's great for finding the right instruction (or intrinsic) for the data movement you need.

  • Crunching Numbers with AVX and AVX2: An intro with examples of using C++ intrinsics

  • Slides + text: SIMD at Insomniac Games (GDC 2015): an intro to SIMD with some specific examples, such as checking all doors against all characters in a level. Advanced tricks: filtering an array into a smaller array (using left-packing based on a compare mask), with an SSSE3 pshufb solution and an SSE2 move-distance solution. Also: generating N-bit masks for variable-per-element N, including a clever float-exponent-based SSE2 version.


Instruction-set / intrinsics reference guides (see the x86 tag wiki for more links)


Miscellaneous specific things:


Streaming SIMD Extensions (SSE) basics

Together, the various SSE extensions allow working with 128b vectors of float, double, or integer (from 8b to 64b) elements. There are instructions for arithmetic, bitwise operations, shuffles, blends (conditional moves), compares, and some more-specialized operations (e.g. SAD for multimedia, carryless-multiply for crypto/finite-field math, strings (for strstr() and so on)). FP sqrt is provided, but unlike the x87 FPU, math library functions like sin must be implemented by software. SSE for scalar FP math has replaced x87 floating point, now that hardware support is near-universal.
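As a rough illustration of the packed arithmetic described above, here is a minimal sketch that adds two float arrays four elements at a time. The function name add_floats is made up for this example; the intrinsics are the standard ones from <xmmintrin.h>, and unaligned loads/stores are used so the pointers need no particular alignment.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
   n does not have to be a multiple of 4. */
void add_floats(const float *a, const float *b, float *out, int n)
{
    int i = 0;

    /* Main loop: one 128-bit vector (4 floats) per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    /* Scalar cleanup for the last 0..3 elements. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Compilers targeting x86-64 can build this without extra flags, since SSE2 is part of the baseline there.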

Efficient use usually requires programs to store their data in contiguous chunks, so it can be loaded in chunks of 16B and used without too much shuffling. SSE doesn't offer loads / stores with a stride, only packed ones, which is why a struct-of-arrays (SoA) layout usually works better than an array-of-structs (AoS) layout. Alignment requirements on memory operands can also be a hurdle, even though modern hardware has fast unaligned loads/stores.
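To make the SoA point concrete, here is a small sketch that computes squared lengths of four 3D points at once. The struct PointsSoA and the function length_sq4 are hypothetical names for this example. Because each component lives in its own contiguous array, each load fetches four x (or y, or z) values directly, with no shuffling.

```c
#include <xmmintrin.h>  /* SSE */

/* Hypothetical struct-of-arrays layout: one contiguous array per component. */
struct PointsSoA {
    const float *x;
    const float *y;
    const float *z;
};

/* Writes x*x + y*y + z*z for points i..i+3 into out[0..3].
   The caller must make sure at least 4 points exist starting at index i. */
void length_sq4(const struct PointsSoA *p, int i, float *out)
{
    __m128 x = _mm_loadu_ps(p->x + i);
    __m128 y = _mm_loadu_ps(p->y + i);
    __m128 z = _mm_loadu_ps(p->z + i);

    __m128 sum = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                            _mm_mul_ps(z, z));
    _mm_storeu_ps(out, sum);
}
```

With an AoS layout (x, y, z interleaved in memory), the same computation would first need shuffles to gather each component into its own vector.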

While there are many instructions available, the instruction set is not very orthogonal. It's not uncommon to find the operation you need, but only available for elements of a different size than you're working with. Another good example is that floating point shuffles (SHUFPS) have different semantics than 32b-integer shuffles (PSHUFD).
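The SHUFPS / PSHUFD difference can be seen with the corresponding intrinsics. In this sketch (function names are made up for illustration), _mm_shuffle_epi32 selects every result element from its single source, while _mm_shuffle_ps takes the low two result elements from its first source and the high two from its second.

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE header that defines _MM_SHUFFLE */

/* PSHUFD: one source, any element can go to any position.
   Here the four 32-bit elements are reversed: result = {a3, a2, a1, a0}. */
__m128i reverse_epi32(__m128i a)
{
    return _mm_shuffle_epi32(a, _MM_SHUFFLE(0, 1, 2, 3));
}

/* SHUFPS: two sources. Result elements 0 and 1 are picked from a,
   elements 2 and 3 are picked from b: result = {a0, a1, b2, b3}. */
__m128 low_a_high_b(__m128 a, __m128 b)
{
    return _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0));
}
```

Passing the same register as both sources of _mm_shuffle_ps gives a single-source shuffle equivalent in data movement to PSHUFD.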

Details

SSE added new architectural registers (xmm0-xmm7, 128b each; xmm0-xmm15 in 64bit mode), requiring OS support to save/restore them on context switches. The previous MMX extensions (for integer SIMD) reused the x87 FP registers.

Intel introduced MMX, original-SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2. AMD's XOP (a revision of their SSE5 plans) was never picked up by Intel, and will be dropped even by future AMD designs. The instruction-set war between Intel and AMD has led to many sub-optimal results, which make instruction decoders in CPUs require more power and transistors (and limit the opportunity for further extensions).

"SSE" commonly refers to the whole family of extensions. Writing programs that make sure to only use instructions supported by the machine they run on is necessary, and implied, and not worth cluttering our language with. (A good approach is to detect what's supported once at startup and set function pointers accordingly, avoiding a branch to select an appropriate function every time one is needed.)

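A minimal sketch of the detect-once, dispatch-through-a-function-pointer idea follows. It assumes GCC or Clang on x86, since __builtin_cpu_supports and the target attribute are compiler extensions; MSVC would need __cpuid instead. All function names here (max_scalar, max_sse41, init_dispatch, max_i32) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* Portable fallback. Requires n >= 1. */
static int32_t max_scalar(const int32_t *v, size_t n)
{
    int32_t best = v[0];
    for (size_t i = 1; i < n; i++)
        if (v[i] > best) best = v[i];
    return best;
}

/* SSE4.1 version (PMAXSD). The target attribute lets this one function use
   SSE4.1 even when the rest of the file is built without -msse4.1. */
__attribute__((target("sse4.1")))
static int32_t max_sse41(const int32_t *v, size_t n)
{
    if (n < 4) return max_scalar(v, n);

    __m128i m = _mm_loadu_si128((const __m128i *)v);
    size_t i = 4;
    for (; i + 4 <= n; i += 4)
        m = _mm_max_epi32(m, _mm_loadu_si128((const __m128i *)(v + i)));

    /* Reduce the 4 lanes, then handle the leftover elements. */
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, m);
    int32_t best = max_scalar(lanes, 4);
    for (; i < n; i++)
        if (v[i] > best) best = v[i];
    return best;
}

/* Defaults to the safe version until init_dispatch() has run. */
static int32_t (*max_impl)(const int32_t *, size_t) = max_scalar;

/* Call once at startup: the CPU check happens here, not on every call. */
void init_dispatch(void)
{
    __builtin_cpu_init();
    max_impl = __builtin_cpu_supports("sse4.1") ? max_sse41 : max_scalar;
}

int32_t max_i32(const int32_t *v, size_t n)
{
    return max_impl(v, n);
}
```

After init_dispatch() has run, every call to max_i32 is a single indirect call with no feature check on the hot path.
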
Further SSE extensions are not expected: AVX introduced a new 3-operand version of all SSE instructions, as well as some new features (including dropping alignment requirements, except for explicitly-aligned moves like vmovdqa). Further vector extensions will be called AVX-something, until Intel comes up with something different enough to change the name again.

History

SSE, first introduced with the Pentium III in 1999, was Intel's reply to AMD's 3DNow! extension from 1998.

The original-SSE added vector single-precision floating point math. Integer instructions to operate on xmm registers (instead of 64bit mmx regs) didn't appear until SSE2.

Original-SSE can be considered somewhat half-hearted insofar as it only covered the most basic operations and suffered from severe limitations both in functionality and performance, making it mostly useful for a few select applications, such as audio or raster image processing.

Most of SSE's limitations were ameliorated by the SSE2 instruction set. The only notable limitation remaining to date is the lack of a horizontal addition or dot product operation that is both efficient and widely available. While SSE3 and SSE4.1 added horizontal add and dot product instructions, they're usually slower than a manual shuffle+add. Use horizontal operations only at the end of a loop, not inside it.
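One common way to write the manual shuffle+add reduction with only original-SSE instructions is sketched below. The names hsum_ps and sum_array are chosen for this example. The loop accumulates vertically, and the horizontal step runs once after the loop.

```c
#include <xmmintrin.h>  /* SSE */

/* Horizontal sum of the 4 floats in v using shuffle + add (no HADDPS). */
float hsum_ps(__m128 v)
{
    /* Swap neighbours: shuf = {v1, v0, v3, v2}. */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* sums = {v0+v1, v0+v1, v2+v3, v2+v3}. */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Bring the high pair down: element 0 of shuf becomes v2+v3. */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Element 0 = (v0+v1) + (v2+v3). */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

/* Sums n floats: vertical adds inside the loop, one horizontal sum after it. */
float sum_array(const float *a, int n)
{
    __m128 acc = _mm_setzero_ps();
    int i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));

    float total = hsum_ps(acc);
    for (; i < n; i++)   /* scalar cleanup for the last 0..3 elements */
        total += a[i];
    return total;
}
```

Note that the vectorized version adds elements in a different order than a plain scalar loop, so results can differ slightly in the last bits for general floating-point data.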

The lack of cross-manufacturer support made software development with SSE a challenge during the initial years. With AMD's adoption of SSE2 into its 64bit processors during 2003/2004, this problem gradually disappeared. Today, virtually no x86 processors lack SSE/SSE2 support. SSE2 is part of the x86-64 baseline, with twice as many vector registers available in 64bit mode.

2314 questions
1 vote, 3 answers

Call libmvec functions manually on __m128 vectors?

According to this page https://sourceware.org/glibc/wiki/libmvec, I should be able to manually vectorize a few complicated instructions like cosine by using the libmvec functions. However, I don't know how to get gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1)…
Simon Goater
1 vote, 2 answers

How to check whether odd lane is in a given range when its previous even lane equals to some value using SIMD?

This question is an extension of How to check if even/odd lanes are in given ranges using SIMD?. Given a __m128i which stores 16 chars, the even-index lane refers to even lane (i.e., lanes at 0, 2, 4, ..., 14), and odd-index lane refers to odd lane…
chenzhongpu
1 vote, 2 answers

x86 Intrinsic: How to optimize outer/inner loop of FIR

The following code is used to calculate FIR: void Fir(float* pIn, float* pOut, float* pCoeff, float* pStage, uint32_t N, uint32_t FilterLength) { int n, k; float* pSrc; float* pCoeffSrc = pCoeff; float* pDst = pOut; float s0,…
Zvi Vered
1 vote, 1 answer

Split 16-bit vector (__m128i) into 2 vectors of odd and even positions with Intel intrinsics

__m128i a = {1,2,3,4,5,6,7,8}; //8x16bit I want to split this register into 2 vectors each contains 4x32bit : __m128i x = {1,3,5,7} __m128i y = {2,4,6,8} Is it possible with intrinsic code ? In RAM, I have raw data of 16bits words. e.g:…
Zvi Vered
1 vote, 0 answers

Cross compile C++ for ARM64/x86_64, using clang, with core2-duo enabled

OK, so I am new to cross compilation. I am writing some shell-scripts to compile some C++ files, on my Mac. I want to build a "Fat universal binary", so I want this to work for Arm64 and x86_64. After a lot of searching, I found using: --arch arm64…
boytheo
1 vote, 0 answers

how to cast __m128 to union when returning

I want to return the result of _mm_add_ps() but the returning type should be a custon union that has __m128 member inside. I tested the performance of returning __m128 and a custom union. It seems that on MSVC this: return _mm_add_ps(V1, V2); is…
Zer0day
1 vote, 1 answer

In SIMD, SSE2,many instructions named as "_mm_set_epi8","_mm_cmpgt_epi8 " and so on,what does "mm" "epi" mean?

I see many instruction with shorthand such as "_mm_and_si128". I want to know what does the "mm" mean.
dongwang
1 vote, 0 answers

MOVDQU vs MOVDQA Instruction (x86/x64 assembly) better insights

First of all, let's start with the following links about MOVDQA and MOVDQU which are already in this community: MOVDQU instruction + page boundary MOVUPD vs. MOVDQU (x86/x64 assembly) Difference between MOVDQA and MOVAPS x86 instructions? Assembly…
RajibTheKing
1 vote, 0 answers

The correct way to search for a substring in a string

the most part of my question is: how to deal with the cases when a string loaded into __m128i contains only part of a substring? the requirement: to search escaped sequences or the '"' (double quoting, not escaped) at the same time. examples: (there…
niXman
1 vote, 0 answers

_mm_load_si128 is NOT throwing on unaligned access

Intel's manual mentions that, it may generate exception, wording seems a little bit interesting. Load 128-bits of integer data from memory into dst. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be…
Hasan Emrah Süngü
1 vote, 0 answers

How to understand which constraints are not working in an inline code assembly?

When I run the assembly code bellow it gets me the error: impossible constraint in ‘asm’. int main(){ int constant[4] = {0xff, 0xff, 0xff, 0xff}; int source_image[8] = {10,56,54,88,61,250,80,157}; int negative_image[8]; int len = 8; …
1 vote, 0 answers

AVX2/VCL : static/dynamic lane scheduling

I have been trying to speed up a binary tree evaluation algo using AVX2. Actually, I'm using Agner's VCL lib since the difference between hand-coding the algo and using vcl was small for big gain in readability. I have a list of trees that need to…
David Jobet
1 vote, 1 answer

sse4 packed sum between int32_t and int16_t (sign extend to int32_t)

I have the following code snippet (a gist can be found here) where I am trying to do a sum between 4 int32_t negative values and 4 int16_t values (that will be sign extend to int32_t). extern exit global _start section .data a: …
Ferdinand Mom
1 vote, 0 answers

Qemu invalid instruction trap on SSE instruction

Working with NYU's fork of MIT's xv6 operating system, we found we would get crashes under GCC 11 & 12 due to default usage of SSE2 instructions under -O0. Problem is I don't know why. Issue is first encountered during an entirely innocent struct…
nickelpro
1 vote, 1 answer

SSE/AVX: using float shuffles + casts as substitute for missing integer shuffle intrinsics?

Is it always ok to simply use float shuffles + casts as substitute for missing integer shuffle intrinsics in SSE/AVX, like this: __m128i x = _mm_castps_si128( _mm_shuffle_ps ( _mm_castsi128_ps(y), ... In theory this should, of course, work with…
zx-81