Questions tagged [sse]

SSE (Streaming SIMD Extensions) was the first of many similarly-named vector extensions to the x86 instruction set. At this point, SSE is more often used as a catch-all for x86 vector instructions in general, rather than a reference to SSE to the exclusion of SSE2, SSE3, etc. (For Server-Sent Events, use the [server-sent-events] tag instead.)

See the tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions.

SIMD / SSE basics: What are the 128-bit to 512-bit registers used for? with links to many examples.


SSE/SIMD vector programming guides, focused on the SIMD aspect rather than general x86:

  • Agner Fog's Optimizing Assembly guide has a chapter on vectors, including tables of data-movement instructions: broadcasts within a vector, combining data between two vectors, different kinds of shuffles, etc. It's great for finding the right instruction (or intrinsic) for the data movement you need.

  • Crunching Numbers with AVX and AVX2: An intro with examples of using C++ intrinsics

  • Slides + text: SIMD at Insomniac Games (GDC 2015): an intro to SIMD, plus some specific examples such as checking all doors against all characters in a level. Advanced tricks: filtering an array into a smaller array (using left-packing based on a compare mask), with an SSSE3 pshufb solution and an SSE2 move-distance solution; a minimal sketch of the pshufb left-packing idea follows below. Also: generating N-bit masks for variable-per-element N, including a clever float-exponent based SSE2 version.
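
The left-packing trick itself isn't reproduced in this wiki, so below is a minimal C sketch of the pshufb idea under stated assumptions (it is not the Insomniac code): the "keep positive floats" predicate, the table and function names, and the convention that each store writes a full 16 bytes (so dst needs a little slack) are all illustrative. It requires SSSE3 (-mssse3 on GCC/Clang), and init_left_pack_table() must be called once before use.

    #include <immintrin.h>   /* SSSE3: _mm_shuffle_epi8 */
    #include <stdint.h>
    #include <stddef.h>

    /* One pshufb control per 4-bit compare mask: the bytes of the kept 32-bit
       lanes, packed to the front; 0x80 zeroes the unused tail. */
    static uint8_t shuf_table[16][16];

    static void init_left_pack_table(void) {
        for (int mask = 0; mask < 16; mask++) {
            int out = 0;
            for (int lane = 0; lane < 4; lane++) {
                if (mask & (1 << lane)) {
                    for (int b = 0; b < 4; b++)
                        shuf_table[mask][out * 4 + b] = (uint8_t)(lane * 4 + b);
                    out++;
                }
            }
            for (int b = out * 4; b < 16; b++)
                shuf_table[mask][b] = 0x80;
        }
    }

    /* Copy the positive elements of src[0..n) to dst, returning how many were
       kept.  Each store writes a full 16 bytes, so dst needs slack at the end. */
    static size_t filter_positive(const float *src, float *dst, size_t n) {
        size_t out = 0;
        for (size_t i = 0; i + 4 <= n; i += 4) {
            __m128 v     = _mm_loadu_ps(src + i);
            __m128 keep  = _mm_cmpgt_ps(v, _mm_setzero_ps());
            int mask     = _mm_movemask_ps(keep);              /* 4-bit keep mask */
            __m128i ctl  = _mm_loadu_si128((const __m128i *)shuf_table[mask]);
            __m128i pack = _mm_shuffle_epi8(_mm_castps_si128(v), ctl);
            _mm_storeu_si128((__m128i *)(dst + out), pack);
            out += (size_t)__builtin_popcount((unsigned)mask);
        }
        return out;   /* a scalar tail for the n % 4 leftovers is omitted */
    }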


Instruction-set / intrinsics reference guides (see the x86 tag wiki for more links)


Miscellaneous specific things:


Streaming SIMD Extensions (SSE) basics

Together, the various SSE extensions allow working with 128b vectors of float, double, or integer (from 8b to 64b) elements. There are instructions for arithmetic, bitwise operations, shuffles, blends (conditional moves), compares, and some more-specialized operations (e.g. SAD for multimedia, carryless multiply for crypto/finite-field math, string instructions for strstr() and the like). FP sqrt is provided, but unlike the x87 FPU, math library functions like sin must be implemented in software. SSE has replaced x87 floating point for scalar FP math, now that hardware support is near-universal.
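
For readers new to the intrinsics style, here is a minimal, hypothetical C example using <immintrin.h> (the values, and the choice of an SSE4.1 blend, are illustrative only; compile with -msse4.1 or -march=native on GCC/Clang):

    #include <immintrin.h>   /* SSE/SSE2/SSE4.1 intrinsics */
    #include <stdio.h>

    int main(void) {
        __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);     /* four packed floats */
        __m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);

        __m128 sum  = _mm_add_ps(a, b);                      /* element-wise add */
        __m128 mask = _mm_cmpgt_ps(b, _mm_set1_ps(15.0f));   /* all-ones / all-zeros per lane */
        __m128 sel  = _mm_blendv_ps(a, b, mask);             /* SSE4.1 blend: take b where mask is set */

        float out[4];
        _mm_storeu_ps(out, _mm_add_ps(sum, sel));            /* combine, then store */
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }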

Efficient use usually requires programs to store their data in contiguous chunks, so it can be loaded 16 bytes at a time and used without too much shuffling. SSE doesn't offer strided loads/stores, only packed ones (the SoA vs. AoS question: structs-of-arrays vs. arrays-of-structs). Alignment requirements on memory operands can also be a hurdle, even though modern hardware has fast unaligned loads/stores.
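
As an illustration of the SoA point, a hypothetical sketch (the struct and function names are made up; it assumes 16-byte-aligned arrays and n being a multiple of 4):

    #include <immintrin.h>
    #include <stddef.h>

    /* AoS: x, y, z are interleaved in memory, so loading four consecutive
       x values needs extra shuffling or gathers. */
    struct particle_aos { float x, y, z, w; };

    /* SoA: one contiguous array per component, so four consecutive x values
       are a single 16-byte load. */
    struct particles_soa { float *x, *y, *z; };

    /* Move every particle along x. */
    void advance_x(struct particles_soa *p, float dx, size_t n) {
        __m128 vdx = _mm_set1_ps(dx);
        for (size_t i = 0; i < n; i += 4) {
            __m128 vx = _mm_load_ps(p->x + i);              /* aligned 16-byte load */
            _mm_store_ps(p->x + i, _mm_add_ps(vx, vdx));
        }
    }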

While there are many instructions available, the instruction set is not very orthogonal. It's not uncommon to find the operation you need, but only for elements of a different size than the one you're working with. Another example: floating-point shuffles (SHUFPS) have different semantics than 32-bit integer shuffles (PSHUFD).
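
A small example of that SHUFPS/PSHUFD difference, written with the corresponding intrinsics (the particular shuffle patterns are arbitrary, chosen only to show the two-source vs. one-source semantics):

    #include <immintrin.h>

    /* SHUFPS: two sources; the low two output lanes come from the first
       operand, the high two from the second. */
    __m128 fp_shuffle(__m128 a, __m128 b) {
        /* result = { a[0], a[3], b[1], b[2] } */
        return _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 1, 3, 0));
    }

    /* PSHUFD: one source; any lane can go anywhere. */
    __m128i int_shuffle(__m128i a) {
        /* result = { a[3], a[2], a[1], a[0] } -- a full reversal */
        return _mm_shuffle_epi32(a, _MM_SHUFFLE(0, 1, 2, 3));
    }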

Details

SSE added new architectural registers: xmm0-xmm7, 128b each (xmm0-xmm15 in 64bit mode), requiring OS support to save/restore them on context switches. The previous MMX extensions (for integer SIMD) reused the x87 FP registers.

Intel introduced MMX, the original SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2. AMD's XOP (a revision of their SSE5 plans) was never picked up by Intel and has been dropped even by later AMD designs. The instruction-set war between Intel and AMD has led to many sub-optimal results, making instruction decoders in CPUs require more power and transistors (and limiting the opportunity for further extensions).

"SSE" commonly refers to the whole family of extensions. Writing programs that make sure to only use instructions supported by the machine they run on is necessary, and implied, and not worth cluttering our language with. (Setting function pointers is a good way to detect what's supported once at startup, avoiding a branch to select an appropriate function every time one is needed.)

Further SSE extensions are not expected: AVX introduced a new 3-operand version of all SSE instructions, as well as some new features (including relaxed alignment requirements on memory operands, except for explicitly-aligned moves like vmovdqa). Further vector extensions will be called AVX-something, until Intel comes up with something different enough to change the name again.

History

SSE, first introduced with the Pentium III in 1999, was Intel's reply to AMD's 3DNow! extension, which had appeared in 1998.

The original-SSE added vector single-precision floating point math. Integer instructions to operate on xmm registers (instead of 64bit mmx regs) didn't appear until SSE2.

Original-SSE can be considered somewhat half-hearted insofar as it only covered the most basic operations and suffered from severe limitations both in functionality and performance, making it mostly useful for a few select applications, such as audio or raster image processing.

Most of SSE's limitations were ameliorated by the SSE2 instruction set; the only notable one remaining to date is the lack of a horizontal add or dot-product operation that is both efficient and widely available. SSE3 and SSE4.1 did add horizontal-add and dot-product instructions, but they're usually slower than a manual shuffle+add, so save horizontal operations for the end of a loop rather than the inside (one such pattern is sketched below).
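
One widely-used shuffle+add pattern for summing the four floats of a __m128, shown as an illustrative sketch (it uses SSE3's movshdup, so compile with -msse3 or higher):

    #include <immintrin.h>

    /* Horizontal sum of the four floats in v using shuffles + adds. */
    static float hsum_ps(__m128 v) {
        __m128 shuf = _mm_movehdup_ps(v);         /* { v1, v1, v3, v3 } */
        __m128 sums = _mm_add_ps(v, shuf);        /* { v0+v1, ., v2+v3, . } */
        shuf        = _mm_movehl_ps(shuf, sums);  /* bring v2+v3 down to lane 0 */
        sums        = _mm_add_ss(sums, shuf);     /* lane 0 = (v0+v1)+(v2+v3) */
        return _mm_cvtss_f32(sums);
    }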

The lack of cross-manufacturer support made software development with SSE a challenge during the initial years. With AMD's adoption of SSE2 into its 64bit processors during 2003/2004, this problem gradually disappeared. As of today, there exist virtually no processors without SSE/SSE2 support. SSE2 is part of x86-64 baseline, with twice as many vector registers available in 64bit mode.

2314 questions
1 vote, 2 answers

SSE2 instruction in C code

I am trying to reverse engineer some C code, but this part of the assembly I can't really understand. I know it is part of the SSE extension. However, some things are really different than what I am used to in x86 instructions. static int sad16_sse2(void *v,…
Keeto
1 vote, 1 answer

Dependence speed up on data size using auto vectorization and sse

I'm trying to speed up some code using auto vectorization from the Intel Compiler and using SSE. All computations transform some struct node_t into another struct w_t (functions tr() and gen_tr()). When I try to vectorize the function gen_tr() it does…
Nik0las
1 vote, 1 answer

Can I compile OpenCL code into ordinary, OpenCL-free binaries?

I am evaluating OpenCL for my purposes. It occurred to me that you can't assume it working out-of-the-box on either Windows or Mac because: Windows needs an OpenCL driver (which, of course, can be installed) MacOS supports OpenCL only on MacOS >=…
clemens
1 vote, 1 answer

simd store delay

I have the following type of code short v[8] __attribute__ (( aligned(16))); ... // in an inlined function : _mm_store_si128(v, some_m128i_value); ... // some more operation (4 additions ) outp[0] = v[1] / 2; // <- first access of v since the…
shodanex
1 vote, 2 answers

What is the equivalent of v4sf and __attribute__ in Visual Studio C++?

typedef float v4sf __attribute__ ((mode(V4SF))); This is in GCC. Anyone knows the equivalence syntax? VS 2010 will show __attribute__ has no storage class of this type, and mode is not defined. I searched on the Internet and it said Equivalent to…
CppLearner
1 vote, 3 answers

multiplication using SSE (x*x*x)

I'm trying to optimize a cube function using SSE long cube(long n) { return n*n*n; } I have tried this : return (long) _mm_mul_su32(_mm_mul_su32((__m64)n,(__m64)n),(__m64)n); And the performance was even worse (and yes I have never done…
sherif
1 vote, 1 answer

Confuse about the bitmap of XMM register

Sorry I don't have a good title... I was reading this thread: Vector Matrix Multiplication In SSE The original poster had the following code // xmm0 = (v0,v1,v2,v3) movups xmm0, [eax] // xmm0 = (v0,v0,v0,v0) // xmm1 = (v1,v1,v1,v1) // xmm2 =…
CppLearner
1 vote, 1 answer

Inline-Assembler-Code in C, copy values from Array to xmm

I have two Arrays and I want to get the dot product. How do I get the values of vek and vec into xmm0 and xmm1? And how do I get the Value standing in xmm1 (??) so that I can use it for "printf"? #include main(){ float vek[4] = {4.0, 3.0,…
degude
1 vote, 1 answer

SSE instruction sanity check

The code below has me slightly perplexed: function(__m128 foo) { __m128 bar = _mm_shuffle_ps(foo, foo, _MM_SHUFFLE(2,2,2,2)) } Is it just taking the 2nd word of foo and pasting it 4 times into bar or does it do something else as well?
Michael Dorgan
1 vote, 1 answer

SSE intrinsics for comparison (_mm_cmpeq_ps) and assignment operation

I have started optimising my code using SSE. Essentially it is a ray tracer that processes 4 rays at a time by storing the coordinates in __m128 data types x, y, z (the coordinates for the four rays are grouped by axis). However I have a branched…
cubiclewar
1 vote, 1 answer

Really basic SSE

I have a very simple program whose performance I am trying to improve. One way that I know will help is to utilize SSE3 (since the machine that I am working on supports this), but I have absolutely no idea how to do this. Here is a code snippet…
AndroidDev
1 vote, 2 answers

SSE Vector Comparison with Epsilon

I am writing software that needs to compare two _mm256 vectors for equality. However, I would like there to be a margin of error +/- 0.00001. Eg, 3.00001 should be considered equal to 3.00002. Is there a simple way to do this using SSE/AVX/AVX2…
1 vote, 1 answer

First movss, then unpcklps with zeroes, not changing the high zeros. Why?

I am new to x86 and have no experience in it, so this code looks kinda obsolete to me. Is there any purpose in doing this? The instructions are: rcx+000003F8 = 32bit float xmm0 = 0 (all 128bits) movss xmm4,[rcx+000003F8] unpcklps…
1 vote, 0 answers

Sum of bytes in an __m128 register

I am trying to find the sum of all bytes in an __m128 register using SSE and SSE2. So far what I have is __m128i sum = _mm_sad_epu8(bytes, _mm_setzero_si128()); return _mm_cvtsi128_si32(sum) + _mm_extract_epi16(sum, 4); where bytes is the __m128…
1 vote, 1 answer

Improving the accuracy in SIMD summation

I am investigating the performance of SIMD summation in terms of speed and accuracy. To conduct this analysis, let's consider a scenario where I have multiple 1D arrays of doubles combined into a large 1D array, stored contiguously in memory. My…
user3116936