Questions tagged [sse2]

x86 Streaming SIMD Extensions 2 adds support for packed integer and double-precision floats in the 128-bit XMM vector registers. It is always supported on x86-64, and supported on every x86 CPU from 2003 or later.

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions, and the SSE tag wiki for other SSE- and SSE2-related resources.


SSE2 is one of the SSE family of x86 instruction-set extensions.

SSE2 adds support for double-precision floating point, and packed-integer (8bit to 64bit elements) in XMM registers. It is baseline in x86-64, so 64bit code can always assume SSE2 support, without having to check. 32bit code could still be run on a CPU from before 2003 (Athlon XP or Pentium III) that didn't support SSE2, but this is unlikely for most newly-written code. (And so an MMX or original-SSE fallback is not worth writing.)
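For 32bit code that does want to check, a minimal sketch could look like the following. It assumes GCC or Clang: the `__SSE2__` macro and the `__builtin_cpu_init` / `__builtin_cpu_supports` built-ins are compiler-specific, and the function name `have_sse2` is only illustrative. Other compilers (e.g. MSVC) need a different mechanism.

```c
#include <stdbool.h>

/* Returns true when SSE2 instructions may be used.
   Assumes GCC or Clang on x86: __SSE2__ and __builtin_cpu_supports
   are compiler-specific, not part of standard C. */
static bool have_sse2(void)
{
#if defined(__x86_64__) || defined(__SSE2__)
    /* x86-64, or a 32-bit build compiled with -msse2: SSE2 is
       guaranteed by the target, so no runtime check is needed. */
    return true;
#elif defined(__i386__)
    /* Generic 32-bit build: ask the CPU at run time. */
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false; /* not an x86 target */
#endif
}
```

On a 64bit build the whole function collapses to `return true`, which matches the point above that x86-64 code never has to check.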

Most tasks that benefit from vectors at all can be done fairly efficiently using only instructions up to SSE2. This is fortunate, because widespread support for later SSE versions took time. Use of later SSE extensions typically saves a couple instructions here and there, usually with only minor speed-ups. Notably absent until SSSE3 was PSHUFB, a shuffle whose operation was controlled by elements in a register, rather than a compile-time constant imm8. It can do things that SSE2 can't do efficiently at all.
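To illustrate what a shuffle with a compile-time constant imm8 looks like, here is a small sketch using the SSE2 `_mm_shuffle_epi32` intrinsic (PSHUFD). The helper name `reverse4_epi32` is made up for this example. The shuffle pattern has to be known when the code is compiled, which is exactly the limitation that PSHUFB later removed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Reverse the order of four 32-bit integers with PSHUFD.
   The shuffle pattern is the immediate _MM_SHUFFLE(0, 1, 2, 3);
   it must be a compile-time constant, unlike SSSE3 PSHUFB. */
static void reverse4_epi32(const int32_t in[4], int32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_si128((__m128i *)out, r);
}
```

Passing a runtime variable as the second argument of `_mm_shuffle_epi32` is not possible, so a data-dependent permutation needs a different approach under plain SSE2.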

AVX provides 3-operand versions of all SSE2 instructions.

History

Intel introduced SSE2 with their Pentium 4 design, launched in late 2000.

SSE2 was adopted by AMD for its 64bit CPU line in 2003/2004. As of 2009, few if any x86 CPUs in significant use lack the SSE2 instruction set. That makes it extremely attractive on the Windows PC platform: it offers a large feature set that can practically be treated as an omnipresent "minimum requirement". At least in 32bit mode, however, this does not remove the necessity to check processor features.

More recent instruction sets introduce fewer features, which are often highly specialized, and they are supported inconsistently between manufacturers and by a significantly smaller share of processors (10-50% in 2009).

SSE2 does not offer instructions for horizontal addition, which are needed for some geometric calculations (e.g. dot product) and complex arithmetic. This functionality has to be emulated with one or more shuffles, which are often not significantly slower than the dedicated instructions added in later revisions.
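As a rough sketch of that emulation, a horizontal add of two doubles can be built from a single shuffle (`_mm_unpackhi_pd`) and a scalar add, using SSE2 intrinsics only. The helper names `hsum_pd` and `dot2` are hypothetical, and this is one possible shuffle choice, not the only one.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Horizontal sum of the two doubles in v, using one shuffle
   (UNPCKHPD) instead of the SSE3 HADDPD instruction. */
static double hsum_pd(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);      /* [v1, v1] */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi)); /* v0 + v1 */
}

/* Dot product of two 2-element double vectors. */
static double dot2(const double a[2], const double b[2])
{
    __m128d prod = _mm_mul_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    return hsum_pd(prod);
}
```

For example, `dot2` on `{1.5, 2.0}` and `{4.0, 0.5}` computes `1.5*4.0 + 2.0*0.5`. In a loop over long arrays it is usually better to accumulate vertically and do the horizontal step once at the end.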

275 questions
2 votes, 3 answers

How can I implement Bit Shift Right and Bit Shift Left by Vector for 8-bit and 16-bit integers in SSE2?

I came across this post whilst doing research for my next project. Being able to bit shift 8 and 16-bit integers by vector using SIMD would be very useful to me and I think many other people here. Unfortunately for me, the platform my project will…
dave_thenerd
2 votes, 3 answers

How would you convert a "while" iterator into simd instructions?

This is the code I actually had (for a scalar code) which I've replicated (x4) storing data into simd: waveTable *waveTables[4]; for (int i = 0; i < 4; i++) { int waveTableIindex = 0; while ((phaseIncrement[i] >=…
markzzz
2 votes, 1 answer

The right way to use function _mm_clflush to flush a large struct

I am starting to use functions like _mm_clflush, _mm_clflushopt, and _mm_clwb. Say now as I have defined a struct name mystruct and its size is 256 Bytes. My cacheline size is 64 Bytes. Now I want to flush the cacheline that contains the mystruct…
2 votes, 2 answers

What is the most efficient way to do unsigned 64 bit comparison on SSE2?

PCMPGTQ doesn't exist on SSE2 and doesn't natively work on unsigned integers. Our goal here is to provide backward-compatible solutions for unsigned 64-bit comparisons so we can include them into the WebAssembly SIMD standard. This is a sister…
Dan Weber
2 votes, 1 answer

Strict aliasing and __m128i type

When using SSE2 intrinsic functions to do bit-wise operations, one has to cast pointers from int* to __m128i*. Does this code break strict aliasing rule? void bit_twiddling_func(int size, int const* input, int* output) { const __m128* x = (const…
pic11
2 votes, 1 answer

How to Shuffle a Vector128 and Add the elements, then Extract a scalar value properly?

I am using Vector128 in C# in order to count matches from a byte array with 16 index. This is part of implementing a byte version of Micro Optimization of a 4-bucket histogram of a large array or list, using the technique from How to count…
Andreas
2 votes, 1 answer

Set an XMM register to a repeating byte pattern (broadcast a constant byte)

I know that we can do something like this to move a character to a xmm register: movaps xmm1, xword [.__0x20] align 16 .__0x20 db 0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20 but since this is a memory process, i…
ELHASKSERVERS
2 votes, 0 answers

x86_64 asm calculate floating point to power of another floating point

Okay so I have to calculate this kind of calculation: 10^(some floating point value) where the floating point value (the exponent) is stored as double in xmm0 register calculated before with divsd xmm0, xmm1 with 64 bit double floating point…
The amateur programmer
2 votes, 3 answers

How do you process exp() with SSE2?

I'm making a code that essentially takes advantage of SSE2 on optimizing this code: double *pA = a; double *pB = b[voiceIndex]; double *pC = c[voiceIndex]; for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) { pC[sampleIndex] =…
markzzz
2 votes, 1 answer

Flush-to-zero denormals - is it reliable?

For signal processing this has been an issue like forever and right now I'm still taking precautions of adding a small constant whenever a denormal can happen, e.g.: float coef = 0.9f; for (int i=0; i
Vojtěch Melda Meluzín
2 votes, 1 answer

Array of sse type: Segmentation Fault

today I tried to initialize an array of the sse type __m128d. Unfortunately it didn't work - why? Is it generally impossible to create arrays of sse types (since they are register types?). The following code segfaults at the assignment in the…
Boom
2 votes, 2 answers

sse/sse2 double matrix float vector multiplication

I have to implement matrix-vector multiplication using sse/sse2. Vector and matrix are large. Matrix is double, vector is float. The point is that all calculations I have to do on floats - when I get data from matrix I promote it to float, do the…
user606521
2 votes, 0 answers

Is _MM_SET_ROUNDING_MODE(STMXCSR) thread safe?

I am using SIMD with SSE2 instruction set, I want to convert values from double to float using _mm_cvtpd_ps. But I need control over the rounding mode which is used. We are also using multithreading. So I want to know is it thread safe to use…
2 votes, 1 answer

how to minimize overhead loading double into simd registers working with scalar SIMD intrinsics

Using gcc 7.2 at godbolt.org I can see the following code is translated in assembler quite optimally. I see 1 load, 1 addition and 1 store. #include __attribute__((alwaysinline)) double foo(double x, double y) { return…
Fabio
2 votes, 2 answers

Shifting xmm integer register values using non-AVX instructions on Intel x86 architecture

I have the following problem which I need to solve using anything other than AVX2. I have 3 values stored in a m128i variable (the 4th value is not needed ) and need to shift those values by 4,3,5. I need two functions. One for the right logical…
CheckersGuy