Questions tagged [sse2]

x86 Streaming SIMD Extensions 2 adds support for packed integers and double-precision floats in the 128-bit XMM vector registers. It is always supported on x86-64, and supported on essentially every x86 CPU from 2003 or later.

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions, and the SSE tag wiki for other SSE- and SSE2-related resources.


SSE2 is one of the SSE family of x86 instruction-set extensions.

SSE2 adds support for double-precision floating point, and packed integers (8-bit to 64-bit elements) in XMM registers. It is baseline in x86-64, so 64-bit code can always assume SSE2 support without having to check. 32-bit code could still be run on a CPU from before 2003 (Athlon XP or Pentium III) that didn't support SSE2, but this is unlikely for most newly-written code. (And so an MMX or original-SSE fallback is usually not worth writing.)
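For 32-bit builds that do have to care about pre-SSE2 hardware, a minimal runtime check might look like the sketch below. It assumes GCC or Clang (__builtin_cpu_supports is their helper); other compilers would query CPUID directly.

    #include <stdio.h>

    int main(void) {
        // GCC/Clang builtin that reads CPUID once and caches the result
        if (__builtin_cpu_supports("sse2")) {
            puts("SSE2 available: use the vectorized code path");
        } else {
            puts("No SSE2: fall back to scalar code");
        }
        return 0;
    }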

Most tasks that benefit from vectors at all can be done fairly efficiently using only instructions up to SSE2. This is fortunate, because widespread support for later SSE versions took time. Using later SSE extensions typically saves a couple of instructions here and there, usually with only minor speed-ups. Notably absent until SSSE3 was PSHUFB, a shuffle controlled by elements in a register rather than by a compile-time-constant imm8; it can do things that SSE2 can't do efficiently at all.
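As an illustration (not part of the wiki text itself), here is a runtime-variable byte shuffle with SSSE3's PSHUFB, reversing the 16 bytes of a vector; with SSE2 alone this pattern would need a chain of fixed shuffles or a round-trip through memory.

    #include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8

    __m128i reverse_bytes(__m128i v) {
        // control vector: each byte of ctrl selects which source byte to copy
        const __m128i ctrl = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                          8, 9, 10, 11, 12, 13, 14, 15);
        return _mm_shuffle_epi8(v, ctrl);   // PSHUFB
    }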

AVX provides non-destructive 3-operand (VEX-encoded) versions of all SSE2 instructions.

History

Intel introduced SSE2 with the Pentium 4 in November 2000.

SSE2 was adopted by AMD with its first 64-bit CPUs (Opteron and Athlon 64) in 2003. As of 2009 there remained few if any x86 CPUs, at least in significant numbers, that did not support SSE2. This makes it very attractive on the Windows PC platform: a large feature set that can practically be treated as a minimum requirement and assumed to be present everywhere (although 32-bit code that must run on arbitrary old hardware still needs to check the CPUID feature bits).

More recent instruction-set extensions introduce fewer features, which are often highly specialized, and are at the same time supported inconsistently between manufacturers and only by a significantly smaller share of processors (roughly 10-50% in 2009).

SSE2 does not offer instructions for horizontal addition, which is needed for some geometric calculations (e.g. dot products) and for complex arithmetic. This functionality has to be emulated with one or several shuffles plus ordinary vertical adds, which however are often not significantly slower than the dedicated instructions added in later revisions (HADDPS/HADDPD in SSE3).
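For example, a horizontal sum of the two doubles in an XMM register can be written with SSE2 shuffles only (a minimal sketch, the function name is mine):

    #include <emmintrin.h>   // SSE2

    // sum the two double elements of v using only SSE2
    double hsum_pd_sse2(__m128d v) {
        __m128d hi  = _mm_unpackhi_pd(v, v);   // broadcast the high element
        __m128d sum = _mm_add_sd(v, hi);       // low lane now holds lo + hi
        return _mm_cvtsd_f64(sum);
    }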

275 questions
0
votes
0 answers

Having an array of 16/32/64 bytes, how to quickly find the index of the first byte equal to a given one, using SSE2/AVX/AVX2/AVX-512

If I have an array of 16, 32 or 64 bytes (let's suppose it is aligned on a 64-byte memory boundary), how do I quickly find the index of the first byte equal to a given one, using SIMD SSE2/AVX/AVX2/AVX-512? If such a byte doesn't exist, for example you can return an index equal to…
Arty
  • 14,883
  • 6
  • 36
  • 69
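A minimal SSE2-only sketch of the pattern this question asks about, for a 16-byte block (the AVX2/AVX-512 versions follow the same compare-and-movemask idea); the helper name and the "return 16 if not found" convention are my own assumptions, and __builtin_ctz assumes GCC/Clang:

    #include <emmintrin.h>   // SSE2

    // index of the first byte equal to `needle` in the 16 aligned bytes at p,
    // or 16 if no byte matches
    int find_byte_16(const unsigned char *p, unsigned char needle) {
        __m128i data = _mm_load_si128((const __m128i *)p);        // aligned load
        __m128i eq   = _mm_cmpeq_epi8(data, _mm_set1_epi8((char)needle));
        int mask = _mm_movemask_epi8(eq);                          // one bit per byte
        if (mask == 0)
            return 16;
        return __builtin_ctz(mask);   // position of lowest set bit (MSVC: _BitScanForward)
    }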
0
votes
0 answers

Why do some SSE intrinsics introduce moves back and forth?

In my code, I set a 128-bit variable to zero. But I don't quite understand why it translates to two move instructions in assembly code? __m128i zeros = reinterpret_cast<__m128i>(_mm_setzero_pd()); Corresponding assembly code has two move back and…
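For what the question describes, the integer-domain zeroing intrinsic avoids the cast entirely; a minimal sketch (compilers typically emit a single pxor or xorps for this, though codegen details vary by compiler and options):

    #include <emmintrin.h>   // SSE2

    __m128i make_zero(void) {
        // integer-domain zero; no float<->integer reinterpretation needed
        return _mm_setzero_si128();
        // the cast in the question can also be written as an intrinsic:
        // return _mm_castpd_si128(_mm_setzero_pd());
    }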
0
votes
0 answers

Does packed data type in SSE2 imply alignment?

I'm writing some code that should utilize some type of vectorized instructions in order to compare two arrays consisting of 64-bit integers. I'm thinking of utilizing the SSE2 variant for cmpeq. The term I am getting stuck on is the term…
aahlback
  • 82
  • 1
  • 5
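A short sketch of the distinction this question is about: the __m128i type itself is 16-byte aligned, but pointers into arbitrary int64 arrays are not, so the load intrinsic has to match the data's actual alignment (the names and the 64-bit-equality trick below are mine, not from the question):

    #include <emmintrin.h>   // SSE2
    #include <stdint.h>

    // per-element 64-bit equality of two elements from each array
    __m128i cmp_two_i64(const int64_t *a, const int64_t *b) {
        // _mm_loadu_si128 tolerates any alignment; _mm_load_si128 requires
        // 16-byte alignment and may fault otherwise
        __m128i va   = _mm_loadu_si128((const __m128i *)a);
        __m128i vb   = _mm_loadu_si128((const __m128i *)b);
        __m128i eq32 = _mm_cmpeq_epi32(va, vb);                 // 32-bit lane equality (pcmpeqq is SSE4.1)
        __m128i swap = _mm_shuffle_epi32(eq32, _MM_SHUFFLE(2, 3, 0, 1));
        return _mm_and_si128(eq32, swap);   // a 64-bit lane is all-ones iff both halves matched
    }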
0
votes
1 answer

How to set an int32 value at some index within an __m128i with only SSE2?

Is there an SSE2 intrinsic that can set a single int32 value within an __m128i? Such as setting the value 1000 at index 1 of an __m128i that already contains 1,2,3,4? (which results in 1,1000,3,4)
markzzz
  • 47,390
  • 120
  • 299
  • 507
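One SSE2-only way to do what this question asks (a 32-bit insert only appears with SSE4.1's pinsrd) is two 16-bit inserts; the helper name is mine:

    #include <emmintrin.h>   // SSE2
    #include <stdint.h>

    // replace 32-bit element 1 of v with val, e.g. {1,2,3,4} -> {1,1000,3,4}
    __m128i set_lane1_epi32_sse2(__m128i v, int32_t val) {
        // 32-bit lane 1 occupies 16-bit lanes 2 and 3
        v = _mm_insert_epi16(v, val & 0xFFFF, 2);            // low half
        v = _mm_insert_epi16(v, (val >> 16) & 0xFFFF, 3);    // high half
        return v;
    }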
0
votes
1 answer

_mm_load_si128 loads data in reverse order

I am writing a C function with SSE2 intrinsics to essentially compare four 32-bit integers and check which are greater than zero, and give that result in the form of a 16-bit mask. I am using the following code to do this #include…
Josh Weinstein
  • 2,788
  • 2
  • 21
  • 38
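A sketch of the kind of result this question is after (compare four int32 against zero and get a per-lane mask). Note that element 0 of the vector corresponds to the lowest-addressed integer on this little-endian layout, which is what can look like "reverse order" in a debugger; the function name is mine:

    #include <emmintrin.h>   // SSE2

    // bit i of the result is set if values[i] > 0
    int mask_gt_zero(const int *values) {
        __m128i v  = _mm_loadu_si128((const __m128i *)values);
        __m128i gt = _mm_cmpgt_epi32(v, _mm_setzero_si128());
        // one bit per 32-bit lane; the cast is a bitwise reinterpretation, not a conversion
        return _mm_movemask_ps(_mm_castsi128_ps(gt));
    }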
0
votes
1 answer

Quick workaround for SSE2 movq instruction on non-SSE2 CPUs

How could I convert a movq SSE2 instruction into a simple code snippet which I could later patch into the original EXE that contained it? Please, if you could provide sample direct instructions to be used as a replacement "template", so much the…
MSC
  • 1
0
votes
3 answers

Visual Studio 2010 and SSE 4.2

I would like to know what is necessary to set in Visual Studio 2010 to have SSE 4.2 enabled. I would like to use it because of the optimized POPCNT... How can I test if all settings are OK? Thanks. Well, I tried to use your solution, however…
morph
  • 61
  • 1
  • 7
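A minimal check-then-use sketch for MSVC (the project settings are a separate issue); it assumes __cpuid from <intrin.h> and the POPCNT intrinsic from <nmmintrin.h>, both of which MSVC provides:

    #include <intrin.h>       // __cpuid (MSVC)
    #include <nmmintrin.h>    // _mm_popcnt_u32 (POPCNT / SSE4.2)

    int popcount32(unsigned x) {
        int regs[4];
        __cpuid(regs, 1);
        if (regs[2] & (1 << 23)) {            // CPUID.01H:ECX bit 23 = POPCNT supported
            return _mm_popcnt_u32(x);
        }
        // scalar fallback: clear the lowest set bit until none remain
        int n = 0;
        while (x) { x &= x - 1; ++n; }
        return n;
    }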
0
votes
2 answers

SSE2 double multiplication slower than with standard multiplication

I'm wondering why the following code with SSE2 instructions performs the multiplication slower than the standard C++ implementation. Here is the code: m_win = (double*)_aligned_malloc(size*sizeof(double), 16); __m128d* pData =…
pokey909
  • 1,797
  • 1
  • 16
  • 22
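For reference, a plain aligned SSE2 multiply loop looks like the sketch below (my own example, not the question's code); when the SIMD version comes out slower than scalar, the usual culprits are unaligned data, per-element loads, or a memory-bound loop where the multiply is not the bottleneck:

    #include <emmintrin.h>   // SSE2
    #include <stddef.h>

    // c[i] = a[i] * b[i]; all pointers 16-byte aligned, n a multiple of 2
    void mul_pd(const double *a, const double *b, double *c, size_t n) {
        for (size_t i = 0; i < n; i += 2) {
            __m128d va = _mm_load_pd(a + i);
            __m128d vb = _mm_load_pd(b + i);
            _mm_store_pd(c + i, _mm_mul_pd(va, vb));
        }
    }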
0
votes
1 answer

SSE2 registers in x86 assembly

I have the following code: global _start section .text input: mov edx, buffer_size mov ecx, buffer mov ebx, 0 ; stdin mov eax, 3 ; read int 0x80 …
user13385400
0
votes
1 answer

How to add to variable using SSE2?

How to "add to" variable using SSE2? I've recently been working with SSE2 in C++ to optimize a few math functions up, but ran into a problem when attempting to add to existing variables. I have a function which intakes variables like so: _m128d v1…
user14598236
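A sketch of the pattern this question is about: intrinsics return their result rather than modifying an operand in place, so "adding to" a variable is an assignment back to it (everything here beyond _mm_add_pd/_mm_mul_pd is my own example):

    #include <emmintrin.h>   // SSE2

    double example(void) {
        __m128d v1  = _mm_set_pd(3.0, 4.0);   // low element 4.0, high element 3.0
        __m128d v2  = _mm_set_pd(1.0, 2.0);
        __m128d acc = _mm_setzero_pd();

        acc = _mm_add_pd(acc, _mm_mul_pd(v1, v2));   // acc += v1 * v2, element-wise

        double out[2];
        _mm_storeu_pd(out, acc);
        return out[0] + out[1];
    }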
0
votes
2 answers

How do you do signed 32bit widening multiplication on SSE2?

This question came up when reviewing the WebAssembly SIMD proposal for extended multiplication. To support older hardware, we need to support SSE2 and the only vector multiplication operation for 32 bit integers is pmuludq. (Signed pmuldq was only…
Dan Weber
  • 401
  • 2
  • 9
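One known SSE2-only approach (a sketch of the standard sign-correction trick, not taken from the WebAssembly proposal): compute the unsigned 64-bit products with pmuludq, then subtract the terms that account for negative inputs.

    #include <emmintrin.h>   // SSE2

    // signed 32x32 -> 64-bit products of elements 0 and 2 of a and b
    // (the lanes PMULUDQ reads)
    __m128i mul_widen_even_epi32_sse2(__m128i a, __m128i b) {
        __m128i prod  = _mm_mul_epu32(a, b);        // unsigned 64-bit products
        // signed result = a_u*b_u - ((a<0 ? b_u : 0) + (b<0 ? a_u : 0)) << 32   (mod 2^64)
        __m128i a_neg = _mm_srai_epi32(a, 31);      // all-ones in lanes where a < 0
        __m128i b_neg = _mm_srai_epi32(b, 31);
        __m128i corr  = _mm_add_epi32(_mm_and_si128(b, a_neg),
                                      _mm_and_si128(a, b_neg));
        corr = _mm_slli_epi64(corr, 32);            // lanes 0 and 2 move to the high halves
        return _mm_sub_epi64(prod, corr);
    }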
0
votes
0 answers

How do I compare 2 XMM registers (SSE) and test for equality to break out of loop?

After using the assembly instruction: pcmpeqd xmm2, xmm7 The result in register xmm2 = 00000000 00000000 FFFFFFFF 00000000 The result is correct. Unfortunately the comparison sets no flags that can be tested to break out of the loop. Also any…
John J
  • 31
  • 5
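The usual SSE2 answer to what this question runs into (vector compares do not set EFLAGS) is to collapse the compare result with pmovmskb and test the scalar mask; a sketch in intrinsics, assuming "all four dwords equal" is the loop-exit condition:

    #include <emmintrin.h>   // SSE2

    // returns non-zero if every 32-bit element of a equals the corresponding one in b
    int all_equal_epi32(__m128i a, __m128i b) {
        __m128i eq = _mm_cmpeq_epi32(a, b);        // pcmpeqd
        return _mm_movemask_epi8(eq) == 0xFFFF;    // all 16 byte-mask bits set
    }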
0
votes
0 answers

Can _mm_store_si128 / _mm_load_si128 intrinsics be used to implement a 128-bit atomic type?

If I want to implement a 128-bit atomic type on x64, can I get away with _mm_store_si128 and _mm_load_si128 to avoid cmpxchg16b for relaxed load and store? (If needed, can assume that only load and store are needed, although it would be good if I can mix…
Alex Guteniev
  • 12,039
  • 2
  • 34
  • 79
0
votes
0 answers

Optimal conversion of BGRA buffer to AYUV

I have a BGRA buffer, and need to convert it to AYUV format. The function below is working properly, but is very inefficient. I know I can make it marginally faster by moving some operations out of the inner loop, but I would really like to not do…
user2349195
  • 62
  • 1
  • 8
0
votes
2 answers

SSE2 test xmm bitmask directly without using 'pmovmskb'

consider we have this: .... pxor xmm1, xmm1 movdqu xmm0, [reax] pcmpeqb xmm0, xmm1 pmovmskb eax, xmm0 test ax , ax jz .zero ... is there any way to not use 'pmovmskb' and test the bitmask…
ELHASKSERVERS
  • 195
  • 1
  • 10
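For reference, the common way to drop the scalar-mask step only appears with SSE4.1: PTEST sets ZF directly from the vector, so no pmovmskb is needed (this is outside plain SSE2, which is the question's constraint); a sketch:

    #include <smmintrin.h>   // SSE4.1: _mm_testz_si128

    // returns non-zero if the compare result is all zero (i.e. no byte matched)
    int no_match(__m128i cmp_result) {
        return _mm_testz_si128(cmp_result, cmp_result);   // PTEST: ZF=1 when the AND is all zero
    }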