Questions tagged [sse2]

x86 Streaming SIMD Extensions 2 (SSE2) adds support for packed-integer and double-precision floating-point operations in the 128-bit XMM vector registers. It is always supported on x86-64, and supported on nearly every x86 CPU from 2003 or later.

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions, and the SSE tag wiki for other SSE- and SSE2-related resources.


SSE2 is one of the SSE family of x86 instruction-set extensions.

SSE2 adds support for double-precision floating point and packed integers (8bit to 64bit elements) in XMM registers. It is baseline in x86-64, so 64bit code can always assume SSE2 support without having to check. 32bit code could still be run on a CPU from before 2003 (Athlon XP or Pentium III) that doesn't support SSE2, but this is unlikely for most newly written code, so an MMX or original-SSE fallback is rarely worth writing.
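As a rough illustration of the two data types mentioned above, here is a minimal sketch using the standard `<emmintrin.h>` intrinsics. The function names `add_bytes_sse2` and `mul_pair_sse2` are made up for this example, and the byte loop assumes the length is a multiple of 16.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Add two byte arrays element-wise, 16 bytes per iteration.
   Assumption for this sketch: n is a multiple of 16. */
void add_bytes_sse2(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        /* Unaligned loads, so the pointers need no special alignment */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* paddb: sixteen independent 8-bit additions, wrapping on overflow */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi8(va, vb));
    }
}

/* Multiply two pairs of doubles with a single packed multiply (mulpd). */
void mul_pair_sse2(double dst[2], const double a[2], const double b[2])
{
    _mm_storeu_pd(dst, _mm_mul_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}
```

On x86-64 this compiles without extra flags; for 32bit targets GCC and Clang typically need `-msse2`.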

Most tasks that benefit from vectors at all can be done fairly efficiently using only instructions up to SSE2. This is fortunate, because widespread support for later SSE versions took time. Use of later SSE extensions typically saves a couple of instructions here and there, usually with only minor speed-ups. Notably absent until SSSE3 was PSHUFB, a byte shuffle whose operation is controlled by elements in a register rather than by a compile-time constant imm8. It can do things that SSE2 can't do efficiently at all.
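To make the imm8 restriction concrete, here is a small sketch of an SSE2 shuffle whose pattern is fixed at compile time. The function name `reverse4_sse2` is hypothetical; `_mm_shuffle_epi32` (PSHUFD) and `_MM_SHUFFLE` are the standard intrinsic and macro.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Reverse the four 32-bit elements of a vector with pshufd.
   The pattern _MM_SHUFFLE(0, 1, 2, 3) is an immediate: it must be a
   compile-time constant and cannot come from a register. */
void reverse4_sse2(int32_t out[4], const int32_t in[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* Result element 0 takes source element 3, element 1 takes 2, and so on */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_si128((__m128i *)out, r);
}
```

A shuffle pattern that is only known at run time cannot be passed here, which is the gap PSHUFB later filled.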

AVX provides VEX-encoded versions of the SSE2 instructions, which allow a separate destination operand (3-operand form) instead of overwriting a source.

History

Intel introduced SSE2 with their Pentium 4 design in 2001.

SSE2 was adopted by AMD for its 64bit CPU line in 2003/2004. As of 2009, few if any x86 CPUs in significant use lack the SSE2 instruction set. This makes it extremely attractive on the Windows PC platform: it offers a large feature set that can practically be treated as a "minimum requirement". In 32bit mode, however, this does not remove the need to check processor features.
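One way such a feature check might look is sketched below. It assumes GCC or Clang targeting x86, since `__builtin_cpu_init` and `__builtin_cpu_supports` are compiler builtins rather than ISO C; the function name `cpu_has_sse2` is made up for this example.

```c
/* Returns nonzero if the running CPU reports SSE2.
   Assumption: GCC or Clang on x86; other compilers need a different
   mechanism (for example a CPUID query). */
int cpu_has_sse2(void)
{
#if defined(__x86_64__)
    return 1;  /* SSE2 is baseline on x86-64, no runtime check needed */
#else
    __builtin_cpu_init();                   /* initialise the CPU feature data */
    return __builtin_cpu_supports("sse2");  /* nonzero if SSE2 is available */
#endif
}
```

A 32bit program would typically call this once at start-up and pick either an SSE2 code path or a scalar fallback.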

More recent instruction sets introduce fewer features, which are often highly specialized, and are supported inconsistently between manufacturers and by a significantly smaller share of processors (10-50% in 2009).

SSE2 does not offer instructions for horizontal addition, which are needed for some geometric calculations (e.g. dot product) and complex arithmetic. This functionality has to be emulated with one or more shuffles, which are often not significantly slower than the dedicated instructions in later revisions.
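The shuffle-based emulation could look roughly like this. It is a sketch of one common pattern, not the only possible sequence; the names `hsum_ps_sse2`, `hsum_pd_sse2` and `dot4_sse2` are hypothetical, while the intrinsics are the standard ones from `<emmintrin.h>`.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (pulls in the SSE1 float ones too) */

/* Sum the four floats of v with two shuffle+add steps instead of haddps. */
float hsum_ps_sse2(__m128 v)
{
    /* Swap adjacent pairs: [a b c d] -> [b a d c] */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(v, shuf);   /* [a+b, a+b, c+d, c+d] */
    /* Move the high pair of sums into the low half */
    shuf = _mm_movehl_ps(shuf, sums);    /* low elements are now c+d */
    sums = _mm_add_ss(sums, shuf);       /* element 0 = (a+b) + (c+d) */
    return _mm_cvtss_f32(sums);
}

/* Sum the two doubles of v: one unpack plus one scalar add. */
double hsum_pd_sse2(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);  /* [v1, v1] */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));
}

/* 4-element dot product: packed multiply, then the horizontal sum above. */
float dot4_sse2(const float *a, const float *b)
{
    return hsum_ps_sse2(_mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

In a loop over many vectors it is usually better to accumulate vertically with `_mm_add_ps` and do a single horizontal sum at the end.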

275 questions
0
votes
1 answer

latency for 'pcmpeqb' - memory vs xmm register

i have these 2 options: option 1: loop: ... movdqu xmm0, [rax] pcmpeqb xmm0, [.zero_table] ... ... align 16 .zero_table: DQ 0, 0 option 2: pxor xmm1, xmm1 loop: ... movdqu xmm0, [rax] pcmpeqb xmm0, xmm1 ... …
ELHASKSERVERS
  • 195
  • 1
  • 10
0
votes
1 answer

Why AVX2 and SSE2 bitwise OR operators are not faster than a simple | operator?

I am trying to speed up a bitwise OR operation for very long binary vectors using 32-bit integers. In this example we can assume that nwords is the number of words and that it is a multiple of 4 and 8. Hence, no loop remainder. This binary vector can…
Liotro78
  • 111
  • 5
0
votes
2 answers

Mapping hex digits to contiguous integers: GCC's switch works 1.5x faster than my hand-written SSE2 intrinsics cmpeq / movemask / bsf?

I have a function which takes a character, checks it, and returns another character (depending on the received character). I used a switch to check the provided character and return what we want, but I need more speed, so I used SSE2 too. My SSE2…
Jason
  • 75
  • 1
  • 6
0
votes
1 answer

Sorting tuples inside signed integers

I'm sorting tuples of 16+16 bits as 32bit integers with SSE2. There are only signed integer instructions for compare and min/max. I don't have a problem with the order for the higher part as it's just a hash. But entries with negative hashes will be…
alecco
  • 2,914
  • 1
  • 28
  • 37
0
votes
1 answer

How to convert a ps vector of 4 float to 4 doubles and store to a pd array?

Is it possible with SSE2/SIMD to store __m128 values (4 float) to an array of double? I need to switch from this code: double *pC = c[voiceIndex]; __m128d v_result; _mm_store_pd(pC, v_result); to this: double *pC = c[voiceIndex]; __m128…
markzzz
  • 47,390
  • 120
  • 299
  • 507
0
votes
1 answer

How to convert two _pd into one _ps?

I'm looping over some data, calculating some doubles, and every 2 __m128d operations I want to store the data in a __m128 of floats. So 64+64 + 64+64 (2 __m128d) stored into 1 32+32+32+32 __m128. I do something like this: __m128d v_result; __m128…
markzzz
  • 47,390
  • 120
  • 299
  • 507
0
votes
1 answer

SSE2 Instruction, PMULUDQ Multiplication Question

In the code I am debugging, there's an assembly instruction as shown below: pmuludq xmm6, xmm1 xmm6 = 0x3736353433323130 xmm1 = 0x7D35343332313938 If I multiply the above 2 numbers using Python, I get the result as shown below: >>>…
Neon Flash
  • 3,113
  • 12
  • 58
  • 96
0
votes
0 answers

Why doesn't clang allocate a constant to a register?

I'm looking at clang's output, to see what the C code: (mask==0xffff ? one : zero) This produces, where one is set like this: const __m128i one = _mm_set_epi64x(0, 1); And the assembly output: 4e0: 66 0f d7 c0 pmovmskb eax, xmm0 4e4: …
elmattic
  • 12,046
  • 5
  • 43
  • 79
0
votes
1 answer

GCC support for XMM registers badly broken?

Whenever I examine the assembly code produced by GCC for code that uses the __m128i type, I see what looks like a catastrophe. There's tons of redundant instructions that serve no purpose. And yet, as an assembly programmer I'd rather use asm{} but…
user654241
  • 79
  • 3
0
votes
0 answers

Inline SSE2 assembly crashing on data change

I have the following code in C++. Pointers _p_s1 and _p_s2 point to slices (every second video line) in a bigger memory area holding a video frame (let's call this *pFrameData). Whenever data changes in the memory area pointed to by pFrameData,…
jpou
  • 1,935
  • 2
  • 21
  • 30
0
votes
1 answer

Regarding QT creator in Lubuntu 16.04 (i386) ICOP board

I have installed qtcreator in lubuntu 16.04 and when trying to open it, I am getting an error: This program requires an x86 processor that supports SSE2 extension, at least a Pentium 4 or newer Aborted (core dumped) Can someone help me to solve…
harsha
  • 25
  • 4
0
votes
0 answers

Why my kernel module performs float division perfectly?

I'm trying to use float and double data types inside the kernel module. As part of satisfying my curiosity, I have written simple LKM. Here it is, #include #include #include static int __init…
0
votes
1 answer

Intel load intrinsic issue

The purpose of the code is to subtract from each character of the string str a value in the key array. The non-vectorised version of the program corresponds to the last cycle in both programs. How is this code: void decode(const char* key, int m,…
spallas
  • 188
  • 1
  • 4
0
votes
2 answers

MSI install condition to check for CPU's SSE2 feature?

Starting with Visual Studio 2012, the SSE2 compile options are enabled by default. For me, too, it's about time to go ahead and use that feature, and no longer manually disable that flag for my projects. However, I have seen many occasions…
Opmet
  • 1,754
  • 18
  • 20
0
votes
2 answers

Auto-vectorization in visual studio 2012 on vectors of Eigen type is not performing well

I have a std::vector of Eigen::vector3d and I am compiling this code using Microsoft Visual Studio 2012 with the /Qvec-report:2 flag on to report vectorization details. It shows: Loop not vectorized due to reason 1304 (Loop…