Questions tagged [sse2]

x86 Streaming SIMD Extensions 2 adds support for packed integers and double-precision floats in the 128-bit XMM vector registers. It is always supported on x86-64, and supported on every x86 CPU from 2003 or later.

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions, and the SSE tag wiki for other SSE- and SSE2-related resources.


SSE2 is one of the SSE family of x86 instruction-set extensions.

SSE2 adds support for double-precision floating point and packed integers (8-bit to 64-bit elements) in XMM registers. It is baseline in x86-64, so 64-bit code can always assume SSE2 support without having to check. 32-bit code could still be run on a CPU from before 2003 (Athlon XP or Pentium III) that didn't support SSE2, but this is unlikely for most newly-written code, so an MMX or original-SSE fallback is rarely worth writing.
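As a rough illustration of the two data types mentioned above, here is a minimal sketch using the standard SSE2 intrinsics from `<emmintrin.h>`. The function names `add2_f64` and `add8_i16` are made up for this example; unaligned loads and stores are used so the callers need no special alignment.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Add two pairs of doubles with a single packed-double add (ADDPD). */
static void add2_f64(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}

/* Add eight pairs of 16-bit integers with a single packed add (PADDW).
   Each element wraps around on overflow. */
static void add8_i16(const int16_t *a, const int16_t *b, int16_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
}
```

With GCC or Clang this should build as-is for x86-64; a 32-bit build would typically need `-msse2`.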

Most tasks that benefit from vectors at all can be done fairly efficiently using only instructions up to SSE2. This is fortunate, because widespread support for later SSE versions took time. Later SSE extensions typically save a couple of instructions here and there, usually with only minor speed-ups. Notably absent until SSSE3 was PSHUFB, a byte shuffle whose operation is controlled by elements in a register rather than by a compile-time constant imm8. It can do things that SSE2 can't do efficiently at all.
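To make the imm8 point concrete, here is a small sketch of an SSE2 shuffle whose control is fixed at compile time. The helper name `reverse4_i32` is hypothetical; `_mm_shuffle_epi32` (PSHUFD) and the `_MM_SHUFFLE` macro are the standard intrinsics. A shuffle pattern that is only known at run time cannot be expressed this way, which is the gap PSHUFB later filled.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Reverse the order of four 32-bit elements with one PSHUFD.
   The shuffle control must be a compile-time constant (imm8). */
static void reverse4_i32(const int32_t *in, int32_t *out)
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* _MM_SHUFFLE(0,1,2,3): result element 0 takes source element 3,
       element 1 takes source 2, element 2 takes source 1, element 3 takes source 0. */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_si128((__m128i *)out, r);
}
```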

AVX provides 3-operand (non-destructive, VEX-encoded) versions of the SSE2 instructions.

History

Intel introduced SSE2 with their Pentium 4 design in 2001.

AMD adopted SSE2 for its 64-bit CPU line in 2003/2004. As of 2009, few if any x86 CPUs (at least, in any significant numbers) lacked the SSE2 instruction set. This makes it extremely attractive on the Windows PC platform: it offers a large feature set that can practically be treated as an omnipresent "minimum requirement". In 32-bit mode, however, this does not remove the need to check processor features.
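One possible shape for such a feature check is sketched below. It assumes GCC or Clang: `__SSE2__` is the macro those compilers define when SSE2 code generation is enabled, and `__builtin_cpu_supports` is their run-time CPU query builtin. The function name `have_sse2` is made up for this example, and other compilers (for instance MSVC) would need a different mechanism.

```c
/* Returns nonzero when SSE2 code paths may be used. */
static int have_sse2(void)
{
#if defined(__SSE2__)
    /* The compiler already targets SSE2 (always the case for x86-64 builds),
       so no run-time check is needed. */
    return 1;
#elif (defined(__GNUC__) || defined(__clang__)) && defined(__i386__)
    /* 32-bit x86 build without SSE2 enabled: ask the CPU at run time. */
    return __builtin_cpu_supports("sse2");
#else
    /* Unknown compiler or non-x86 target: conservatively report no SSE2. */
    return 0;
#endif
}
```

A 32-bit program would typically call this once at start-up and select an SSE2 or scalar code path through a function pointer.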

More recent instruction sets introduce fewer features, which are often highly specialized, and which are supported inconsistently between manufacturers and by a significantly smaller share of processors (10-50% in 2009).

SSE2 does not offer instructions for horizontal addition, which is needed for some geometric calculations (e.g. dot product) and for complex arithmetic. This functionality has to be emulated with one or more shuffles, which are often not significantly slower than the dedicated instructions in later extensions.
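The shuffle-based emulation can be sketched as follows, using only SSE2 intrinsics. The helper names `hsum_pd`, `dot2_pd` and `hsum_epi32` are hypothetical; this is one common pattern among several, not the only way to do it.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Horizontal sum of the two doubles in v. */
static double hsum_pd(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);        /* [v1, v1] */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));   /* low element: v0 + v1 */
}

/* Dot product of two 2-element double vectors (unaligned loads). */
static double dot2_pd(const double *a, const double *b)
{
    __m128d prod = _mm_mul_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    return hsum_pd(prod);
}

/* Horizontal sum of the four 32-bit integers in v (wraps on overflow). */
static int32_t hsum_epi32(__m128i v)
{
    /* Swap the two 64-bit halves, then add: element 0 = v0+v2, element 1 = v1+v3. */
    __m128i swapped64 = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum64 = _mm_add_epi32(v, swapped64);
    /* Swap adjacent 32-bit elements, then add: element 0 = v0+v1+v2+v3. */
    __m128i swapped32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    return _mm_cvtsi128_si32(_mm_add_epi32(sum64, swapped32));
}
```

In a loop, it is usually better to accumulate vertically with packed adds and do a single horizontal sum at the end.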

275 questions
5
votes
1 answer

SSE2 instruction to typecast an integer register to short register and vice-versa

Is there any SSE2 instruction to typecast an integer register to short register and vice-versa? Please suggest.
Andy
  • 157
  • 1
  • 6
5
votes
1 answer

#error "SSE2 instruction set not enabled" when including

I'm trying to compile some C++ code with cmake and make that uses the include and get the following make error: #error "SSE2 instruction set not enabled" I have an Intel Celeron Dual Core processor with a Linux (Mint) system (Kernel…
Suzana
  • 4,251
  • 2
  • 28
  • 52
5
votes
2 answers

strange error during cast to __m128i

I'm trying to cast unsigned short array to __m128i: const unsigned short x[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}; const unsigned short y[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}; __m128i n = *(__m128i*)…
stack_user
  • 59
  • 1
  • 2
4
votes
1 answer

Parallelizing code sse intrinsics functions c with openmp

I have a c code with intrinsics functions sse2. I am trying to parallelize this code. This code has recursive type sentences: *dex = _mm_add_pd(*dex,temp2); I can not use the clause reduction, because I think that can only be used with type…
user1260391
  • 1,237
  • 2
  • 10
  • 6
4
votes
1 answer

SSE2 for double calculations with GCC

How can I use SSE2 in GCC? I want to work with double values. I search s.th. like this: http://vrm-vrm.blogspot.com/2009/10/gcc-intrinsics.html only for double values.
cl_progger
  • 413
  • 2
  • 6
  • 10
4
votes
1 answer

What are the names and meanings of the intrinsic vector element types, like epi64x or pi32?

The intel intrinsic functions have the subtype of the vector built into their names. For example, _mm_set1_ps is a ps, which is a packed single-precision aka. a float. Although the meaning of most of them is clear, their "full name" like packed…
Brotcrunsher
  • 1,964
  • 10
  • 32
4
votes
2 answers

How to best emulate the logical meaning of _mm_slli_si128 (128-bit bit-shift), not _mm_bslli_si128

Looking through the intel intrinsics guide, I saw this instruction. Looking through the naming pattern, the meaning should be clear: "Shift 128-bit register left by a fixed number of bits", but it is not. In actuality it shifts by a fixed number of…
lennartVH01
  • 198
  • 1
  • 8
4
votes
2 answers

SSE4.1 unsigned integer comparison with overflow

Is there any way to perform a comparison like C >= (A + B) with SSE2/4.1 instructions considering 16 bit unsigned addition (_mm_add_epi16()) can overflow? The code snippet looks like- #define _mm_cmpge_epu16(a, b) _mm_cmpeq_epi16(_mm_max_epu16(a,…
Kaustubh
  • 73
  • 4
4
votes
2 answers

Why do x86 FP compares set CF like unsigned integers, instead of using signed conditions?

The following documentation is provided in the Intel Instruction Reference for the COMISD instruction: Compares the double-precision floating-point values in the low quadwords of operand 1 (first operand) and operand 2 (second operand), and…
St.Antario
  • 26,175
  • 41
  • 130
  • 318
4
votes
2 answers

Add+Mul become slower with Intrinsics - where am I wrong?

Having this array: alignas(16) double c[voiceSize][blockSize]; This is the function I'm trying to optimize: inline void Process(int voiceIndex, int blockSize) { double *pC = c[voiceIndex]; double value = start + step * delta; double…
markzzz
  • 47,390
  • 120
  • 299
  • 507
4
votes
1 answer

How to load two packed 64-bit quadwords into a 128-bit xmm register

I have two UInt64 (i.e. 64-bit quadword) integers. they are aligned to an 8-byte (sizeof(UInt64)) boundary (i could also align them to 16-byte if that's useful for anything) they are packed together so they are side-by-side in memory How do i load…
Ian Boyd
  • 246,734
  • 253
  • 869
  • 1,219
4
votes
1 answer

Speeding up some SSE2 Intrinsics for color conversion

I'm trying to perform image colour conversion from YCbCr to BGRA (Don't ask about the A bit, such a headache). Anyway, this needs to perform as fast as possible, so I've written it using compiler intrinsics to take advantage of SSE2. This is my…
Ali Parr
  • 4,737
  • 3
  • 31
  • 35
4
votes
1 answer

Fast copy every second byte to new memory area

I need a fast way to copy every second byte to a new malloc'd memory area. I have a raw image with RGB data and 16 bits per channel (48 bit) and want to create an RGB image with 8 bits per channel (24 bit). Is there a faster method than copying…
akw
  • 2,090
  • 1
  • 15
  • 21
4
votes
2 answers

Converting unsigned chars to float in assembly (to prepare for float vector calculations)

I am trying to optimize a function using SSE2. I'm wondering if I can prepare the data for my assembly code better than this way. My source data is a bunch of unsigned chars from pSrcData. I copy it to this array of floats, as my calculation…
Warpin
  • 6,971
  • 12
  • 51
  • 77
4
votes
4 answers

How To Store Values In Non-Contiguous Memory Locations With SSE Intrinsics?

I'm very new to SSE and have optimized a section of code using intrinsics. I'm pleased with the operation itself, but I'm looking for a better way to write the result. The results end up in three _m128i variables. What I'm trying to do is store…
Scott
  • 766
  • 7
  • 20