Questions tagged [sse2]

x86 Streaming SIMD Extensions 2 adds support for packed integer and double-precision floating-point data in the 128-bit XMM vector registers. It is always supported on x86-64, and supported on every x86 CPU from 2003 or later.

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions, and the SSE tag wiki for other SSE- and SSE2-related resources.


SSE2 is one of the SSE family of x86 instruction-set extensions.

SSE2 adds support for double-precision floating point and packed integers (8-bit to 64-bit elements) in XMM registers. It is baseline in x86-64, so 64-bit code can always assume SSE2 support without having to check. 32-bit code could still be run on a CPU from before 2003 (Athlon XP or Pentium III) that doesn't support SSE2, but this is unlikely for most newly written code, so an MMX or original-SSE fallback is usually not worth writing.
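As a rough illustration of the two data types mentioned above, here is a minimal sketch using the standard SSE2 intrinsics from `<emmintrin.h>`. The function names `add_f64` and `add_i32` are hypothetical, chosen only for this example; unaligned loads and stores are used so the arrays need no special alignment.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* out[i] = a[i] + b[i] for doubles, two elements per ADDPD. */
void add_f64(const double *a, const double *b, double *out, size_t n)
{
    assert(a && b && out);
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }
    for (; i < n; ++i)              /* scalar tail for odd n */
        out[i] = a[i] + b[i];
}

/* out[i] = a[i] + b[i] for 32-bit integers, four elements per PADDD.
   PADDD wraps on overflow, so the scalar tail adds as unsigned to match. */
void add_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    assert(a && b && out);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; ++i)              /* scalar tail for the last n % 4 elements */
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
}
```

On x86-64 this compiles without extra flags; for a 32-bit target, GCC and Clang typically need `-msse2`.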

Most tasks that benefit from vectors at all can be done fairly efficiently using only instructions up to SSE2. This is fortunate, because widespread support for later SSE versions took time. Later SSE extensions typically save a couple of instructions here and there, usually with only minor speed-ups. Notably absent until SSSE3 was PSHUFB, a shuffle whose operation is controlled by elements in a register rather than by a compile-time constant imm8. It can do things that SSE2 can't do efficiently at all.
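To make the imm8 limitation concrete, here is a hedged sketch of broadcasting one 32-bit lane with `_mm_shuffle_epi32` (PSHUFD). Because the shuffle control must be a compile-time constant, choosing the lane at run time needs a branch over the possible constants. The helper name `broadcast_lane_epi32` is made up for this example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */

/* Return a vector with all four 32-bit lanes equal to v[lane].
   Each case uses a different compile-time imm8, since PSHUFD
   cannot take its control from a register. */
__m128i broadcast_lane_epi32(__m128i v, int lane)
{
    assert(lane >= 0 && lane < 4);
    switch (lane) {
    case 0:  return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 0, 0, 0));
    case 1:  return _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 1, 1, 1));
    case 2:  return _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 2, 2, 2));
    default: return _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 3, 3));
    }
}
```

With SSSE3, the same selection could instead come from a control vector passed to PSHUFB, with no branching.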

AVX provides 3-operand versions of all SSE2 instructions.

History

Intel introduced SSE2 with their Pentium 4 design in 2001.

AMD adopted SSE2 for its 64-bit CPU line in 2003/2004. As of 2009, few if any x86 CPUs (at least, in any significant numbers) lack the SSE2 instruction set. This makes it extremely attractive on the Windows PC platform: it offers a large feature set that can practically be treated as a "minimum requirement". In 32-bit mode, however, this does not remove the need to check processor features.

More recent instruction sets introduce fewer features, which are often highly specialized, and are supported inconsistently between manufacturers and by a significantly smaller share of processors (10-50% in 2009).

SSE2 does not offer instructions for horizontal addition, which is needed for some geometric calculations (e.g. dot product) and complex arithmetic. This functionality has to be emulated with one or more shuffles, which are often not significantly slower than the dedicated instructions in later extensions.
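One common way to emulate the horizontal add with shuffles is sketched below. This is only one of several possible shuffle sequences, and the function names `hsum_ps`, `hsum_pd` and `dot_ps` are hypothetical. For simplicity, `dot_ps` assumes the element count is a multiple of 4.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 (pulls in the SSE1 float intrinsics too) */
#include <stddef.h>

/* Sum the four floats in v using two shuffles and two adds. */
float hsum_ps(__m128 v)
{
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* [v1, v0, v3, v2] */
    __m128 sums = _mm_add_ps(v, shuf);       /* [v0+v1, v0+v1, v2+v3, v2+v3] */
    shuf = _mm_movehl_ps(shuf, sums);        /* low lane now holds v2+v3 */
    sums = _mm_add_ss(sums, shuf);           /* low lane = v0+v1+v2+v3 */
    return _mm_cvtss_f32(sums);
}

/* Sum the two doubles in v. */
double hsum_pd(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);      /* [v1, v1] */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi)); /* v0 + v1 */
}

/* Dot product of two float arrays; n must be a multiple of 4 in this sketch. */
float dot_ps(const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        __m128 prod = _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        acc = _mm_add_ps(acc, prod);         /* vertical adds inside the loop */
    }
    return hsum_ps(acc);                     /* one horizontal sum at the end */
}
```

Keeping the accumulation vertical and doing a single horizontal sum after the loop means the shuffle cost is paid once per call rather than once per iteration.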

275 questions
3
votes
1 answer

Test if any byte in an xmm register is 0

I am currently teaching myself SIMD and am writing a rather simple String processing subroutine. I am however restricted to SSE2, which makes me unable to utilize ptest to find the null terminal. The way I am currently trying to find the null…
Liqs
  • 137
  • 1
  • 9
3
votes
1 answer

Bit vector operation with AVX2 and SSE2

I am new to AVX2 and SSE2 instruction sets, and I want to learn more on how to use such instruction sets to speed-up bit vector operations. So far I have used them successfully to vectorize the codes with double / float operations. In this example,…
Liotro78
  • 111
  • 5
3
votes
1 answer

How can a SSE2 function be missing from the header it is supposed to be in?

I am working with SSE2 instructions on VS2013 and I realized that some functions in the Intel documentation are missing from the header they are supposed to be in. The method void _mm_storeu_si32 (void* mem_addr, __m128i a) should be in #include…
Norgannon
  • 487
  • 4
  • 16
3
votes
1 answer

How to floor/int in double using only SSE2?

In float, it seems pretty easy to floor() and then int(), such as: float z = floor(LOG2EF * x + 0.5f); const int32_t n = int32_t(z); becomes: __m128 z = _mm_add_ps(_mm_mul_ps(log2ef, x), half); __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(z)); z =…
markzzz
  • 47,390
  • 120
  • 299
  • 507
3
votes
2 answers

What is the SSE2 assembly equivalent of intrinsics?

I'm using Fasm (assembly) and I am looking for SSE2 assembly instructions equivalents of these intrinsics instructions: _mm_set1_epi8 _mm_cmpeq_epi8 _mm_movemask_epi8 Where do I get them (web site, pdf...) ?
3
votes
1 answer

Using % with SSE2?

Here's the code I'm trying to convert to SSE2: double *pA = a; double *pB = b[voiceIndex]; double *pC = c[voiceIndex]; double *left = audioLeft; double *right = audioRight; double phase = 0.0; double bp0 = mNoteFrequency * mHostPitch; for (int…
markzzz
  • 47,390
  • 120
  • 299
  • 507
3
votes
2 answers

Where do SSE2 intrinsics store results?

I'm moving the first steps into SSE2 in C++. Here's the intrinsic I'm learning right now: __m128d _mm_add_pd (__m128d a, __m128d b) The document says: Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in…
markzzz
  • 47,390
  • 120
  • 299
  • 507
3
votes
1 answer

How to declare __m128i constant in MASM?

align(16) __xmm@200020000a4f0a4f6621662170707070 xmmword 200020000a4f0a4f6621662170707070h and __xmm@200020000a4f0a4f6621662170707070 xmmword 0x200020000a4f0a4f6621662170707070 Both fail, the compiler saying error A2138: invalid data initializer
Soonts
  • 20,079
  • 9
  • 57
  • 130
3
votes
2 answers

Loading xmm register with two UInt64s that are in a pointed to array

I'm trying to load a 128-bit xmm register with two UInt64 integer in Delphi (XE6). Background An XMM register is 128-bits, and can be loaded with multiple, independent, integers. You can then have the CPU add those multiple integers all in parallel.…
Ian Boyd
  • 246,734
  • 253
  • 869
  • 1,219
3
votes
4 answers

How to make the following code faster

int u1, u2; unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40]; 64 bits long res1, res2 initialized to zero. l = 60; while (l) { for (i = 0; i < 20; i += 2) { u1 = (elm1[i] >> l) & 15; u2 =…
anup
  • 529
  • 5
  • 14
3
votes
2 answers

info C5012: loop not parallelized due to reason '1008'

I am trying out the Auto-Vectorizer mode of Visual Studio 2013 on x86_64, and I am a bit surprised with the following. Consider the naive code: static void rescale( double * __restrict out, const int * __restrict in, long n, const double intercept,…
malat
  • 12,152
  • 13
  • 89
  • 158
3
votes
2 answers

Broadcast one arbitrary element of __m128 vector

I need to broadcast one arbitrary element of __m128 vector. For example the second element: __m128 a = {a0, a1, a2, a3}; __m128 b = {a1, a1, a1, a1}; I know that there are intrinsics _mm_set1_ps(float) and _mm_broadcast_ss(float*). But these…
Alex
  • 65
  • 5
3
votes
1 answer

Dot production using sse

#define Size 50000 void main() { unsigned char *arry1 = (unsigned char*)malloc(sizeof(unsigned char)* Size); unsigned char *arry2 = (unsigned char*)malloc(sizeof(unsigned char)* Size); unsigned int *result = (unsigned…
eonjeo
  • 35
  • 1
  • 6
3
votes
3 answers

SSE mov instruction that can skip every 2nd byte?

I need to copy all the odd numbered bytes from one memory location to another. i.e. copy the first, third, fifth etc. Specifically I'm copying from the text area 0xB8000 which contains 2000 character/attribute words. I want to skip the attribute…
poby
  • 1,572
  • 15
  • 39
3
votes
1 answer

How to Multiply 2 16 bit vectors and store result in 32 bit vector in sse?

I need to multiply 2 16 bit vectors and want to get output in 32 bit vectors due to overflow issue similar as below. A = [ 1, 2, 3, 4, 5, 6, 7, 8] B = [ 1, 3, 5, 6, 8, 9, 10 ,12 ] C1= [ 1*1 + 2*3, 3*5, 4*6] c2= [ 5*8, 6* 9, 7*10, 8*12…
Bharat Ahuja
  • 394
  • 2
  • 15