Questions tagged [sse2]

x86 Streaming SIMD Extensions 2 adds support for packed integer and double-precision floats in the 128-bit XMM vector registers. It is always supported on x86-64, and supported on every x86 CPU from 2003 or later.

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions, and the SSE tag wiki for other SSE- and SSE2-related resources.


SSE2 is one of the SSE family of x86 instruction-set extensions.

SSE2 adds support for double-precision floating point, and packed-integer (8bit to 64bit elements) in XMM registers. It is baseline in x86-64, so 64bit code can always assume SSE2 support, without having to check. 32bit code could still be run on a CPU from before 2003 (Athlon XP or Pentium III) that didn't support SSE2, but this is unlikely for most newly-written code. (And so an MMX or original-SSE fallback is not worth writing.)
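As a rough illustration of the "check only in 32bit code" rule above, here is a minimal C sketch. It assumes GCC or Clang, since `__builtin_cpu_init` and `__builtin_cpu_supports` are compiler-specific builtins; other compilers would need their own CPUID wrapper. The function name `cpu_has_sse2` is made up for this example.

```c
/* Returns 1 if SSE2 can be used on the running CPU, 0 otherwise.
 * Hypothetical helper; assumes GCC or Clang for the 32-bit branch. */
int cpu_has_sse2(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    /* SSE2 is part of the x86-64 baseline: no runtime check needed. */
    return 1;
#elif defined(__i386__) && (defined(__GNUC__) || defined(__clang__))
    /* 32-bit x86: ask the CPU at run time via the compiler builtins. */
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    /* Not x86, or a compiler this sketch does not cover. */
    return 0;
#endif
}
```

A 32bit program could call this once at startup and pick between an SSE2 code path and a scalar one.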

Most tasks that benefit from vectors at all can be done fairly efficiently using only instructions up to SSE2. This is fortunate, because widespread support for later SSE versions took time. Use of later SSE extensions typically saves a couple instructions here and there, usually with only minor speed-ups. Notably absent until SSSE3 was PSHUFB, a shuffle whose operation was controlled by elements in a register, rather than a compile-time constant imm8. It can do things that SSE2 can't do efficiently at all.
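To make the imm8 limitation concrete, the sketch below reverses four 32-bit integers with the SSE2 shuffle `_mm_shuffle_epi32` (PSHUFD). The shuffle pattern is baked into the instruction as a compile-time constant, so a different pattern means a different instruction; it cannot be taken from a register at run time the way PSHUFB allows. The function name `reverse4_i32` is only illustrative.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Reverse the order of four 32-bit integers using only SSE2.
 * Hypothetical helper; unaligned loads/stores keep it safe for any pointer. */
void reverse4_i32(const int32_t in[4], int32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* The second argument must be a compile-time constant (imm8).
     * _MM_SHUFFLE(0, 1, 2, 3) selects source elements 3, 2, 1, 0
     * for result elements 0, 1, 2, 3. */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));

    _mm_storeu_si128((__m128i *)out, r);
}
```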

AVX provides 3-operand versions of all SSE2 instructions.

History

Intel introduced SSE2 with their Pentium 4 design, which launched in late 2000.

AMD adopted SSE2 for its 64bit CPU line in 2003/2004. As of 2009, few if any x86 CPUs (at least, in any significant numbers) lack support for the SSE2 instruction set. This makes it extremely attractive on the Windows PC platform: it offers a large feature set that can practically be treated as an omnipresent "minimum requirement". At least in 32bit mode, however, this does not remove the need to check processor features.

More recent instruction sets introduce fewer features, which are often highly specialized. They are also supported inconsistently between manufacturers, and by a significantly smaller share of processors (10-50% in 2009).

SSE2 does not offer instructions for horizontal addition, which are needed for some geometric calculations (e.g. dot product) and complex arithmetic. This functionality has to be emulated with one or more shuffles, which are often not significantly slower than the dedicated instructions added in later extensions.
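A minimal sketch of that shuffle-based emulation in C, using only intrinsics available up to SSE2. The helper names `hsum_ps`, `hsum_epi32` and `dot4` are made up for this example, and this is one common shuffle pattern rather than the only or necessarily the fastest one.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE1 ones) */

/* Horizontal sum of four floats: two shuffles and two adds. */
float hsum_ps(__m128 v)
{
    /* Swap neighbours within each pair: [v1, v0, v3, v2]. */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* Element 0 = v0+v1, element 2 = v2+v3. */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Move the high pair of sums down so v2+v3 lands in element 0. */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Scalar add of the two partial sums. */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

/* Horizontal sum of four 32-bit integers, same idea with PSHUFD. */
int hsum_epi32(__m128i v)
{
    /* Swap the two 64-bit halves and add. */
    __m128i hi64 = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum64 = _mm_add_epi32(v, hi64);
    /* Swap neighbouring 32-bit elements and add again. */
    __m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);
}

/* 4-element dot product: vertical multiply, then horizontal sum. */
float dot4(const float a[4], const float b[4])
{
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return hsum_ps(prod);
}
```

When many dot products are needed, it is usually better to arrange the data so that the sums stay vertical and the horizontal step runs only once at the end.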

275 questions
2
votes
2 answers

Packing and unpacking data for SSE/SSE2 instructions?

I'm trying to learn more about how SSE/SSE2 work: I know that SSE/SSE2 use mmx registers with a size of 128 bit (16 byte) and that usually these registers have 4 float cells where I can store my floats by packing. Before getting the result I should…
Johnny Pauling
2
votes
1 answer

ROS (Robot Operating System) with SSSE3 flag

I started working with ROS lately and got stuck on one problem. I need to use some classes which require SSE2, SSE3 and SSSE3 CPU extensions. I tried to edit the manifest.xml file of my ROS Package like
SolvedForHome
2
votes
6 answers

What's the most efficient way to multiply 4 floats by 4 floats using SSE?

I currently have the following code: float a[4] = { 10, 20, 30, 40 }; float b[4] = { 0.1, 0.1, 0.1, 0.1 }; asm volatile("movups (%0), %%xmm0\n\t" "mulps (%1), %%xmm0\n\t" "movups %%xmm0, (%1)" …
horseyguy
2
votes
1 answer

How to align 16-bit ints for use with SSE intrinsics

I am working with two-dimensional arrays of 16-bit integers defined as int16_t e[MAX_SIZE*MAX_NODE][MAX_SIZE]; int16_t C[MAX_SIZE][MAX_SIZE]; Where Max_SIZE and MAX_NODE are constant values. I'm not a professional programmer, but somehow with the…
SMir
2
votes
1 answer

Sum of the four 32bits elements of a _m128 vector

I'm using intrinsics to optimize a program of mine. But now I would like to sum the four elements that are in a __m128 vector in order to compare the result to a floating point value. For instance, let's say I have this 128 bits vector : {a, b c,…
Merkil
1
vote
1 answer

Issue parallelizing this c code with openmp

How can I parallelize this code with OpenMP?. The result I get is not correct. I try to use temporary variables p1aux, p2aux, and psumaux because the Reduction clause cannot be used with pointers or intrinsic functions. But as I said the result…
1
vote
1 answer

Translate GCC inline asm (SSE2, SSSE3) to MSVC intrinsics

I'm borrowing some code from the VLC to my video player, written in MSVC++ 2010, and cannot find equivalent to its inline asms, related to extracting decoded video frame from the GPU memory to the conventional memory. Particularly, I don't know how…
wl2776
1
vote
0 answers

Xcode4: can't compile sse2 assemble via nasm for x86_64 architecture

Currently I'm switching a codec project from 32bit to 64 bit architecture in Xcode4, the *cpp part files are compiling well, but the .asm (all sse2 assemble) files seems can't be compiled into object files at all via nasm (it's OK in 32 bit…
Horace
1
vote
1 answer

Matrix multiplication using simd produces incorrect results when filled with floating point values

I wanted to create a matrix multiplication with simd. Everything is fine, when matrix is filled with some integers. But there are some issues when my matrices are filled with floating point values. The results are not quite correct. Here is my…
Arheus
1
vote
0 answers

Sum of bytes in an __m128 register

I am trying to find the sum of all bytes in an __m128 register using SSE and SSE2. So far what I have is __m128i sum = _mm_sad_epu8(bytes, _mm_setzero_si128()); return _mm_cvtsi128_si32(sum) + _mm_extract_epi16(sum, 4); where bytes is the __m128…
1
vote
1 answer

In SIMD, SSE2,many instructions named as "_mm_set_epi8","_mm_cmpgt_epi8 " and so on,what does "mm" "epi" mean?

I see many instruction with shorthand such as "_mm_and_si128". I want to know what does the "mm" mean.
dongwang
1
vote
0 answers

MOVDQU vs MOVDQA Instruction (x86/x64 assembly) better insights

First of all, let's start with the following links about MOVDQA and MOVDQU which are already in this community: MOVDQU instruction + page boundary MOVUPD vs. MOVDQU (x86/x64 assembly) Difference between MOVDQA and MOVAPS x86 instructions? Assembly…
RajibTheKing
1
vote
0 answers

Efficiently find indices of 1-bits in large array, using SIMD

If I have very large array of bytes and want to find indices of all 1-bits, indices counting from leftmost bit, how do I do this efficiently, probably using SIMD. (For finding the first 1-bit, see an earlier question. This question produces an…
Arty
1
vote
0 answers

C++ std::countr_zero() in SIMD 128/256/512 (find position of least significant 1 bit in 128/256/512-bit number)

If I have 128 or 256 or 512 bit memory region, how can I find number of consecutive zero bits starting from least significant bit (left-most byte). I can do: Try it online! #include int CountRZero512(uint64_t const * ptr) { for (int i = 0;…
Arty
1
vote
1 answer

Access value from __m128 in rust by index

I have seen that it's rather simple in C to access values in a __m128 register by index. However, it is not possible to do that in rust. How can I access those values? Concretely, I am calculating four values at once, then I compare them using…
Marc