Questions tagged [sse2]

x86 Streaming SIMD Extensions 2 adds support for packed integer and double-precision floats in the 128-bit XMM vector registers. It is always supported on x86-64, and supported on every x86 CPU from 2003 or later.

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions, and the SSE tag wiki for other SSE- and SSE2-related resources.


SSE2 is one of the SSE family of x86 instruction-set extensions.

SSE2 adds support for double-precision floating point, and packed-integer (8bit to 64bit elements) in XMM registers. It is baseline in x86-64, so 64bit code can always assume SSE2 support, without having to check. 32bit code could still be run on a CPU from before 2003 (Athlon XP or Pentium III) that didn't support SSE2, but this is unlikely for most newly-written code. (And so an MMX or original-SSE fallback is not worth writing.)
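As a minimal sketch of the two capabilities described above, the following C functions use SSE2 intrinsics from `<emmintrin.h>` to add four 32-bit integers at once and to multiply two doubles at once. The function names `add4_i32` and `mul2_f64` are made up for illustration, and unaligned loads and stores are used so the caller does not have to align its buffers.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: add four pairs of 32-bit integers with one PADDD. */
static void add4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned 128-bit load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}

/* Hypothetical helper: multiply two pairs of doubles with one MULPD. */
static void mul2_f64(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_mul_pd(va, vb));
}
```

On x86-64 this should compile without extra flags, since SSE2 is baseline there; 32bit builds with GCC or Clang typically need `-msse2`.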

Most tasks that benefit from vectors at all can be done fairly efficiently using only instructions up to SSE2. This is fortunate, because widespread support for later SSE versions took time. Use of later SSE extensions typically saves a couple instructions here and there, usually with only minor speed-ups. Notably absent until SSSE3 was PSHUFB, a shuffle whose operation was controlled by elements in a register, rather than a compile-time constant imm8. It can do things that SSE2 can't do efficiently at all.
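To illustrate what "compile-time constant imm8" means in practice, here is a small sketch of an SSE2 shuffle whose pattern is fixed when the code is compiled. The helper name `reverse4_i32` is hypothetical; `_mm_shuffle_epi32` (PSHUFD) and the `_MM_SHUFFLE` macro are standard intrinsics. A shuffle whose pattern is only known at run time would need PSHUFB from SSSE3 instead.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */

/* Hypothetical helper: reverse the order of the four 32-bit lanes.
   The shuffle pattern is an immediate, so it must be a compile-time constant. */
static __m128i reverse4_i32(__m128i v)
{
    /* Result lane 0 takes source lane 3, lane 1 takes lane 2, and so on. */
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
}
```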

AVX provides 3-operand versions of all SSE2 instructions.

History

Intel introduced SSE2 with their Pentium 4 design in 2001.

SSE2 was adopted by AMD for its 64bit CPU line in 2003/2004. As of 2009, few if any x86 CPUs still in significant use lack the SSE2 instruction set. That makes it extremely attractive on the Windows PC platform: it offers a large feature set that can practically be treated as an omnipresent "minimum requirement". In 32bit mode, however, this does not remove the need to check processor features.
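One possible way to check for SSE2 at run time in 32bit code is sketched below. It assumes GCC or Clang targeting x86, where `__builtin_cpu_init` and `__builtin_cpu_supports` are available; other compilers would need a different mechanism, such as querying CPUID directly. The function name `cpu_has_sse2` is made up for this example.

```c
#include <stdbool.h>

/* Hypothetical helper: returns true if the running CPU reports SSE2 support.
   Assumes GCC or Clang on x86; these builtins are compiler-specific. */
static bool cpu_has_sse2(void)
{
    __builtin_cpu_init();  /* harmless to call more than once */
    return __builtin_cpu_supports("sse2") != 0;
}
```

A program could call this once at startup and select an SSE2 or scalar code path accordingly.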

More recent instruction sets introduce fewer features, which are often highly specialized, and are supported inconsistently between manufacturers and by a significantly smaller share of processors (10-50% in 2009).

SSE2 does not offer instructions for horizontal addition, which some geometric calculations (e.g. dot product) and complex arithmetic need. This functionality has to be emulated with one or several shuffles, which are often not significantly slower than the dedicated instructions added in later extensions (such as HADDPS and HADDPD in SSE3).
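The shuffle-based emulation can be sketched as follows, using only SSE2 intrinsics. The names `hsum_epi32_sse2` and `dot2_pd_sse2` are hypothetical, and this is one common shuffle-and-add pattern rather than the only or necessarily the fastest one.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */

/* Hypothetical helper: sum the four 32-bit lanes of v with shuffles and adds. */
static int hsum_epi32_sse2(__m128i v)
{
    /* Swap the 64-bit halves so lanes (2,3) line up with lanes (0,1). */
    __m128i hi64  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum64 = _mm_add_epi32(v, hi64);      /* lane0 = v0+v2, lane1 = v1+v3 */
    /* Swap adjacent 32-bit lanes so lane 1 lines up with lane 0. */
    __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i sum32 = _mm_add_epi32(sum64, hi32);  /* lane0 = v0+v1+v2+v3 */
    return _mm_cvtsi128_si32(sum32);             /* extract lane 0 */
}

/* Hypothetical helper: dot product of two 2-element double vectors. */
static double dot2_pd_sse2(__m128d a, __m128d b)
{
    __m128d prod = _mm_mul_pd(a, b);             /* {a0*b0, a1*b1} */
    __m128d high = _mm_unpackhi_pd(prod, prod);  /* {a1*b1, a1*b1} */
    return _mm_cvtsd_f64(_mm_add_sd(prod, high));
}
```

When summing a long array, the horizontal step is usually done once at the end, after accumulating vertically in a vector register, so its cost is small relative to the loop.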

275 questions
12
votes
3 answers

SSE instruction set not enabled

I am getting trouble with this error: "SSE instruction set not enabled". How I can figure this out? I have ACER i7, Ubuntu 11.10, please any one can help me? Any help will be appreciated! Also running: sudo cat /proc/cpuinfo | grep…
ksolid
  • 151
  • 1
  • 2
  • 5
12
votes
3 answers

Emulating shifts on 32 bytes with AVX

I am migrating vectorized code written using SSE2 intrinsics to AVX2 intrinsics. Much to my disappointment, I discover that the shift instructions _mm256_slli_si256 and _mm256_srli_si256 operate only on the two halves of the AVX registers separately…
user1196549
12
votes
4 answers

Fast counting the number of set bits in __m128i register

I should count the number of set bits of a __m128i register. In particular, I should write two functions that are able to count the number of bits of the register, using the following ways. The total number of set bits of the register. The number…
enzom83
  • 8,080
  • 10
  • 68
  • 114
11
votes
1 answer

SSE runs slow after using AVX

I have a strange issue with some SSE2 and AVX code I have been working on. I am building my application using GCC with runtime CPU feature detection. The object files are built with separate flags for each CPU feature, for example: g++ -c -o…
Geoffrey
  • 10,843
  • 3
  • 33
  • 46
11
votes
4 answers

Fast counting the number of equal bytes between two arrays

I wrote the function int compare_16bytes(__m128i lhs, __m128i rhs) in order to compare two 16 byte numbers using SSE instructions: this function returns how many bytes are equal after performing the comparison. Now I would like to use the above…
enzom83
  • 8,080
  • 10
  • 68
  • 114
11
votes
1 answer

Is it possible to use SSE and SSE2 to make a 128-bit wide integer?

I'm looking to understand SSE2's capabilities a little more, and would like to know if one could make a 128-bit wide integer that supports addition, subtraction, XOR and multiplication?
Erkling
  • 509
  • 4
  • 16
10
votes
1 answer

Best way to load/store from/to general purpose registers to/from xmm/ymm register

What is the best way to load and store general purpose registers to/from SIMD registers? So far I have been using the stack as a temporary. For example, mov [rsp + 0x00], r8 mov [rsp + 0x08], r9 mov [rsp + 0x10], r10 mov [rsp + 0x18], r11 vmovdqa ymm0,…
Yan Zhou
  • 2,709
  • 2
  • 22
  • 37
10
votes
3 answers

numpy calling sse2 via ctypes

In brief, I am trying to call into a shared library from python, more specifically, from numpy. The shared library is implemented in C using sse2 instructions. Enabling optimisation, i.e. building the library with -O2 or -O1, I am facing strange…
Daniel
  • 101
  • 4
9
votes
4 answers

How can I set __m128i without using of any SSE instruction?

I have many functions which use the same constant __m128i values. For example: const __m128i K8 = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16); const __m128i K16 = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8); const __m128i K32 =…
Akira
  • 213
  • 2
  • 11
9
votes
1 answer

Why does V8 in Node.js 0.12.0 release require SSE2 CPU instructions?

Trying to upgrade Node.js from 0.10.x to 0.12.0. The first thing noticed is that I am getting an error that SSE2 instructions are not supported by my CPU (indeed they are not). Tried to compile Node.js from sources but it failed for the same reason.…
Pavel Lobodinský
  • 1,028
  • 1
  • 12
  • 25
9
votes
2 answers

SSE2 instruction to load integers in reverse order

Is there any SSE2 instruction to load a 128 bit int vector register from an int buffer, in reverse order ?
Andy
  • 157
  • 1
  • 6
9
votes
1 answer

SSE instructions to add all elements of an array

I am new to SSE2 instructions. I have found an instruction _mm_add_epi8 which can add two array elements. But I want an SSE instruction which can add all elements of an array. I was trying to develop this concept using this code: #include…
geeta
  • 689
  • 3
  • 17
  • 33
9
votes
1 answer

What does the following assembly instruction do addsd -8(%rbp), %xmm0?

I'm trying to figure out what the assembly instruction actually does addsd -8(%rbp), %xmm0 I know that it's a floating point addition on an x86-64 machine with SSE2. Also, I know that %xmm0 is a register. However, what I'm not sure of is what…
owagh
  • 3,428
  • 2
  • 31
  • 53
8
votes
2 answers

SSE2 code optimization

I am using SSE2 intrinsics to optimize the bottlenecks of my application and have the following question: ddata = _mm_xor_si128(_mm_xor_si128( _mm_sll_epi32(xdata, 0x7u), _mm_srl_epi32(tdata, 0x19u)), xdata); On Microsoft C++ Compiler this…
Yippie-Ki-Yay
  • 22,026
  • 26
  • 90
  • 148
8
votes
1 answer

Fastest way to perform AVX inner product operations with mixed (float, double) input vectors

I need to build a single-precision floating-point inner product routine for mixed single/double-precision floating-point vectors, exploiting the AVX instruction set for SIMD registers with 256 bits. Problem: one input vector is float (x), while the…
Liotro78
  • 111
  • 5