Questions tagged [sse2]

x86 Streaming SIMD Extensions 2 adds support for packed integers and double-precision floats in the 128-bit XMM vector registers. It is always supported on x86-64, and supported on every x86 CPU from 2003 or later.

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions, and the SSE tag wiki for other SSE- and SSE2-related resources.


SSE2 is one of the SSE family of x86 instruction-set extensions.

SSE2 adds support for double-precision floating point, and packed-integer (8bit to 64bit elements) in XMM registers. It is baseline in x86-64, so 64bit code can always assume SSE2 support, without having to check. 32bit code could still be run on a CPU from before 2003 (Athlon XP or Pentium III) that didn't support SSE2, but this is unlikely for most newly-written code. (And so an MMX or original-SSE fallback is not worth writing.)
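For 32bit code that does want to check, a minimal sketch of a runtime test might look like the following. The function name `have_sse2` is hypothetical, and the check assumes GCC or Clang, whose `__builtin_cpu_supports` builtin is used here; other compilers (e.g. MSVC with `__cpuid`) would need a different path.

```c
#include <stdbool.h>

/* Hypothetical helper: reports whether SSE2 can be used on this CPU.
   Assumes GCC or Clang for the 32-bit x86 branch. */
bool have_sse2(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    return true;                 /* SSE2 is baseline on x86-64: no check needed */
#elif defined(__i386__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();        /* harmless if the CPU info is already initialised */
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;                /* not x86, or unknown compiler: assume no SSE2 */
#endif
}
```

A caller would typically evaluate this once at start-up and select between an SSE2 code path and a scalar one.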

Most tasks that benefit from vectors at all can be done fairly efficiently using only instructions up to SSE2. This is fortunate, because widespread support for later SSE versions took time. Use of later SSE extensions typically saves a couple instructions here and there, usually with only minor speed-ups. Notably absent until SSSE3 was PSHUFB, a shuffle whose operation was controlled by elements in a register, rather than a compile-time constant imm8. It can do things that SSE2 can't do efficiently at all.
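To illustrate the compile-time-constant restriction, here is a small sketch using the SSE2 `PSHUFD` shuffle through the `_mm_shuffle_epi32` intrinsic. The function name `reverse4_epi32` is made up for this example. The shuffle pattern is an immediate baked into the instruction, so it cannot depend on run-time data the way a `PSHUFB` control vector can.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical example: reverse the order of four 32-bit integers.
   The second argument of _mm_shuffle_epi32 must be a compile-time constant. */
void reverse4_epi32(const int32_t in[4], int32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);   /* unaligned load */
    /* _MM_SHUFFLE(0, 1, 2, 3): result element 0 takes source element 3,
       element 1 takes source 2, element 2 takes source 1, element 3 takes source 0 */
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_si128((__m128i *)out, v);                /* unaligned store */
}
```

Choosing the permutation at run time with only SSE2 generally means branching between several such fixed shuffles or going through memory, which is why a variable shuffle was such a notable addition.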

AVX provides non-destructive 3-operand (VEX-encoded) versions of all SSE2 instructions.

History

Intel introduced SSE2 with their Pentium 4 design in 2001.

SSE2 was adopted by AMD for its 64bit CPU line in 2003/2004. As of 2009 there remain few if any x86 CPUs (at least, in any significant numbers) that do not support the SSE2 instruction set. This makes it extremely attractive on the Windows PC platform: it offers a large feature set that can practically be assumed as an omnipresent "minimum requirement". At least in 32bit mode, however, this does not remove the need to check processor features.

More recent instruction sets introduce fewer features, which are often highly specialized. They are also supported inconsistently between manufacturers, and by a significantly smaller share of processors (10-50% in 2009).

SSE2 does not offer instructions for horizontal addition, which are needed for some geometric calculations (e.g. dot product) and complex arithmetic. This functionality has to be emulated with one or more shuffles, which are often not significantly slower than the dedicated instructions added in later extensions.
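As a rough sketch of the shuffle-based emulation, the following computes a double-precision dot product using only SSE2 intrinsics. The function names `hsum_pd` and `dot_pd` are invented for this example, and unaligned loads are assumed so that the inputs need no particular alignment.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Horizontal sum of the two doubles in v, using one shuffle (UNPCKHPD)
   plus a scalar add instead of a dedicated horizontal-add instruction. */
double hsum_pd(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);        /* hi = [v1, v1] */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));   /* low element: v0 + v1 */
}

/* Dot product of two double arrays of length n. */
double dot_pd(const double *a, const double *b, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;

    /* Vertical multiply-accumulate, two elements per iteration. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
    }

    /* A single horizontal step after the loop. */
    double sum = hsum_pd(acc);

    /* Scalar tail for an odd n. */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

The horizontal step runs once, outside the loop, so its cost is usually negligible next to the vertical multiply-adds.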

275 questions
3
votes
1 answer

Using inline assembly to speed up Matrix multiplication

I have been trying to speed up matrix-matrix multiplication C <- C + alpha * A * B via register blocking, SSE2 vectorization and L1 cache blocking (note that I have specially chosen the transpose setting op(A)=A and op(B)=B). After some effort my…
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
3
votes
2 answers

How to efficiently add two vectors in C++

Suppose I have two vectors a and b, stored as a vector. I want to make a += b or a +=b * k, where k is a number. I can for sure do the following, while (size--) { (*a++) += (*b++) * k; } But what are the possible ways to easily leverage SIMD…
Nan Hua
  • 3,414
  • 3
  • 17
  • 24
3
votes
1 answer

Bus error when executing `emms` MMX instruction

I'm working on a port of some software with inline assembly because we took a few bug reports from a Debian maintainer under X32. The code is fine under both X86 and X64. We're catching a bus error on the emms instruction: ... 0x005520fd…
jww
  • 97,681
  • 90
  • 411
  • 885
3
votes
1 answer

SSE2 Saturated Arithmetic

I'm writing some audio processing software and I need to know how to do saturated arithmetic with SSE2 double-precision instructions. My values need to be normalized between -1 and 1. Is there a clever way to do this with SSE2 intrinsic or do I need…
Caleb Merchant
  • 289
  • 1
  • 5
  • 16
3
votes
1 answer

SIMD performance on rewriting OpenCV dilate

I am trying to rewrite the OpenCV dilate function to practice SIMD programming. For simplicity, only non-separable case is considered. Much of the code looks like the OpenCV version. The result, however, shows that OpenCV is more than 10 times…
beaver
  • 550
  • 1
  • 9
  • 23
3
votes
1 answer

Optimizing RGB565 to RGB888 conversions with SSE2

I'm trying to optimize pixel depth conversion from 565 to 888 using SSE2 with the basic formula: col8 = col5 << 3 | col5 >> 2 col8 = col6 << 2 | col6 >> 4 I take two 2x565 128-bit vectors and I'm outputing 3x888 128-bit vectors. After some masking,…
kyku
  • 5,892
  • 4
  • 43
  • 51
3
votes
1 answer

#error “SSE2 instruction set not enabled” when installing scikit-bio via pip

I want to install the python library scikit-bio via pip using following command: sudo pip install scikit-bio on my system: uname -a Linux grassgis 3.2.0-69-generic-pae #103-Ubuntu SMP Tue Sep 2 05:15:53 UTC 2014 i686 i686 i386 GNU/Linux However…
Johannes
  • 1,024
  • 13
  • 32
3
votes
1 answer

Intel intrinsics support for Atom cloverview processor

I have an application which was designed for Sandbridge processors using SSE to AVX, now I want the same application to run on Atom Processors. I was recently browsing net for intrinsic support for Atom cloverview processors. Every where it mentions…
Harrisson
  • 255
  • 2
  • 21
3
votes
1 answer

SIMD SSE2 __m128i contains 4 int32_t how to quickly find each integer that bigger or small than 0

I used SIMD to do an arithmetic operation, the result is in a __m128i variable which contains 4 x int32_t. I suspect the first two int32_t values in the result are >=0 and the last two values are <=0. How could I quickly find out ? __m128i result…
Lucien
  • 59
  • 3
3
votes
1 answer

Add the upper and lower 64-bits of a 128-bit xmm register

I have two packed quadword integers in xmm0 and I need to add them together and store the result in a memory location. I can guarantee that the value of the each integer is less than 2^15. Right now, I'm doing the following: int temp; .... …
Jacob
  • 34,255
  • 14
  • 110
  • 165
3
votes
0 answers

pextrd vs psrldp+movd vs others, Which is better for extracting one element from?

I need implement a vpgatherdd-like mechanism without AVX2. Say, I have 4 i32 offset packed in xmm0. I will need to extract each element in xmm0, to do the mov reg, [base + offset] job. The problem is that how should I extract the elements? There is…
BlueWanderer
  • 2,671
  • 2
  • 21
  • 36
3
votes
2 answers

How to process a 24-bit 3 channel color image with SSE2/SSE3/SSE4?

I just started to use SS2 optimization of image processing, but for the 3 channel 24 bit color images have no idea. My pix data arranged by BGR BGR BGR ... ,unsigned char 8-bi, so if I want to implement the Color2Gray with SSE2/SSE3/SSE4's…
tocky
  • 107
  • 1
  • 7
2
votes
2 answers

Alternative to manual fix-up of sse2 data alignement on a 16-byte boundary

Is there an alternative to the following manual fix-up: // excerpt adapted from SIMDTest in // http://www.mccauslandcenter.sc.edu/mricro/obsolete/graphics/simdtest.zip // var lAdblRAp, lArraySz, lAdblRA, Doublep: LongInt; begin // ... …
menjaraz
  • 7,551
  • 4
  • 41
  • 81
2
votes
1 answer

md5 vectorized sse* && avx

I am looking for information on the implementation of md5 algorithm using vectorization. I am interested in the details of SSE* and the AVX instructions.Are there any ready-made library with support for vectorization?
DrEvil35
  • 127
  • 1
  • 8
2
votes
1 answer

Migrate SSE2 to Arm NEON intrinsincs

I have the following code in SSE2 intrinsincs. It processes input from a Kinect. __m128i md = _mm_setr_epi16((r0<<3) | (r1>>5), ((r1<<6) | (r2>>2) ), ((r2<<9) | (r3<<1) | (r4>>7) ), ((r4<<4) | (r5>>4) ), ((r5<<7) | (r6>>1) ),((r6<<10) | (r7<<2)…
Yannis
  • 71
  • 1
  • 2