Questions tagged [sse2]

x86 Streaming SIMD Extensions 2 adds support for packed integer and double-precision floating-point data in the 128-bit XMM vector registers. It is always supported on x86-64, and on nearly every x86 CPU released since 2003.

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions, and the SSE tag wiki for other SSE- and SSE2-related resources.


SSE2 is one of the SSE family of x86 instruction-set extensions.

SSE2 adds support for double-precision floating point and packed integers (8-bit to 64-bit elements) in XMM registers. It is baseline in x86-64, so 64-bit code can always assume SSE2 support without having to check. 32-bit code could still be run on a CPU from before 2003 (Athlon XP or Pentium III) that doesn't support SSE2, but this is unlikely for most newly written code, so an MMX or original-SSE fallback is usually not worth writing.
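As a rough illustration of the two data types mentioned above, here is a minimal sketch using the SSE2 intrinsics from `<emmintrin.h>`. The helper names `add4_i32` and `mul2_f64` are made up for this example; unaligned loads and stores are used so the arrays need no special alignment.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: add four 32-bit integers at once (PADDD). */
static void add4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}

/* Hypothetical helper: multiply two doubles at once (MULPD). */
static void mul2_f64(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_mul_pd(va, vb));
}
```

On x86-64 this compiles without extra flags; for 32-bit targets, GCC and Clang typically need `-msse2`.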

Most tasks that benefit from vectors at all can be done fairly efficiently using only instructions up to SSE2. This is fortunate, because widespread support for later SSE versions took time. Later SSE extensions typically save a couple of instructions here and there, usually with only minor speed-ups. The notable exception is PSHUFB, absent until SSSE3: a shuffle controlled by elements in a register rather than by a compile-time constant imm8. It can do things that SSE2 can't do efficiently at all.
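To make the "compile-time constant imm8" point concrete, here is a small sketch of an SSE2 shuffle. The helper name `reverse4_i32` is an assumption for this example. The shuffle pattern is baked into the instruction via `_MM_SHUFFLE`, so it cannot depend on run-time data the way a PSHUFB control vector can.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Hypothetical helper: reverse the order of four 32-bit elements (PSHUFD). */
static void reverse4_i32(const int32_t *in, int32_t *out)
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* _MM_SHUFFLE(d, c, b, a) selects source element d for result element 3,
       c for element 2, b for element 1 and a for element 0.
       The selector must be a compile-time constant. */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_si128((__m128i *)out, r);
}
```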

AVX provides 3-operand versions of all SSE2 instructions.

History

Intel introduced SSE2 with the Pentium 4, launched in late 2000.

SSE2 was adopted by AMD for its 64-bit CPU line in 2003/2004. As of 2009, few if any x86 CPUs in significant use lack the SSE2 instruction set. This makes it extremely attractive on the Windows PC platform: it offers a large feature set that can practically be treated as an omnipresent "minimum requirement". At least in 32-bit mode, however, this does not remove the need to check processor features.
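One possible way to do such a feature check at run time is sketched below. It assumes GCC or Clang on x86, where `__builtin_cpu_supports` is available; other compilers need a different mechanism (for example the `__cpuid` intrinsic on MSVC). The helper name `cpu_has_sse2` is made up for this example.

```c
/* Hypothetical helper: returns 1 when the running CPU reports SSE2 support,
   0 otherwise. Relies on GCC/Clang builtins and is not portable to other
   compilers. */
static int cpu_has_sse2(void)
{
    __builtin_cpu_init();  /* harmless if the CPU model was already detected */
    return __builtin_cpu_supports("sse2") != 0;
}
```

A 32-bit program could call this once at start-up and select a scalar code path when it returns 0.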

More recent instruction sets introduce fewer features, which are often highly specialized. They are also supported inconsistently between manufacturers, and by a significantly smaller share of processors (10-50% in 2009).

SSE2 does not offer instructions for horizontal addition, which some geometric calculations (e.g. dot product) and complex arithmetic need. This functionality has to be emulated with one or more shuffles, which are often not significantly slower than the dedicated instructions in later extensions.
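The shuffle-based emulation can be sketched as follows, using only SSE2 intrinsics. The helper names `hsum_pd`, `dot2_pd` and `hsum_epi32` are assumptions for this example, and this is one common pattern rather than the only way to do it.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Hypothetical helper: sum of the two doubles in v. */
static double hsum_pd(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);       /* both lanes = high element */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));  /* low lane = v0 + v1 */
}

/* Hypothetical helper: dot product of two 2-element double vectors. */
static double dot2_pd(__m128d a, __m128d b)
{
    return hsum_pd(_mm_mul_pd(a, b));
}

/* Hypothetical helper: sum of the four 32-bit integers in v. */
static int32_t hsum_epi32(__m128i v)
{
    /* Swap the 64-bit halves and add: the low two lanes hold v0+v2, v1+v3. */
    __m128i swap64 = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum64  = _mm_add_epi32(v, swap64);
    /* Swap adjacent 32-bit lanes and add: the low lane holds the full sum. */
    __m128i swap32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    return _mm_cvtsi128_si32(_mm_add_epi32(sum64, swap32));
}
```

In a loop it is usually better to accumulate vertically and perform one horizontal sum at the end, so the shuffles stay out of the hot path.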

275 questions
6
votes
2 answers

Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?

I want to calculate y = ax + b, where x and y are pixel values [i.e., bytes in the range 0~255], while a and b are floats. Since I need to apply this formula to each pixel in the image, and in addition a and b are different for each pixel. Direct…
Edwin
  • 73
  • 5
6
votes
2 answers

C/C++: -msse and -msse2 Flags do not have any effect on the binaries?

I'm just playing around with gcc (g++) and the compilerflags -msse and -msse2. I have a little test program which looks like that: #include int main(int argc, char **argv) { float a = 12558.5688; float b = 6.5585; float…
Fabian
  • 492
  • 6
  • 20
6
votes
2 answers

What is the difference between these 128bit SIMD xor operations

Intel provides several SIMD commands, which seems all performing bitwise XOR on 128-bit data: _mm_xor_pd(__m128d, __m128d) _mm_xor_ps(__m128, __m128) _mm_xor_si128(__m128i, __m128i) Isn't bitwise operations only operate on bit streams? Why there…
jiandingzhe
  • 1,881
  • 15
  • 35
6
votes
2 answers

Is SSE2 signed integer overflow undefined?

Signed integer overflow is undefined in C and C++. But what about signed integer overflow within the individual fields of an __m128i? In other words, is this behavior defined in the Intel standards? #include #include…
Myria
  • 3,372
  • 1
  • 24
  • 42
6
votes
3 answers

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

The code i want to optimize is basically a simple but large arithmetic formula, it should be fairly simple to analyze the code automatically to compute the independent multiplications/additions in parallel, but i read that autovectorization only…
the_toast
  • 175
  • 1
  • 7
5
votes
3 answers

How to simulate pcmpgtq on sse2?

PCMPGTQ was introduced in SSE4.2, and it provides a signed greater-than comparison for 64-bit numbers that yields a mask. How does one support this functionality on instruction sets predating SSE4.2? Update: This same question applies to ARMv7 with…
Dan Weber
  • 401
  • 2
  • 9
5
votes
2 answers

How to rotate packed quadwords in xmm register?

Given an 128-bit xmm register that is packed with two quadwords (i.e. two 64-bit integers): ╭──────────────────┬──────────────────╮ xmm0 │ ffeeddccbbaa9988 │ 7766554433221100 │ ╰──────────────────┴──────────────────╯ How can i perform a…
Ian Boyd
  • 246,734
  • 253
  • 869
  • 1,219
5
votes
0 answers

GCC std::sin vectorization bug?

The next code (with -O3 -ffast-math): #include float a[4]; void sin1() { for(unsigned i = 0; i < 4; i++) a[i] = sinf(a[i]); } Compiles vectorized version of sinf (_ZGVbN4v_sinf): sin1(): sub rsp, 8 movaps xmm0,…
Diego91b
  • 121
  • 4
5
votes
1 answer

SSE2 optimization for converting from RGB565 to RGB888 (no alpha channel)

I am trying to convert a buffer of bits, from 16 bits per pixel: RGB 565: rrrrrggggggbbbb|rrr.. to 24 bits per pixel: RGB888 rrrrrrrrgggggggbbbbbbb|rrr... I have a quite optimized algorithm but I am quite curious of how can this be done using…
JoniPichoni
  • 239
  • 1
  • 11
5
votes
4 answers

SIMD code runs slower than scalar code

elma and elmc are both unsigned long arrays. So are res1 and res2. unsigned long simdstore[2]; __m128i *p, simda, simdb, simdc; p = (__m128i *) simdstore; for (i = 0; i < _polylen; i++) { u1 = (elma[i] >> l) & 15; u2 = (elmc[i]…
anup
  • 529
  • 5
  • 14
5
votes
2 answers

sse2 vectorization and virtual machines

I am considering vectorizing some floor() calls using sse2 intrinsics, then measuring the performance gain. But ultimately the binary is going to be run on a virtual machine which I have no access to. I don't really know how a VM works. Is a binary…
ThreeStarProgrammer57
  • 2,906
  • 2
  • 16
  • 24
5
votes
2 answers

The best way to shift a __m128i?

I need to shift a __m128i variable (say v) by m bits, in such a way that bits move through the whole variable (so the resulting variable represents v*2^m). What is the best way to do this? Note that _mm_slli_epi64 shifts v0 and v1 separately:…
user0
  • 51
  • 1
  • 3
5
votes
1 answer

C: x86 Intel Intrinsics usage of _mm_log2_ps() -> error: incompatible type 'int'?

I'm trying to apply the log2 onto a __m128 variable. Like this: #include int main (void) { __m128 two_v = {2.0, 2.0, 2.0, 2.0}; __m128 log2_v = _mm_log2_ps(two_v); // log_2 := log(2) return 0; } Trying to compile this…
tmuecksch
  • 6,222
  • 6
  • 40
  • 61
5
votes
1 answer

Exception in statically linked msvcrt using Visual Studio 2012

There seems to be a problem in the statically linked version of VS2012. Starting a console application on an old system leads to an exception whenever streams are used, although new systems cause no trouble at all. To reproduce this error it is…
5
votes
1 answer

optimize unaligned SSE2/AVX2 XOR

In my code I have to handle "unmasking" of websocket packets, which essentially means XOR'ing unaligned data of arbitrary length. Thanks to SO (Websocket data unmasking / multi byte xor) I already have found out how to (hopefully) speed this up…
griffin
  • 1,261
  • 8
  • 24