Questions tagged [sse2]

x86 Streaming SIMD Extensions 2 adds support for packed integer and double-precision floats in the 128-bit XMM vector registers. It is always supported on x86-64, and supported on every x86 CPU from 2003 or later.

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions, and the SSE tag wiki for other SSE- and SSE2-related resources.

SSE2 is one of the SSE family of x86 instruction-set extensions.

SSE2 adds support for double-precision floating point and packed integers (8-bit to 64-bit elements) in XMM registers. It is baseline in x86-64, so 64-bit code can always assume SSE2 support without having to check. 32-bit code could still be run on a CPU from before 2003 (Athlon XP or Pentium III) that doesn't support SSE2, but this is unlikely for most newly written code, so an MMX or original-SSE fallback is rarely worth writing.
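As a rough sketch of the "check or assume" rule above: the hypothetical helper `have_sse2` below (not part of any library) returns true unconditionally on x86-64 and otherwise falls back to the GCC/Clang builtin `__builtin_cpu_supports`. It assumes a GCC-compatible compiler for the 32-bit path; other compilers would need their own CPUID query.

```c
#include <stdbool.h>

/* Hypothetical helper: reports whether SSE2 instructions may be used
 * on the CPU this code is running on. */
bool have_sse2(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    /* SSE2 is part of the x86-64 baseline, so no runtime check is needed. */
    return true;
#elif defined(__i386__) && (defined(__GNUC__) || defined(__clang__))
    /* 32-bit x86 with GCC or Clang: CPUID-based runtime check. */
    return __builtin_cpu_supports("sse2") != 0;
#else
    /* Unknown target or compiler: be conservative. */
    return false;
#endif
}
```

A 32-bit program would typically call this once at startup and choose between an SSE2 code path and a scalar one.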

Most tasks that benefit from vectors at all can be done fairly efficiently using only instructions up to SSE2. This is fortunate, because widespread support for later SSE versions took time. Later SSE extensions typically save a couple of instructions here and there, usually with only minor speed-ups. Notably absent until SSSE3 was PSHUFB, a shuffle whose operation is controlled by elements in a register rather than by a compile-time constant imm8. It can do things that SSE2 can't do efficiently at all.
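For illustration, here is a minimal sketch of an SSE2 shuffle driven by a compile-time constant. The function name `reverse4_epi32` is made up for this example; it uses `_mm_shuffle_epi32` (PSHUFD), whose control is an immediate, so the permutation cannot depend on run-time data the way a PSHUFB control vector can.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical example: reverse the order of four 32-bit integers.
 * The shuffle pattern is an immediate and is fixed at compile time. */
void reverse4_epi32(const int *in, int *out)
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);  /* unaligned load */
    /* Result element 0 takes source element 3, element 1 takes 2, and so on. */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_si128((__m128i *)out, r);               /* unaligned store */
}
```

Because the pattern is baked into the instruction, a different permutation needs a different instruction, which is the limitation PSHUFB later removed for byte elements.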

AVX provides 3-operand versions of all SSE2 instructions.

History

Intel introduced SSE2 with their Pentium 4 design, which launched in late 2000.

AMD adopted SSE2 for its 64-bit CPU line in 2003/2004. As of 2009, few if any x86 CPUs (at least, in any significant numbers) lack the SSE2 instruction set. This makes it extremely attractive on the Windows PC platform: it offers a large feature set that can practically be treated as an omnipresent "minimum requirement". In 32-bit mode, however, this does not remove the need to check processor features.

More recent instruction sets introduce fewer features, which are often highly specialized, and are at the same time supported inconsistently between manufacturers by a significantly smaller share of processors (10-50% in 2009).

SSE2 does not offer instructions for horizontal addition, which is needed for some geometric calculations (e.g. dot product) and complex arithmetic. This functionality has to be emulated with one or several shuffles, which are often not significantly slower than the dedicated instructions added in later extensions (such as SSE3's HADDPS).
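The shuffle-based emulation described above might look like the following sketch. The function names `hsum_ps`, `hsum_pd` and `dot4` are invented for this example; only intrinsics available up to SSE2 are used, and this is one common shuffle sequence rather than the only possible one.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */

/* Horizontal sum of four floats using shuffles instead of HADDPS. */
float hsum_ps(__m128 v)
{
    /* Swap neighbouring elements: [v1, v0, v3, v2]. */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* Pairwise sums: [v0+v1, v0+v1, v2+v3, v2+v3]. */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Move the high pair of sums down so v2+v3 lands in element 0. */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Element 0 now holds (v0+v1) + (v2+v3). */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

/* Horizontal sum of two doubles. */
double hsum_pd(__m128d v)
{
    __m128d high = _mm_unpackhi_pd(v, v);       /* [v1, v1] */
    return _mm_cvtsd_f64(_mm_add_sd(v, high));  /* v0 + v1 */
}

/* Dot product of two 4-element float arrays (unaligned loads). */
float dot4(const float *a, const float *b)
{
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return hsum_ps(prod);
}
```

In a loop over many vectors it is usually better to accumulate vertically with `_mm_add_ps` and do a single horizontal sum at the end, so the shuffles stay off the hot path.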

275 questions

2 answers

Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?

I want to calculate y = ax + b, where x and y is a pixel value [i.e, byte with value range is 0~255], while a and b is a float Since I need to apply this formula for each pixel in image, in addition, a and b is different for different pixel. Direct…

asked Aug 29 '15 at 08:26

Edwin

2 answers

C/C++: -msse and -msse2 Flags do not have any effect on the binaries?

I'm just playing around with gcc (g++) and the compilerflags -msse and -msse2. I have a little test program which looks like that: #include int main(int argc, char **argv) { float a = 12558.5688; float b = 6.5585; float…

c++ gcc sse sse2

asked Apr 26 '15 at 07:50

Fabian

2 answers

What is the difference between these 128bit SIMD xor operations

Intel provides several SIMD commands, which seems all performing bitwise XOR on 128-bit data: _mm_xor_pd(__m128d, __m128d) _mm_xor_ps(__m128, __m128) _mm_xor_si128(__m128i, __m128i) Isn't bitwise operations only operate on bit streams? Why there…

simd sse intrinsics sse2

asked Mar 18 '15 at 13:04

jiandingzhe

2 answers

Is SSE2 signed integer overflow undefined?

Signed integer overflow is undefined in C and C++. But what about signed integer overflow within the individual fields of an __m128i? In other words, is this behavior defined in the Intel standards? #include #include…

c language-lawyer undefined-behavior sse2

asked Oct 22 '14 at 21:02

Myria

3 answers

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

The code i want to optimize is basically a simple but large arithmetic formula, it should be fairly simple to analyze the code automatically to compute the independent multiplications/additions in parallel, but i read that autovectorization only…

c x86 simd intrinsics sse2

asked Sep 19 '12 at 13:13

the_toast

3 answers

How to simulate pcmpgtq on sse2?

PCMPGTQ was introduced in sse4.2, and it provides a greater than signed comparison for 64 bit numbers that yields a mask. How does one support this functionality on instructions sets predating sse4.2? Update: This same question applies to ARMv7 with…

assembly sse simd sse2 sse4

asked Dec 06 '20 at 08:36

Dan Weber

2 answers

How to rotate packed quadwords in xmm register?

Given an 128-bit xmm register that is packed with two quadwords (i.e. two 64-bit integers): ╭──────────────────┬──────────────────╮ xmm0 │ ffeeddccbbaa9988 │ 7766554433221100 │ ╰──────────────────┴──────────────────╯ How can i perform a…

x86 sse2

asked Dec 06 '18 at 02:15

Ian Boyd

0 answers

GCC std::sin vectorization bug?

The next code (with -O3 -ffast-math): #include float a[4]; void sin1() { for(unsigned i = 0; i < 4; i++) a[i] = sinf(a[i]); } Compiles vectorized version of sinf (_ZGVbN4v_sinf): sin1(): sub rsp, 8 movaps xmm0,…

c++ gcc vectorization sse2

asked Aug 03 '17 at 08:31

Diego91b

1 answer

SSE2 optimization for converting from RGB565 to RGB888 (no alpha channel)

I am trying to convert a buffer of bits, from 16 bits per pixel: RGB 565: rrrrrggggggbbbb|rrr.. to 24 bits per pixel: RGB888 rrrrrrrrgggggggbbbbbbb|rrr... I have a quite optimized algorithm but I am quite curious of how can this be done using…

c++ simd intrinsics sse2 color-conversion

asked Jun 29 '17 at 08:14

JoniPichoni

4 answers

SIMD code runs slower than scalar code

elma and elmc are both unsigned long arrays. So are res1 and res2. unsigned long simdstore[2]; __m128i *p, simda, simdb, simdc; p = (__m128i *) simdstore; for (i = 0; i < _polylen; i++) { u1 = (elma[i] >> l) & 15; u2 = (elmc[i]…

c optimization sse simd sse2

asked Dec 09 '10 at 04:47

anup

2 answers

sse2 vectorization and virtual machines

I am considering vectorizing some floor() calls using sse2 intrinsics, then measuring the performance gain. But ultimately the binary is going to be run on a virtual machine which I have no access to. I don't really know how a VM works. Is a binary…

c++ virtual-machine vectorization sse2

asked Jan 18 '17 at 22:48

ThreeStarProgrammer57

2 answers

The best way to shift a __m128i?

I need to shift a __m128i variable, (say v), by m bits, in such a way that bits move through all of the variable (So, the resulting variable represents v*2^m). What is the best way to do this?! Note that _mm_slli_epi64 shifts v0 and v1 seperately:…

c bitwise-operators sse bit-shift sse2

asked Dec 27 '15 at 07:01

user0

1 answer

C: x86 Intel Intrinsics usage of _mm_log2_ps() -> error: incompatible type 'int'?

I'm trying to apply the log2 onto a __m128 variable. Like this: #include int main (void) { __m128 two_v = {2.0, 2.0, 2.0, 2.0}; __m128 log2_v = _mm_log2_ps(two_v); // log_2 := log(2) return 0; } Trying to compile this…

c++ compiler-errors sse intrinsics sse2

asked Nov 21 '13 at 14:34

tmuecksch

1 answer

Exception in statically linked msvcrt using Visual Studio 2012

There seems to be a problem in the statically linked version of VS2012. Starting a console application on an old system leads to an exception, whenever streams are used, although new systems causes no trouble at all. To reproduce this error it is…

c++ visual-studio-2012 cpu-architecture msvcrt sse2

asked Sep 10 '13 at 21:07

user2766445

1 answer

optimize unaligned SSE2/AVX2 XOR

In my code I have to handle "unmasking" of websocket packets, which essentially means XOR'ing unaligned data of arbitrary length. Thanks to SO (Websocket data unmasking / multi byte xor) I already have found out how to (hopefully) speed this up…

c optimization memory-alignment sse2 avx2

asked Jul 24 '13 at 16:26

griffin
