Questions tagged [sse2]

x86 Streaming SIMD Extensions 2 adds support for packed-integer and double-precision floating-point operations in the 128-bit XMM vector registers. It is always supported on x86-64, and on virtually every x86 CPU released since 2003.

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions, and the SSE tag wiki for other SSE- and SSE2-related resources.

SSE2 is one of the SSE family of x86 instruction-set extensions.

SSE2 adds support for double-precision floating point and packed integers (8-bit to 64-bit elements) in XMM registers. It is baseline in x86-64, so 64-bit code can always assume SSE2 support without having to check. 32-bit code could still be run on a CPU from before 2003 that lacks SSE2 (Athlon XP or Pentium III), but this is unlikely to matter for most newly written code, so an MMX or original-SSE fallback is usually not worth writing.
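For 32-bit code that does need to check, a minimal sketch of a runtime test might look like the following. The helper name `cpu_has_sse2` is hypothetical; the sketch assumes either MSVC (`__cpuid` from `<intrin.h>`) or GCC/Clang (`__builtin_cpu_supports`), and conservatively reports no support on anything else.

```c
#include <stdbool.h>

#if defined(_MSC_VER)
#include <intrin.h>   /* __cpuid */
#endif

/* Hypothetical helper: returns true if the running CPU supports SSE2. */
static bool cpu_has_sse2(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    /* SSE2 is baseline on x86-64, so no runtime check is needed. */
    return true;
#elif defined(_MSC_VER)
    int regs[4];                      /* EAX, EBX, ECX, EDX */
    __cpuid(regs, 1);                 /* CPUID leaf 1: feature flags */
    return (regs[3] & (1 << 26)) != 0; /* EDX bit 26 = SSE2 */
#elif defined(__GNUC__) && defined(__i386__)
    return __builtin_cpu_supports("sse2") != 0;
#else
    /* Unknown compiler or non-x86 target: assume no SSE2. */
    return false;
#endif
}
```

A 32-bit program would typically call this once at startup and select an SSE2 or scalar code path through a function pointer, rather than testing inside a hot loop.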

Most tasks that benefit from vectors at all can be done fairly efficiently using only instructions up to SSE2. This is fortunate, because widespread support for later SSE versions took time. Use of later SSE extensions typically saves a couple instructions here and there, usually with only minor speed-ups. Notably absent until SSSE3 was PSHUFB, a shuffle whose operation was controlled by elements in a register, rather than a compile-time constant imm8. It can do things that SSE2 can't do efficiently at all.
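To illustrate the compile-time-constant restriction, here is a small sketch using the SSE2 shuffle `_mm_shuffle_epi32` (PSHUFD). The helper name `reverse4_epi32` is made up for this example. The permutation is baked into the instruction as an imm8, so it cannot vary at run time the way a PSHUFB control vector can.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also provides _MM_SHUFFLE) */
#include <stdint.h>

/* Hypothetical helper: reverse the order of four 32-bit integers. */
static void reverse4_epi32(const int32_t src[4], int32_t dst[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);

    /* The second argument must be a compile-time constant.
       _MM_SHUFFLE(0, 1, 2, 3) selects source elements 3, 2, 1, 0
       for destination elements 0, 1, 2, 3 respectively. */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));

    _mm_storeu_si128((__m128i *)dst, r);
}
```

Calling it with `{10, 20, 30, 40}` stores `{40, 30, 20, 10}`. A permutation chosen at run time would need a branch or table of such constants with SSE2 alone, which is the gap PSHUFB fills.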

AVX provides VEX-encoded 3-operand (non-destructive) versions of the SSE2 instructions that operate on XMM registers.

History

Intel introduced SSE2 with their Pentium 4 design in 2001.

SSE2 was adopted by AMD for its 64-bit CPU line in 2003/2004. As of 2009, few if any x86 CPUs in significant use lack the SSE2 instruction set. This makes it extremely attractive on the Windows PC platform: it offers a large feature set that can practically be treated as a "minimum requirement". In 32-bit mode, however, this does not remove the need to check processor features.

More recent instruction sets introduce fewer features, which are often highly specialized, and are supported inconsistently between manufacturers and by a significantly smaller share of processors (10-50% in 2009).

SSE2 does not offer instructions for horizontal addition, which is needed for some geometric calculations (e.g. dot products) and complex arithmetic. This functionality has to be emulated with one or more shuffles, which are often not significantly slower than the dedicated instructions in later extensions.
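As a sketch of the shuffle-based emulation, the following computes a 4-element double-precision dot product using only SSE2 intrinsics. The function name `dot4_pd` is hypothetical, and unaligned loads are used so that no alignment is assumed for the inputs.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: dot product of two 4-element double arrays. */
static double dot4_pd(const double *a, const double *b)
{
    /* Vertical multiplies: two products per register. */
    __m128d lo = _mm_mul_pd(_mm_loadu_pd(a),     _mm_loadu_pd(b));
    __m128d hi = _mm_mul_pd(_mm_loadu_pd(a + 2), _mm_loadu_pd(b + 2));

    /* Vertical add: sum = [p0 + p2, p1 + p3]. */
    __m128d sum = _mm_add_pd(lo, hi);

    /* Horizontal step: copy the high element down with a shuffle
       (UNPCKHPD), then add it to the low element. */
    __m128d high = _mm_unpackhi_pd(sum, sum);
    return _mm_cvtsd_f64(_mm_add_sd(sum, high));
}
```

Only the final reduction needs a shuffle. For longer arrays the usual approach is to keep accumulating vertically in the loop and do this horizontal step once at the end.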

275 questions

3 answers

SSE multiplication of 2 64-bit integers

How to multiply two 64-bit integers by another 2 64-bit integers? I didn't find any instruction which can do it.

asked Jul 25 '13 at 16:14

Ines Karmani

1 answer

Simulating packusdw functionality with SSE2

I'm implementing a fast x888 -> 565 pixel conversion function in pixman according to the algorithm described by Intel [pdf]. Their code converts x888 -> 555 while I want to convert to 565. Unfortunately, converting to 565 means that the high bit is…

x86 sse intrinsics sse2 sse4

asked Jun 13 '12 at 23:14

mattst88

1 answer

SIMD array add for arbitrary array lengths

I'm learning to use SIMD capabilities by re-writing my personal image processing library using vector intrinsics. One basic function is a simple "array +=," i.e. void arrayAdd(unsigned char* A, unsigned char* B, size_t n) { for(size_t i=0; i <…

c arrays sse simd sse2

asked Apr 16 '12 at 00:57

reve_etrange

1 answer

What is __m128d?

I really can't get what "keyword" like __m128d is in C++. I'm using MSVC, and it says: The __m128d data type, for use with the Streaming SIMD Extensions 2 instructions intrinsics, is defined in <emmintrin.h>. So, is it a Data Type? typedef? If I…

c++ intel intrinsics sse2

asked Dec 13 '18 at 08:19

markzzz

1 answer

What is the difference between loadu_ps and set_ps when using unformatted data?

I have some data that isn't stored as structure of arrays. What is the best practice for loading the data in registers? __m128 _mm_set_ps (float e3, float e2, float e1, float e0) // or __m128 _mm_loadu_ps (float const* mem_addr) With _mm_loadu_ps,…

sse simd intrinsics sse2

asked Mar 13 '18 at 20:50

scx

2 answers

Convert _mm_shuffle_epi32 to C expression for the permutation?

I'm working on a port of SSE2 to NEON. The port is early stage and it's producing incorrect results. Part of the reason for the incorrect results is _mm_shuffle_epi32 and the NEON instructions I selected. The documentation for _mm_shuffle_epi32 is…

x86 x86-64 sse shuffle sse2

asked May 07 '16 at 04:04

jww

4 answers

Detect the availability of SSE/SSE2 instruction set in Visual Studio

How can I check in code whether SSE/SSE2 is enabled or not by the Visual Studio compiler? I have tried #ifdef __SSE__ but it didn't work.

c++ visual-studio x86 sse sse2

asked Sep 01 '13 at 23:38

user2202420

2 answers

Using XMM0 register and memory fetches (C++ code) is twice as fast as ASM only using XMM registers - Why?

I'm trying to implement some inline assembler (in Visual Studio 2012 C++ code) to take advantage of SSE. I want to add 7 numbers for 1e9 times so i placed them from RAM to xmm0 to xmm6 registers of CPU. when i do it with inline assembly in visual…

c++ performance optimization assembly sse2

asked Mar 11 '13 at 21:46

epsi1on

0 answers

Can Visual Studio tell me the SSE2 register spill count of compiled code?

I do not have any real compiler knowledge, and I used to hand-code SSE2 functions for selected pieces of code. I know how to read the generated machine code, but largely unaware of the crazy optimizations made possible by compilers. All of my work…

visual-studio optimization compiler-construction sse2

asked Jul 28 '11 at 20:37

rwong

1 answer

What is the point of SSE2 instructions such as orpd?

The orpd instruction is a "bitwise logical OR of packed double precision floating point values". Doesn't this do exactly the same thing as por ("bitwise logical OR")? If so, what's the point of having it?

assembly x86 sse instruction-set sse2

asked May 31 '20 at 05:28

tbodt

1 answer

How to convert scalar code of the double version of VDT's Pade Exp fast_ex() approx into SSE2?

Here's the code I'm trying to convert: the double version of VDT's Pade Exp fast_ex() approx (here's the old repo resource): inline double fast_exp(double initial_x){ double x = initial_x; double px=details::fpfloor(details::LOG2E * x…

c++ sse intrinsics sse2 exp

asked Jan 25 '19 at 11:44

markzzz

3 answers

SIMD: Why is the SSE RGB to YUV color conversion about the same speed as the c++ implementation?

I've just tried to optimize an RGB to YUV420 converter. Using a lookup table yielded a speed increase, as did using fixed point arithmetic. However I was expecting the real gains using SSE instructions. My first go at it resulted in slower code and…

c++ optimization rgb yuv sse2

asked Jan 28 '11 at 14:08

Ralf

1 answer

How to extract bytes from an SSE2 __m128i structure?

I'm a beginner with SIMD intrinsics, so I'll thank everyone for their patience in advance. I have an application involving absolute difference comparison of unsigned bytes (I'm working with greyscale images). I tried AVX, more modern SSE versions…

c image-processing vectorization simd sse2

asked Oct 05 '16 at 22:59

sacheie

6 answers

How to optimize a cycle?

I have the following bottleneck function. typedef unsigned char byte; void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3) { const byte b1 = 128-30; const byte b2 = 128+30; for (const byte * p1 =…

c++ optimization assembly intrinsics sse2

asked Oct 21 '10 at 11:40

Alexey Malistov

1 answer

gcc -mno-sse2 rounding

I'm doing a project where I do RGB to luma conversions, and I have some rounding issues with the -mno-sse2 flag: Here's the test code: #include #include static double rec709_luma_coeff[3] = {0.2126, 0.7152, 0.0722}; int…

c gcc compilation rounding sse2

asked Jan 28 '16 at 18:26

user3618511
