Questions tagged [sse2]

x86 Streaming SIMD Extensions 2 adds support for packed integers and double-precision floats in the 128-bit XMM vector registers. It is always supported on x86-64, and supported on nearly every x86 CPU from 2003 or later.

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions, and the SSE tag wiki for other SSE- and SSE2-related resources.


SSE2 is one of the SSE family of x86 instruction-set extensions.

SSE2 adds support for double-precision floating point and packed integers (8-bit to 64-bit elements) in XMM registers. It is baseline in x86-64, so 64-bit code can always assume SSE2 support without having to check. 32-bit code could still be run on a CPU from before 2003 (Athlon XP or Pentium III) that doesn't support SSE2, but this is unlikely for most newly written code, so an MMX or original-SSE fallback is usually not worth writing.
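For 32-bit builds that still want to check, a minimal sketch of a detection helper might look like the following. The function name `has_sse2` is hypothetical, and the runtime branch assumes GCC or Clang, whose `__builtin_cpu_supports` builtin is used here; MSVC would need a CPUID-based check instead, which is not shown.

```c
/* Hypothetical helper: returns 1 if SSE2 can be used, 0 otherwise. */
static int has_sse2(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    /* SSE2 is part of the x86-64 baseline, so no runtime check is needed. */
    return 1;
#elif defined(__i386__) && defined(__GNUC__)
    /* 32-bit x86 with GCC/Clang: ask the CPU at run time. */
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? 1 : 0;
#else
    /* Other targets or compilers: conservatively report no SSE2. */
    return 0;
#endif
}
```

A caller would typically run this once at start-up and select between an SSE2 code path and a scalar one.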

Most tasks that benefit from vectors at all can be done fairly efficiently using only instructions up to SSE2. This is fortunate, because widespread support for later SSE versions took time. Use of later SSE extensions typically saves a couple instructions here and there, usually with only minor speed-ups. Notably absent until SSSE3 was PSHUFB, a shuffle whose operation was controlled by elements in a register, rather than a compile-time constant imm8. It can do things that SSE2 can't do efficiently at all.
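To illustrate the compile-time-constant restriction, here is a small sketch using the SSE2 `PSHUFD` intrinsic. The function name `reverse_epi32_sse2` is made up for this example; the point is only that the shuffle pattern is baked into the instruction as an immediate and cannot come from a register.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Reverse the four 32-bit elements of v with PSHUFD.
   The control value must be a compile-time constant (imm8). */
static __m128i reverse_epi32_sse2(__m128i v)
{
    /* Result element 0 = v[3], 1 = v[2], 2 = v[1], 3 = v[0]. */
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
}
```

A shuffle whose pattern is only known at run time has no equally direct SSE2 equivalent, which is the gap PSHUFB fills.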

AVX provides VEX-encoded versions of the SSE2 instructions that operate on XMM registers, giving most of them a non-destructive 3-operand form.

History

Intel introduced SSE2 with their Pentium 4 design in 2001.

SSE2 was adopted by AMD for its 64-bit CPU line in 2003/2004. As of 2009, few if any x86 CPUs remain in significant numbers that do not support SSE2. This makes it extremely attractive on the Windows PC platform: it offers a large feature set that can practically be treated as an omnipresent "minimum requirement". In 32-bit mode, however, this does not remove the need to check processor features.

More recent instruction sets introduce fewer features, which are often highly specialized, and are at the same time supported inconsistently between manufacturers and by a significantly smaller share of processors (10-50% in 2009).

SSE2 does not offer instructions for horizontal addition, which are needed for some geometric calculations (e.g. dot product) and complex arithmetic. This functionality has to be emulated with one or several shuffles, which, however, are often not significantly slower than the dedicated instructions in later extensions (e.g. HADDPS in SSE3).
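As a rough sketch of the shuffle-based emulation, the following sums the elements of a vector using only SSE and SSE2 intrinsics. The function names (`hsum_ps_sse2`, `hsum_pd_sse2`, `dot4_sse2`) are hypothetical, and this is one possible shuffle sequence rather than the only or necessarily the fastest one.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also provides the SSE ones) */

/* Horizontal sum of four floats: two shuffles and two adds. */
static float hsum_ps_sse2(__m128 v)
{
    /* Swap neighbours: shuf = [v1, v0, v3, v2]. */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* sums = [v0+v1, v0+v1, v2+v3, v2+v3]. */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Move the high half of sums into the low half of shuf. */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Element 0 now holds (v0+v1) + (v2+v3). */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

/* Horizontal sum of two doubles: one unpack and one scalar add. */
static double hsum_pd_sse2(__m128d v)
{
    __m128d high = _mm_unpackhi_pd(v, v);      /* [v1, v1] */
    return _mm_cvtsd_f64(_mm_add_sd(v, high)); /* v0 + v1 */
}

/* 4-element dot product: vertical multiply, then horizontal sum. */
static float dot4_sse2(__m128 a, __m128 b)
{
    return hsum_ps_sse2(_mm_mul_ps(a, b));
}
```

In a loop over many vectors it is usually better to accumulate vertically with `_mm_add_ps` and do a single horizontal sum at the end.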

275 questions
4
votes
1 answer

Converting 24 to 16 bit audio using SSE/simd instructions

I wonder if there is any fast method to do a 24 bit to 16 bit quantization on an array of audio samples (using intrinsics or asm). Source format is signed 24 le. Update : Managed to get the conversion done like described : static void __cdecl…
ohrfritz
  • 41
  • 2
4
votes
1 answer

Optimal ordering of memory read and write assembly instructions

I am wondering what the optimal order is for a sequence of instructions like the one below on Intel processors between Core 2 and Westmere. This is AT&T syntax, so that the pxor instructions are memory reads, and the movdqa are memory writes: …
Pascal Cuoq
  • 79,187
  • 7
  • 161
  • 281
4
votes
2 answers

accelerate rgb planar to rgba interleaved conversion using sse or mmx

I have to pass medical image data retrieved from one proprietary device SDK to an image processing function in another - also proprietary - device SDK from a second vendor. The first function gives me the image in a planar rgb format: int…
Veterinarian
  • 69
  • 1
  • 6
4
votes
2 answers

Penalty for switching from SSE to AVX?

I'm aware of the existing penalty for switching from AVX instructions to SSE instructions without first zeroing out the upper halves of all ymm registers, but in my particular case on my machine (i7-3939K 3.2GHz), there seems to be a very large…
Kumputer
  • 588
  • 1
  • 6
  • 22
4
votes
2 answers

How to vectorize a distance calculation using SSE2

A and B are vectors of length N, where N could be in the range 20 to 200 say. I want to calculate the square of the distance between these vectors, i.e. d^2 = ||A-B||^2. So far I have: float* a = ...; float* b = ...; float d2 = 0; for(int k = 0; k…
Bull
  • 11,771
  • 9
  • 42
  • 53
4
votes
1 answer

Vectorized extraction of a specific pattern of shorts from an array, and also insertion into a new array

I have an array of shorts where I want to grab half of the values and put them in a new array that is half the size. I want to grab particular values in this sort of pattern, where each block is 128 bits (8 shorts). This is the only pattern I will…
user173342
  • 1,820
  • 1
  • 19
  • 45
4
votes
1 answer

How do I add all elements in an array using SSE2?

Suppose I have a very simple code like: double array[SIZE_OF_ARRAY]; double sum = 0.0; for (int i = 0; i < SIZE_OF_ARRAY; ++i) { sum += array[i]; } I basically want to do the same operations using SSE2. How can I do that?
Peter Lee
  • 173
  • 1
  • 4
  • 8
4
votes
2 answers

SSE2, Visual Studio 2010, and Debug Build

Can the compiler make automatic use of SSE2 while optimisations are disabled? When optimisations are disabled, does the /arch:SSE2 flag mean anything? I've been given the task of squeezing more performance out of our software. Unfortunately, release…
Anthony
  • 12,177
  • 9
  • 69
  • 105
4
votes
1 answer

A better SSE2 implementation for float4::set_wxy (and other set-swizzle ops)?

I'm writing an HLSL float4 compliant type in C++ with SSE2/AVX intrinsics and at the moment I'm implementing all the set-swizzle operations available for float4 in HLSL. I'm trying to figure out an optimal SSE2 implementation to deal with…
snk_kid
  • 3,457
  • 3
  • 23
  • 18
4
votes
1 answer

How to optimize this Delphi function with SSE2?

I need a hint, how to implement this Delphi function using SSE2 assembly (32 Bit). Other optimizations are welcome too. Maybe one can tell me, what kind of instructions could be used, so I have a starting point for further reading. Actual: const…
Steffen Binas
  • 1,463
  • 20
  • 30
3
votes
1 answer

sse2 float multiplication

I tried to port some code from the FANN Lib (neural network written in C) to SSE2. But the SSE2 performance got worse than the normal code. With my SSE2 implementation one run takes 5.50 min; without it, 5.20 min. How could SSE2 be slower than…
martin s
  • 1,121
  • 1
  • 12
  • 29
3
votes
1 answer

What happens on an unaligned MOVSD on various CPUs?

Basically what the question says, if I execute a MOVSD that isn't 8-byte (or even 4-byte) aligned on various CPUs, what happens? Does it have a performance impact, can it segfault, etc.?
Alex Gaynor
  • 14,353
  • 9
  • 63
  • 113
3
votes
1 answer

Can FP compares like SSE2 _mm_cmpeq_pd be used to compare 64 bit integers?

Can FP compares like SSE2 _mm_cmpeq_pd / AVX _mm_cmp_pd be used to compare 64 bit integers? The idea is to emulate missing _mm_cmpeq_epi64 that would be similar to _mm_cmpeq_epi8, _mm_cmpeq_epi16, _mm_cmpeq_epi32. The concern is I'm not sure if the…
Alex Guteniev
  • 12,039
  • 2
  • 34
  • 79
3
votes
1 answer

Is there a difference between SVML vs. normal intrinsic square root functions?

Is there any sort of difference in precision or performance between normal sqrtps/pd or the SVML version: __m128d _mm_sqrt_pd (__m128d a) [SSE2] __m128d _mm_svml_sqrt_pd (__m128d a) [SSE?] __m128 _mm_sqrt_ps (__m128 a) [SSE] …
dave_thenerd
  • 448
  • 3
  • 10
3
votes
1 answer

Tweaking MIT's bitcount algorithm to count words in parallel?

I want to use a version of the well known MIT bitcount algorithm to count neighbors in Conway's game of life using SSE2 instructions. Here's the MIT bitcount in c, extended to count bitcounts > 63 bits. int bitCount(unsigned long long n) { unsigned…
Johan
  • 74,508
  • 24
  • 191
  • 319