Questions tagged [sse2]

x86 Streaming SIMD Extensions 2 adds support for packed-integer and double-precision floating-point operations in the 128-bit XMM vector registers. It is always supported on x86-64, and supported on every x86 CPU from 2003 or later.

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions, and the SSE tag wiki for other SSE- and SSE2-related resources.


SSE2 is one of the SSE family of x86 instruction-set extensions.

SSE2 adds support for double-precision floating point and packed integers (8bit to 64bit elements) in XMM registers. It is baseline in x86-64, so 64bit code can always assume SSE2 support without having to check. 32bit code could still be run on a CPU from before 2003 (Athlon XP or Pentium III) that doesn't support SSE2, but this is unlikely for most newly-written code, so an MMX or original-SSE fallback is usually not worth writing.
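As a minimal sketch of the two data types mentioned above, the following C code adds two packed doubles and sixteen packed 8bit integers using the standard SSE2 intrinsics from emmintrin.h. The function names add2_f64 and add16_u8 are hypothetical, chosen only for this illustration. 32bit builds with GCC or Clang would need -msse2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Add two pairs of doubles with a single packed add (ADDPD). */
static void add2_f64(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);   /* unaligned load of 2 doubles */
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}

/* Add sixteen 8-bit elements with a single packed add (PADDB).
   Each element wraps around on overflow, like uint8_t arithmetic. */
static void add16_u8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}
```

Both functions use unaligned loads and stores, so the pointers need no particular alignment.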

Most tasks that benefit from vectors at all can be done fairly efficiently using only instructions up to SSE2. This is fortunate, because widespread support for later SSE versions took time. Use of later SSE extensions typically saves a couple of instructions here and there, usually with only minor speed-ups. Notably absent until SSSE3 was PSHUFB, a shuffle whose operation is controlled by elements in a register rather than by a compile-time constant imm8. It can do things that SSE2 can't do efficiently at all.
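To illustrate what a shuffle controlled by a compile-time constant imm8 looks like in SSE2, here is a small sketch that reverses four 32bit elements with PSHUFD. The function name reverse4_i32 is hypothetical. The shuffle pattern is fixed when the code is compiled, which is the limitation that PSHUFB later lifted.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Reverse the order of four 32-bit integers using PSHUFD.
   The second argument of _mm_shuffle_epi32 must be a compile-time
   constant: each 2-bit field selects the source element for one
   destination element. */
static void reverse4_i32(const int32_t *in, int32_t *out)
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* dst[0] = src[3], dst[1] = src[2], dst[2] = src[1], dst[3] = src[0] */
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_si128((__m128i *)out, v);
}
```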

AVX provides 3-operand versions of all SSE2 instructions.

History

Intel introduced SSE2 with their Pentium 4 design in 2001.

AMD adopted SSE2 for its 64bit CPU line in 2003/2004. As of 2009, few if any x86 CPUs without SSE2 remain in significant numbers. This makes SSE2 extremely attractive on the Windows PC platform: it offers a large feature set that can practically be treated as an omnipresent "minimum requirement". At least in 32bit mode, however, this does not remove the need to check processor features.
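A runtime feature check of the kind mentioned above could be sketched as follows. This assumes GCC or Clang, whose __builtin_cpu_supports builtin queries CPUID. The function name have_sse2 is hypothetical, and on other 32bit compilers this sketch conservatively reports no support rather than issuing CPUID itself.

```c
/* Return 1 if SSE2 can be used, 0 otherwise. */
static int have_sse2(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    /* SSE2 is baseline in x86-64, so no check is needed. */
    return 1;
#elif defined(__GNUC__) && defined(__i386__)
    /* GCC/Clang builtin; returns nonzero if the CPU reports SSE2. */
    return __builtin_cpu_supports("sse2") != 0;
#else
    /* Assumption for this sketch: unknown target or compiler, report no SSE2. */
    return 0;
#endif
}
```

A caller would typically evaluate this once at startup and select an SSE2 or a scalar code path accordingly.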

More recent instruction sets introduce fewer features, which are often highly specialized. They are also supported inconsistently between manufacturers, and by a significantly smaller share of processors (10-50% in 2009).

SSE2 does not offer instructions for horizontal addition, which are needed for some geometric calculations (e.g. dot product) and for complex arithmetic. This functionality has to be emulated with one or more shuffles, which are often not significantly slower than the dedicated instructions in later SSE revisions.
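The shuffle-based emulation described above might look like the following sketch, which sums the four floats of a vector and uses that to compute a 4-element dot product. The function names hsum_ps and dot4 are hypothetical. This is one common shuffle sequence, not the only possible one.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (includes the SSE1 float ops) */

/* Horizontal sum of the four floats in v, using two shuffles
   instead of a dedicated horizontal-add instruction. */
static float hsum_ps(__m128 v)
{
    /* Swap the elements within each pair: (v1, v0, v3, v2). */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* sums = (v0+v1, v1+v0, v2+v3, v3+v2) */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Bring the high pair of sums down to the low two elements. */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Low element becomes (v0+v1) + (v2+v3). */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

/* Dot product of two 4-element float arrays (unaligned loads). */
static float dot4(const float *a, const float *b)
{
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return hsum_ps(prod);
}
```

When many dot products are needed, it is usually better to arrange the data so that vertical adds accumulate several results at once, and to do the horizontal step only at the end.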

275 questions
2
votes
0 answers

unresolved external symbol _WebPInitUpsamplersSSE2 - why?

I have compiled QT5.6.2 myself with Visual Studio 2015 to compile some other software (Telegram desktop; build instructions are here https://github.com/telegramdesktop/tdesktop/blob/dev/docs/building-msvc.md#setup-gypninja-and-generate-vs-solution…
IceFire
  • 4,016
  • 2
  • 31
  • 51
2
votes
0 answers

x86_64 crash because incorrect bytecode

I try to find the root cause of an infrequent crash. I build code on win32 while run on win64. The compiler of my native simd code is nasm, the command line is nasm -f win32 -DPREFIX -I./asm/ -o $(IntDir)%(Filename).obj %(FullPath) The original…
charlie
  • 65
  • 5
2
votes
1 answer

Finding a median of 3 values using SSE2 instruction set

My input data is 16-bit data, and I need to find a median of 3 values using SSE2 instruction set. If I have 3 16-bits input values A, B and C, I thought to do it like this: D = max( max( A, B ), C ) E = min( min( A, B ), C ) median = A + B + C - D -…
BЈовић
  • 62,405
  • 41
  • 173
  • 273
2
votes
1 answer

how to copy bytes into xmm0 register

I have the following code which works fine but seems inefficient given the end result only requiring the data in xmm0 mov rcx, 16 ; get first word, up to 16 bytes mov rdi, CMD ; ...and put…
poby
  • 1,572
  • 15
  • 39
2
votes
1 answer

Any preference to SHUFPD or PSHUFD for reversing two packed double in an XMM?

Question today is fairly short. Consider the following toy C program shuffle.c for reversing two packed double in register xmm0: #include void main () { double x[2] = {0.0, 1.0}; asm volatile ( "movupd (%[x]), %%xmm0\n\t" …
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
2
votes
1 answer

SSE: How to extract the sign bit for each packed byte, into a packed register?

Given packed bytes in xmm0, what is an efficient way to extract the sign (i.e. highest-order) bit of each byte into xmm1? In other words I want to compute the logical AND with 0x80 for each packed byte. For example: xmm0: 0xff 0xef 0x80 0x7f 0x01…
jacobsa
  • 5,719
  • 1
  • 28
  • 60
2
votes
2 answers

Complex data reorganization with vector instructions

I need to load and rearrange 12 bytes into 16 (or 24 into 32) following the pattern below: ABC DEF GHI JKL becomes ABBC DEEF GHHI JKKL Can you suggest efficient ways to achieve this using the SSE(2) and/or AVX(2) instructions ? This needs to be…
user1196549
2
votes
1 answer

SSE instruction MOVSD (extended: floating point scalar & vector operations on x86, x86-64)

I am somehow confused by the MOVSD assembly instruction. I wrote some numerical code computing some matrix multiplication, simply using ordinary C code with no SSE intrinsics. I do not even include the header file for SSE2 intrinsics for…
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
2
votes
3 answers

Why can't I remove _mm_empty()?

I have a c++ function with some SSE2 instructions. The problem is i am getting the following linker error when compiling this code using microsoft visual c++: unresolved external symbol _m_empty referenced in function "void * __cdecl process(void…
Mbt925
  • 1,317
  • 1
  • 16
  • 31
2
votes
1 answer

Test for SSE2 using CPUID versus trying SSE2 instruction and SIGILL?

I'm looking at some library code that performs the following. The CpuId function operates as expected. It loads EAX (function), ECX (subfunction) and then calls CPUID. struct CPUIDinfo { word32 EAX; word32 EBX; word32 ECX; word32…
jww
  • 97,681
  • 90
  • 411
  • 885
2
votes
2 answers

How the following following SSE2 code read data

I have found following SSE2 code written to multiply 2x2 matrix. Can anybody explain me how this code is executing. When I go through the code I feel it just add values into two positions of C(2x2) matrix (C[0],C[3]). lda is the size of the large…
user3817989
  • 715
  • 1
  • 8
  • 11
2
votes
2 answers

speed up Matrix Multiplication by SSE2

I want to know how speed up matrix multiplication by SSE2 here is my code int mat_mult_simd(double *a, double *b, double *c, int n) { __m128d c1,c2,a1,a2,b1; for(int i=0; i
2
votes
3 answers

Store four 16bit integers with SSE intrinsics

I multiply and round four 32bit floats, then convert it to four 16bit integers with SSE intrinsics. I'd like to store the four integer results to an array. With floats it's easy: _mm_store_ps(float_ptr, m128value). However I haven't found any…
plasmacel
  • 8,183
  • 7
  • 53
  • 101
2
votes
1 answer

How to achieve 8bit madd using SSE2

Reading from the official Intel C++ Intrinsic Reference, SSE 2 has the following command __m128i _mm_madd_epi16(__m128i a, __m128i b) Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b. Adds the signed 32-bit…
adkalkan
  • 69
  • 1
  • 7
2
votes
1 answer

SSE2 Compiler Error

I'm trying to break into SSE2 and tried the following example program: #include "stdafx.h" #include int main(int argc, char* argv[]) { __declspec(align(16)) long mul; // multiply variable __declspec(align(16)) int t1[100000]; //…
Jacob
  • 34,255
  • 14
  • 110
  • 165