Questions tagged [sse2]

x86 Streaming SIMD Extensions 2 adds support for packed integer and double-precision floating-point data in the 128-bit XMM vector registers. It is always supported on x86-64, and supported on every x86 CPU from 2003 or later.

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions, and the SSE tag wiki for other SSE- and SSE2-related resources.


SSE2 is one of the SSE family of x86 instruction-set extensions.

SSE2 adds support for double-precision floating point and for packed integers (8-bit to 64-bit elements) in XMM registers. It is baseline in x86-64, so 64-bit code can always assume SSE2 support without having to check. 32-bit code could still be run on a CPU from before 2003 (Athlon XP or Pentium III) that doesn't support SSE2, but this is unlikely for most newly written code, so an MMX or original-SSE fallback is usually not worth writing.
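As a minimal sketch of the two feature groups just described, the following C fragment uses the standard SSE2 intrinsics from `<emmintrin.h>`. The helper names `add_4x_i32` and `mul_2x_f64` are made up for illustration; unaligned loads and stores are used so that callers do not need 16-byte-aligned buffers.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: add four 32-bit integers at once (PADDD). */
static void add_4x_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned 128-bit load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}

/* Hypothetical helper: multiply two pairs of doubles at once (MULPD). */
static void mul_2x_f64(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);   /* two doubles fill one XMM register */
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_mul_pd(va, vb));
}
```

Calling `add_4x_i32` with two 4-element `int32_t` arrays writes the element-wise sums to `out`; `mul_2x_f64` does the same for element-wise products of two 2-element `double` arrays. On x86-64 this compiles without extra flags; 32-bit builds typically need something like `-msse2` (GCC/Clang).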

Most tasks that benefit from vectors at all can be done fairly efficiently using only instructions up to SSE2. This is fortunate, because widespread support for later SSE versions took time. Using later SSE extensions typically saves a couple of instructions here and there, usually with only minor speed-ups. Notably absent until SSSE3 was PSHUFB, a shuffle whose operation is controlled by elements in a register rather than by a compile-time constant imm8. It can do things that SSE2 can't do efficiently at all.

AVX provides non-destructive 3-operand (VEX-encoded) versions of the SSE2 instructions that operate on XMM registers.

History

Intel introduced SSE2 with the Pentium 4, first released in late 2000.

AMD adopted SSE2 for its 64-bit CPU line in 2003/2004. As of 2009, few if any x86 CPUs in significant use lack the SSE2 instruction set. This makes it very attractive on the Windows PC platform: it offers a large feature set that can practically be treated as an omnipresent "minimum requirement". In 32-bit mode, however, this does not remove the need to check processor features.
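A feature check of the kind mentioned above could look like the following sketch. It assumes GCC or Clang, whose `<cpuid.h>` header provides `__get_cpuid`; other compilers expose CPUID differently (MSVC has `__cpuid` in `<intrin.h>`). The function name `cpu_has_sse2` is made up for illustration. SSE2 support is reported in bit 26 of EDX for CPUID leaf 1.

```c
#include <cpuid.h>  /* GCC/Clang-specific header providing __get_cpuid */

/* Hypothetical helper: returns 1 if CPUID reports SSE2, else 0. */
static int cpu_has_sse2(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    return (int)((edx >> 26) & 1u);  /* CPUID.1:EDX bit 26 = SSE2 */
}
```

A 32-bit program would call this once at start-up and select either the SSE2 code path or a scalar fallback. This only tests what the CPU reports; it assumes the operating system saves and restores XMM state, which any current desktop OS does.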

More recent instruction sets introduce fewer features, which are often highly specialized. At the same time, they are supported inconsistently between manufacturers and by a significantly smaller share of processors (10-50% in 2009).

SSE2 does not offer instructions for horizontal addition, which are needed for some geometric calculations (e.g. dot product) and complex arithmetic. This functionality has to be emulated with one or more shuffles combined with ordinary vertical adds, which are often not significantly slower than the dedicated instructions in later extensions (e.g. HADDPS in SSE3).
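One common way to emulate the missing horizontal add is sketched below in C with intrinsics from `<emmintrin.h>`. The helper names `hsum_ps`, `hsum_epi32` and `dot4` are made up for illustration, and this is only one of several possible shuffle sequences, not necessarily the fastest on every CPU.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */
#include <stdint.h>

/* Hypothetical helper: sum the four floats in v using shuffles and adds. */
static float hsum_ps(__m128 v)
{
    /* Swap adjacent elements: [v1, v0, v3, v2]. */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* Pairwise sums: [v0+v1, v0+v1, v2+v3, v2+v3]. */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Move the high pair of sums down to the low half. */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Low element becomes (v0+v1) + (v2+v3). */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

/* Hypothetical helper: sum the four 32-bit integers in v. */
static int32_t hsum_epi32(__m128i v)
{
    /* Swap the two 64-bit halves and add. */
    __m128i hi64 = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum64 = _mm_add_epi32(v, hi64);
    /* Swap adjacent 32-bit elements and add. */
    __m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);  /* extract the low element */
}

/* Hypothetical helper: 4-element dot product built on hsum_ps. */
static float dot4(const float *a, const float *b)
{
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return hsum_ps(prod);
}
```

When many dot products are needed, it is usually better to arrange the data so that sums accumulate vertically across loop iterations and the horizontal step runs only once at the end.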

275 questions
0
votes
2 answers

replicating x64 MOVQ in x86 assembly

How could i go about replicating a x64 MOVQ (move quad word) instruction in x86 assembly? For example. Given: movq xmm5, [esi+2h] movq [edi+f1h], xmm5 Would this work? : push eax push edx mov eax, [esi+2h] mov edx, [esi+6h] ; +4 byte offset …
0
votes
2 answers

GDB is reporting EXC_BAD_ACCESS, when manipulating SSE2 registers

So I'm trying to code an AESNI library. When I compile my program with symbols and run it in GDB. I get the following error: Program received signal EXC_BAD_ACCESS, Could not access memory. Reason: 13 at address: 0x0000000000000000 Code: (g++ -g…
Nocturnal
  • 683
  • 7
  • 25
0
votes
2 answers

SSE2 - "The system cannot execute the specified program"

I recently developed a Visual C++ console application which uses inline SSE2 instructions. It works fine on my computer, but when I tried it on another, it returns the following error: The system cannot execute the specified program Note that the…
Jacob
  • 34,255
  • 14
  • 110
  • 165
0
votes
3 answers

SSE2 - 16-byte aligned dynamic allocation of memory

EDIT: This is a followup to SSE2 Compiler Error This is the real bug I experienced before and have reproduced below by changing the _mm_malloc statement as Michael Burr suggested: Unhandled exception at 0x00415116 in SO.exe: 0xC0000005: Access…
Jacob
  • 34,255
  • 14
  • 110
  • 165
0
votes
1 answer

Moving a quadword number to xmm registers

I am trying to move a number in a 64-bit register to an xmm register to do arithmetic. My thinking was: movq xmm1, r14 In my program r14 is holding the counter and I need it to get moved into xmm1 so I can divide it with the sum of numbers i have…
GolfinGamer
  • 39
  • 2
  • 8
0
votes
1 answer

Where can I find a good implementation of exp(double) using SSE2 instructions on x86/x64?

I've established that the Microsoft implementations of exp(double) in the VS2010 C library use different algorithms on Win32 (i.e. 32-bit x86) and x64 platforms, even though I've enabled SSE2 for the x86 platform and verified that the SSE2 code path…
dc42
  • 314
  • 3
  • 6
0
votes
0 answers

Is it possible to use C# with Salsa 20/12 so that Intel/AMD SSE2 acceleration is used?

I'm interested in the eStream project and using C# to encrypt / decrypt data streams with Intel/AMD acceleration. How can I use C# to interact with Intel/AMD hardware so I can get the following algorithms to work: Salsa 20/12 Sosamaunk
makerofthings7
  • 60,103
  • 53
  • 215
  • 448
0
votes
0 answers

sse2 multiplication vectors X and Y using multithreaded algorithm in cpp

So my code for thread is: DWORD WINAPI ThreadFunc1(LPVOID lpParam ) { THREAD_DATA *ptrDat = (THREAD_DATA *)(lpParam); int loc_N = ptrDat->loc_N ; int ntimes = ptrDat->ntimes; __m128d rx0, ry0, result0; for( int ip= 0; ip < ntimes; ip++ ) { …
Mariola
  • 249
  • 7
  • 16
0
votes
1 answer

converting four floats in xmm3 to four ints in memory

I am newbie to sse, and I have trouble to find it, please tell me what is the good way to convert (truncate as in "(int) float_") four packed floats I have in xmm3 register into four ints and store it into memory (some like "movaps oword…
grunge fightr
  • 1,360
  • 2
  • 19
  • 38
0
votes
2 answers

SQRT vs RSQRT vs SSE _mm_rsqrt_ps Benchmark

I have not found any clear benchmark about this subject so I made one. I will post it here in case anybody is looking for this like me. I have one question though. Isn't SSE supposed to be 4 times faster than four fpu RSQRT in a loop? It is faster…
Etherealone
  • 3,488
  • 2
  • 37
  • 56
0
votes
1 answer

Call a function lower in the script from a function higher in the script

I'm trying to come up with a way to make the computer do some work for me. I'm using SIMD (SSE2 & SSE3) to calculate the cross product, and I was wondering if it could go any faster. Currently I have the following: const int maskShuffleCross1 =…
knight666
  • 1,599
  • 3
  • 22
  • 38
0
votes
1 answer

Image quality is decresing when MMX SSE to C code conversion

I am Converting an MMX SSE to Equivalent C Code. I have almost converted it but the image quality what I am getting is not proper or I can see some noise is coming in image. I am debugging the code from last 5 days but I am not getting any reason…
0
votes
2 answers

SSE 2 function execution timing not constant and is more than normal

Using SSE 2, on Intel core2Duo. The time spent in sse_add() and normal_add() is not constant in multiple run, and in fact now after several modifications is always coming out as 0. The program basically finds the sum of each of the columns of the…
gpuguy
  • 4,607
  • 17
  • 67
  • 125
0
votes
2 answers

Adding two __m128 types via Accelerate framework

I need to add/mul/sub two __m128 (float) variables using Accelerate framework. But, I can't find function to do that. All Accelerate framework functions takes int__vector__ type instead float__vector__ type. I find function for dividing 'vdivf',…
Lexandr
  • 679
  • 1
  • 6
  • 22
-1
votes
2 answers

SIMD code vs Scalar Code

The following loop is executed hundreds of times. elma and elmc are both unsigned long (64-bit) arrays, so is res1 and res2. unsigned long simdstore[2]; __m128i *p, simda, simdb, simdc; p = (__m128i *) simdstore; for (i = 0; i < _polylen;…
anup
  • 529
  • 5
  • 14