Questions tagged [sse]

SSE (Streaming SIMD Extensions) was the first of many similarly-named vector extensions to the x86 instruction set. At this point, SSE is more often a catch-all for x86 vector instructions in general than a reference to SSE without SSE2, SSE3, etc. (For Server-Sent Events, use the [server-sent-events] tag instead.)

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions.

SIMD / SSE basics: What are the 128-bit to 512-bit registers used for? with links to many examples.

SSE/SIMD vector programming guides, focused on the SIMD aspect rather than general x86:

Agner Fog's Optimizing Assembly guide has a chapter on vectors, including tables of data movement instructions: broadcasts within a vector, combining data between two vectors, different kinds of shuffles, etc. It's great for finding the right instruction (or intrinsic) for the data movement you need.
Crunching Numbers with AVX and AVX2: An intro with examples of using C++ intrinsics
Slides + text: SIMD at Insomniac Games (GDC 2015): intro to SIMD, and some specific examples: checking all doors against all characters in a level. Advanced tricks: filtering an array into a smaller array (using left-packing based on a compare mask), with an SSSE3 pshufb solution and an SSE2 move-distance solution. Also: generating N-bit masks for variable-per-element N, including a clever float-exponent based SSE2 version.

Instruction-set / intrinsics reference guides (see the x86 tag wiki for more links)

Intel's vector intrinsics finder/search (very good): search by asm mnemonic or C intrinsic name. Filter by type and/or by instruction-set extension family (e.g. exclude AVX512 and later). Occasionally buggy, especially the performance info. (Look at Agner Fog's tables for performance info, although they have occasional errors or typos, too.)
Intel's manuals, including instruction set reference manual. Very detailed description of what every instruction does, using pseudo-code. These manuals are accurate much more often than the intrinsics guide.
x86/x64 SIMD Instruction List (SSE to AVX512) Beta: A nice compact table listing instruction mnemonics and their intrinsics, broken down by type and element-size. Detailed pages with graphical data-movement diagrams for each instruction.

Miscellaneous specific things:

Shuffling by mask with Intel AVX explains how shuffle-control vectors and _MM_SHUFFLE work, including in-lane vs. lane-crossing for AVX.
SSE interleave/merge/combine 2 vectors using a mask, per-element conditional move? Blends, especially variable blends (blendvps)
What are the best instruction sequences to generate vector constants on the fly? In C/C++, almost always prefer _mm_set or _mm_set1 to initialize local variables (not globals), rather than defining arrays and loading from them.
print a __m128i variable: How to safely and portably access the elements of a vector, and how to debug-print them.
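
As a rough illustration of the last two items, here is a minimal sketch that builds constants with _mm_set_epi16 / _mm_set1_epi16 and debug-prints a __m128i by storing it to a plain array first. The helper names (lanes_epi16, print_epi16, subtract_constant_demo) are made up for this example and are not taken from the linked answers.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

/* Copy the eight 16-bit lanes of v into out[0..7]; out[0] is the lowest lane.
   Storing to an array avoids reading the vector through a casted pointer. */
void lanes_epi16(__m128i v, int16_t out[8])
{
    _mm_storeu_si128((__m128i *)out, v);
}

/* Debug-print a vector as eight signed 16-bit values, lowest lane first. */
void print_epi16(const char *label, __m128i v)
{
    int16_t lanes[8];
    assert(label != NULL);
    lanes_epi16(v, lanes);
    printf("%s:", label);
    for (int i = 0; i < 8; i++)
        printf(" %d", lanes[i]);
    putchar('\n');
}

/* Build constants with set/set1 instead of loading from a hand-written array. */
__m128i subtract_constant_demo(void)
{
    /* _mm_set_epi16 takes its arguments from the highest lane down to the lowest. */
    __m128i a = _mm_set_epi16(100, 26, 26, 26, 26, 26, 26, 100);
    __m128i k = _mm_set1_epi16(82);   /* broadcast 82 to every lane */
    return _mm_sub_epi16(a, k);       /* wrapping 16-bit subtraction per lane */
}
```

Calling print_epi16("diff", subtract_constant_demo()) prints the lanes as signed values; reinterpreting the same bits as unsigned is a separate decision made when reading the array.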

Streaming SIMD Extensions (SSE) basics

Together, the various SSE extensions allow working with 128b vectors of float, double, or integer (from 8b to 64b) elements. There are instructions for arithmetic, bitwise operations, shuffles, blends (conditional moves), compares, and some more-specialized operations (e.g. SAD for multimedia, carryless-multiply for crypto/finite-field math, strings (for strstr() and so on)). FP sqrt is provided, but unlike the x87 FPU, math library functions like sin must be implemented by software. SSE for scalar FP math has replaced x87 floating point, now that hardware support is near-universal.

Efficient use usually requires programs to store their data in contiguous chunks, so it can be loaded in chunks of 16B and used without too much shuffling. SSE doesn't offer loads / stores with a stride, only packed. (SoA vs. AoS: structs-of-arrays vs. arrays-of-structs). Alignment requirements on memory operands can also be a hurdle, even though modern hardware has fast unaligned loads/stores.
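
A minimal sketch of why the structs-of-arrays layout helps, assuming a hypothetical points_soa type and an element count that is a multiple of 4: each load pulls 16 contiguous bytes, so no shuffling is needed before the arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* original-SSE float intrinsics */

/* Structs-of-arrays: all x values are contiguous, all y values are contiguous. */
struct points_soa {
    float *x;
    float *y;
    size_t count;   /* this sketch assumes a multiple of 4 */
};

/* out[i] = x[i]*x[i] + y[i]*y[i], four points per iteration. */
void squared_lengths(const struct points_soa *p, float *out)
{
    assert(p->count % 4 == 0);
    for (size_t i = 0; i < p->count; i += 4) {
        __m128 x = _mm_loadu_ps(p->x + i);   /* x[i..i+3] in one 16-byte load */
        __m128 y = _mm_loadu_ps(p->y + i);   /* y[i..i+3] */
        __m128 r = _mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y));
        _mm_storeu_ps(out + i, r);
    }
}
```

With an arrays-of-structs layout ({x, y} pairs interleaved in memory), the same computation would need shuffles to separate the x and y values after loading.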

While there are many instructions available, the instruction set is not very orthogonal. It's not uncommon to find the operation you need, but only available for elements of a different size than you're working with. Another good example is that floating point shuffles (SHUFPS) have different semantics than 32b-integer shuffles (PSHUFD).
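
The SHUFPS / PSHUFD difference can be sketched with intrinsics. The function names below are invented for this example; the same _MM_SHUFFLE selector is passed to both instructions, and the asserts in shuffle_demo record the lane order each one produces.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; also pulls in the SSE float intrinsics and _MM_SHUFFLE */

/* PSHUFD: all four result lanes are selected from the single source vector. */
void reverse_epi32(const int in[4], int out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_si128((__m128i *)out, r);
}

/* SHUFPS: the low two result lanes come from a, the high two from b. */
void shufps_two_sources(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 r = _mm_shuffle_ps(va, vb, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_ps(out, r);
}

/* Same selector, different semantics. */
void shuffle_demo(void)
{
    const int   i[4] = {0, 1, 2, 3};
    const float a[4] = {0.0f, 1.0f, 2.0f, 3.0f};
    const float b[4] = {10.0f, 11.0f, 12.0f, 13.0f};
    int ri[4];
    float rf[4];

    reverse_epi32(i, ri);
    assert(ri[0] == 3 && ri[1] == 2 && ri[2] == 1 && ri[3] == 0);

    shufps_two_sources(a, b, rf);
    assert(rf[0] == 3.0f && rf[1] == 2.0f && rf[2] == 11.0f && rf[3] == 10.0f);
}
```

Passing the same vector as both SHUFPS operands makes it behave like a one-source shuffle.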

Details

SSE added new architectural registers: xmm0-xmm7, 128b each (xmm0-xmm15 in 64bit mode). These require OS support to save/restore them on context switches. The previous MMX extensions (for integer SIMD) reused the x87 FP registers.

Intel introduced MMX, original-SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2. AMD's XOP (a revision of their SSE5 plans) was never picked up by Intel, and will be dropped even by future AMD designs. The instruction-set war between Intel and AMD has led to many sub-optimal results that make instruction decoders in CPUs require more power and transistors, and that limit the opportunity for further extensions.

"SSE" commonly refers to the whole family of extensions. Writing programs that make sure to only use instructions supported by the machine they run on is necessary, and implied, and not worth cluttering our language with. (Detecting what's supported once at startup and setting function pointers accordingly is a good approach: it avoids a branch to select an appropriate function every time one is needed.)
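
One way the function-pointer approach might look, as a sketch only. It assumes GCC or Clang on x86: __builtin_cpu_supports and the target attribute are compiler-specific, and the function names (mul_scalar, mul_sse41, init_dispatch, mul_i32) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1 intrinsics */

typedef void (*mul_fn)(int32_t *dst, const int32_t *a, const int32_t *b, size_t n);

/* Portable fallback for CPUs without SSE4.1. */
static void mul_scalar(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * b[i];
}

/* SSE4.1 version (PMULLD). The target attribute lets GCC/Clang compile this
   one function with SSE4.1 enabled without passing -msse4.1 for the whole file. */
__attribute__((target("sse4.1")))
static void mul_sse41(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_mullo_epi32(va, vb));
    }
    for (; i < n; i++)              /* leftover elements */
        dst[i] = a[i] * b[i];
}

static mul_fn mul_impl = NULL;

/* Run once at startup: detect the CPU feature and pick an implementation. */
void init_dispatch(void)
{
    mul_impl = __builtin_cpu_supports("sse4.1") ? mul_sse41 : mul_scalar;
}

/* Callers go through the pointer; there is no per-call feature check. */
void mul_i32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    assert(mul_impl != NULL);       /* init_dispatch() must have run first */
    mul_impl(dst, a, b, n);
}
```

The feature test happens once in init_dispatch(); every later call to mul_i32() is a single indirect call.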

Further SSE extensions are not expected: AVX introduced a new 3-operand version of all SSE instructions, as well as some new features (including dropping alignment requirements, except for explicitly-aligned moves like vmovdqa). Further vector extensions will be called AVX-something, until Intel comes up with something different enough to change the name again.

History

SSE, first introduced with the Pentium III in 1999, was Intel's reply to AMD's 3DNow extension in 1998.

The original-SSE added vector single-precision floating point math. Integer instructions to operate on xmm registers (instead of 64bit mmx regs) didn't appear until SSE2.

Original-SSE can be considered somewhat half-hearted insofar as it only covered the most basic operations and suffered from severe limitations both in functionality and performance, making it mostly useful for a few select applications, such as audio or raster image processing.

Most of SSE's limitations were ameliorated by the SSE2 instruction set. The only notable limitation remaining to date is the lack of a horizontal addition or dot product operation that is both efficient and widely available. While SSE3 and SSE4.1 added horizontal add and dot product instructions, they're usually slower than manual shuffle+add. Only use horizontal operations at the end of a loop, not inside it.
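
A sketch of the shuffle+add pattern, using only original-SSE intrinsics. The names hsum_ps and dot are chosen for this example, and dot assumes the length is a multiple of 4. The loop keeps four independent partial sums; the horizontal step runs once, after the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* original-SSE float intrinsics */

/* Horizontal sum of the four lanes with two shuffles and two adds. */
float hsum_ps(__m128 v)
{
    __m128 swapped = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* [v1, v0, v3, v2] */
    __m128 pairs   = _mm_add_ps(v, swapped);        /* [v0+v1, v0+v1, v2+v3, v2+v3] */
    __m128 high    = _mm_movehl_ps(swapped, pairs); /* lowest lane = v2+v3 */
    return _mm_cvtss_f32(_mm_add_ss(pairs, high));  /* v0+v1+v2+v3 */
}

/* Dot product: vertical multiply/add inside the loop, one horizontal sum at the end. */
float dot(const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    return hsum_ps(acc);
}
```

Because the partial sums are added in a different order than a scalar loop would use, the result can differ from a scalar dot product in the last bits for general floating-point inputs.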

The lack of cross-manufacturer support made software development with SSE a challenge during the initial years. With AMD's adoption of SSE2 into its 64bit processors during 2003/2004, this problem gradually disappeared. As of today, there exist virtually no processors without SSE/SSE2 support. SSE2 is part of x86-64 baseline, with twice as many vector registers available in 64bit mode.

2314 questions

votes

3 answers

assembly intrinsic to do a masked load

int main() { const int STRIDE=2,SIZE=8192; int i=0; double u[SIZE][STRIDE]; #pragma vector aligned for(i=0;i

c assembly sse simd intrinsics

asked Nov 03 '12 at 00:12

arunmoezhi

votes

1 answer

SSE unsigned/signed subtraction of 16 bit register

I have a __m128i register (Vector A) with 16 bit values with the content: {100,26,26,26,26,26,26,100} // A Vector Now I subtract the vector {82,82,82,82,82,82,82,82} With the instruction _mm_sub_epi16(a_vec,_mm_set1_epi16(82)) The expected…

c performance sse

asked Oct 31 '12 at 12:17

martin s

votes

3 answers

ZeroMemory in SSE

I need a simple ZeroMemory implementation with SSE (SSE2 preferred). Can someone help with that? I was searching through SO and the net but did not find a direct answer.

optimization assembly x86 sse

asked Oct 08 '12 at 17:50

grunge fightr

votes

0 answers

SSE floating point dot product for dummies

I have read many SO questions about SSE/SIMD (e.g., Getting started with SSE), but I'm still confused by all of it. All I want is a dot product between two double precision floating-point vectors, in C (C99 FWIW). I'm using GCC. Can someone post a…

gcc sse simd dot-product

asked Oct 05 '12 at 03:33

purple51

votes

1 answer

Are arrays initialized like `float[10][10]` already memory aligned for SIMD/SSE?

I need to optimize my matrix multiplication by using SIMD/Intel SSE. The example code given looks like: *x = (float*)memalign(16, size * sizeof(float)); However, I am using C++ and found that instead of malloc (before doing SIMD), I should…

c++ sse simd

asked Oct 03 '12 at 13:33

Jiew Meng

votes

6 answers

What's the most efficient way to multiply 4 floats by 4 floats using SSE?

I currently have the following code: float a[4] = { 10, 20, 30, 40 }; float b[4] = { 0.1, 0.1, 0.1, 0.1 }; asm volatile("movups (%0), %%xmm0\n\t" "mulps (%1), %%xmm0\n\t" "movups %%xmm0, (%1)" …

c gcc assembly sse sse2

asked Aug 04 '09 at 12:34

horseyguy

votes

1 answer

is it safe to use xmm registers to save the general-purpose ones?

pushf //couldnt store this in other registers movd xmm0,eax//storing in xmm registers instead of pushing movd xmm1,ebx// movd xmm2,ecx// movd xmm3,edx// movd xmm4,edi//end of push backups …

assembly x86 sse inline-assembly

asked Jul 24 '12 at 12:09

huseyin tugrul buyukisik

votes

1 answer

Does iPhone support SSE2?

There are so many statements in my code containing __m128i, _mm_loadu_si128, _mm_avg_epu8 and many more. These work on Mac but fail to compile on iOS. What are the replacements for these on iOS?

iphone ios sse

asked Jun 28 '12 at 11:42

pradeepa

votes

1 answer

(a*b)/256 and MMX

I'm wondering if it is possible to do the following calculation with four values parallel within a MMX-Register: (a*b)/256 where a is a signed word and b is an unsigned value (blend factor) in the range of 0-256 I think my problem is that I'm not…

assembly sse mmx

asked Jun 22 '12 at 13:44

jsi1

votes

1 answer

How to align 16-bit ints for use with SSE intrinsics

I am working with two-dimensional arrays of 16-bit integers defined as int16_t e[MAX_SIZE*MAX_NODE][MAX_SIZE]; int16_t C[MAX_SIZE][MAX_SIZE]; Where Max_SIZE and MAX_NODE are constant values. I'm not a professional programmer, but somehow with the…

c sse simd memory-alignment sse2

asked Jun 16 '12 at 21:31

SMir

votes

1 answer

Implementation and performance of using bitsets with SSE

I am trying to speed up my method using SSE (On Visual Studio). I am a novice in the area. The main data types I work with in my method are bitsets of size 32 and the logical operation I mainly use is the AND operation (with _BitScanForward scarcely…

x86 sse simd bitset

asked May 29 '12 at 15:28

SMir

votes

1 answer

How to count the number of bytes which lies in some range using SSE?

I want to write a c program which counts the number of bytes in a range a...c with below code: char a[16], b[16], c[16]; int counter = 0; for(i = 0; i < 16; i++) { if((a[i] < b[i]) && (b[i] < c[i])) counter++; } return counter; …

x86 sse simd

asked May 15 '12 at 21:38

quartz

votes

2 answers

What is the correct way of calculating a large CRC32

Here is an article that describes how to calculate CRC32 of maximum 1024 bytes using the built in CRC32 instruction found in modern x86-64 processors. However, I need to calculate CRC32 of more than 1024 bytes. Would it be a correct approach to…

c x86-64 sse crc32

asked Apr 26 '12 at 12:57

pythonic

votes

1 answer

How to do aligned additions without aligned arrays

So i was trying to do an array operation that looked something like for (int i=0;i++i<32) { output[offset+i] += input[i]; } where output and input are float arrays (which are 16-byte aligned thanks to malloc). However, I can't gurantee that…

c sse simd

asked Apr 24 '12 at 02:55

John Palmer

vote

2 answers

Calling different implementation of function based on SSE features

I am designing a series of Vector classes in C++ that support SSE(SIMD). The operators have been overloaded for convenience. Example of class: class vector2 { public: //...code friend const vector2 operator+ (const vector2 & lhs, const vector2 &…

c++ object operator-overloading sse

asked Apr 02 '12 at 22:38

Daniel Samson
