Questions tagged [sse]

SSE (Streaming SIMD Extensions) was the first of many similarly-named vector extensions to the x86 instruction set. At this point, the tag (and the name) is more often used as a catch-all for x86 vector instructions in general, not as a reference to SSE without SSE2, SSE3, etc. (For Server-Sent Events, use the [server-sent-events] tag instead.)

See the tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions.

SIMD / SSE basics: What are the 128-bit to 512-bit registers used for? with links to many examples.


SSE/SIMD vector programming guides, focused on the SIMD aspect rather than general x86:

  • Agner Fog's Optimizing Assembly guide has a chapter on vectors, including tables of data-movement instructions: broadcasts within a vector, combining data between two vectors, different kinds of shuffles, etc. It's great for finding the right instruction (or intrinsic) for the data movement you need.

  • Crunching Numbers with AVX and AVX2: An intro with examples of using C++ intrinsics

  • Slides + text: SIMD at Insomniac Games (GDC 2015): intro to SIMD, plus some specific examples like checking all doors against all characters in a level. Advanced tricks: filtering an array into a smaller array (left-packing based on a compare mask), with an SSSE3 pshufb solution and an SSE2 move-distance solution; also generating N-bit masks for variable-per-element N, including a clever float-exponent-based SSE2 version.


Instruction-set / intrinsics reference guides (see the x86 tag wiki for more links)


Miscellaneous specific things:


Streaming SIMD Extensions (SSE) basics

Together, the various SSE extensions allow working with 128b vectors of float, double, or integer (from 8b to 64b) elements. There are instructions for arithmetic, bitwise operations, shuffles, blends (conditional moves), compares, and some more-specialized operations (e.g. SAD for multimedia, carryless multiply for crypto/finite-field math, string handling for strstr() and the like). FP sqrt is provided, but unlike the x87 FPU, there are no instructions for math-library functions like sin; those must be implemented in software. SSE has replaced x87 for scalar FP math, now that hardware support is near-universal.
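
As a quick illustration (a minimal sketch, not part of the original tag wiki; the function name is made up), this is what packed single-precision arithmetic looks like with C/C++ intrinsics:

    #include <immintrin.h>   // intrinsics for SSE/SSE2/... (and AVX)

    // Add two arrays of 4 floats element-wise: one packed load per input,
    // one packed add, one packed store.
    void add4(const float* a, const float* b, float* out) {
        __m128 va   = _mm_loadu_ps(a);       // unaligned load of 4 floats
        __m128 vb   = _mm_loadu_ps(b);
        __m128 vsum = _mm_add_ps(va, vb);    // packed single-precision add (ADDPS)
        _mm_storeu_ps(out, vsum);            // unaligned store of 4 floats
    }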

Efficient use usually requires programs to store their data in contiguous chunks, so it can be loaded in chunks of 16B and used without too much shuffling. SSE only offers packed (contiguous) loads/stores, not strided access. (This is the SoA vs. AoS question: structure-of-arrays vs. array-of-structures; SoA usually suits SIMD better.) Alignment requirements on memory operands can also be a hurdle, even though modern hardware has fast unaligned loads/stores.
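
To make the SoA point concrete, here is a small sketch (struct and function names are hypothetical) showing why the SoA layout loads cleanly in 16B chunks:

    #include <immintrin.h>

    // AoS interleaves fields, so the x values of consecutive particles are
    // 16 bytes apart and can't be loaded as one contiguous chunk.
    struct ParticleAoS { float x, y, z, w; };

    // SoA keeps each field contiguous: four consecutive x values are one 16B chunk.
    struct ParticlesSoA { float x[1024], y[1024], z[1024], w[1024]; };

    void scale_x(ParticlesSoA& p, float s) {
        __m128 vs = _mm_set1_ps(s);                     // broadcast the scalar
        for (int i = 0; i < 1024; i += 4) {
            __m128 vx = _mm_loadu_ps(&p.x[i]);          // contiguous 16B load
            _mm_storeu_ps(&p.x[i], _mm_mul_ps(vx, vs)); // scale and store back
        }
    }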

While there are many instructions available, the instruction set is not very orthogonal. It's not uncommon to find the operation you need, but only for elements of a different size than you're working with. Another good example is that floating-point shuffles (SHUFPS, which picks two elements from each of two sources) have different semantics than 32b-integer shuffles (PSHUFD, which arbitrarily shuffles the four elements of a single source).
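
For example (a sketch, not taken from the guides above), reversing the element order of a vector shows how differently the two shuffles are used:

    #include <immintrin.h>

    // SHUFPS takes TWO sources: the low two result elements come from the first
    // operand, the high two from the second.  Passing the same vector twice makes
    // it behave like a one-source shuffle; here it reverses the element order.
    __m128 reverse_ps(__m128 v) {
        return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
    }

    // PSHUFD takes ONE source and can place any of its four 32b elements anywhere.
    __m128i reverse_epi32(__m128i v) {
        return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    }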

Details

SSE added new architectural registers (xmm0-xmm7, 128b each (xmm0-xmm15 in 64bit mode)), requiring OS support to save/restore them on context switches. The previous MMX extensions (for integer SIMD) reused the x87 FP registers.

Intel introduced MMX, original-SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2. AMD's XOP (a revision of their SSE5 plans) was never picked up by Intel, and will be dropped even by future AMD designs. The instruction-set war between Intel and AMD has led to many sub-optimal results, which makes instruction decoders in CPUs require more power and transistors. (And limits opportunity for further extensions).

"SSE" commonly refers to the whole family of extensions. Programs must of course only use instructions supported by the machine they run on; that requirement is assumed, and not worth cluttering every answer with. (Setting function pointers after detecting CPU features once at startup is a good way to avoid branching to select an appropriate version every time one is needed.)

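A minimal sketch of that dispatch pattern (the kernel names and trivial bodies here are made up; __builtin_cpu_init / __builtin_cpu_supports are GCC/clang builtins, so MSVC would need __cpuid instead):

    #include <cstddef>

    // Hypothetical kernels; real versions would use _mm_mul_ps / _mm256_mul_ps etc.
    static void scale_scalar(float* p, std::size_t n) { for (std::size_t i = 0; i < n; ++i) p[i] *= 2.0f; }
    static void scale_sse2(float* p, std::size_t n)   { for (std::size_t i = 0; i < n; ++i) p[i] *= 2.0f; }
    static void scale_avx(float* p, std::size_t n)    { for (std::size_t i = 0; i < n; ++i) p[i] *= 2.0f; }

    using scale_fn = void (*)(float*, std::size_t);

    static scale_fn pick_scale() {
    #if defined(__GNUC__) || defined(__clang__)
        __builtin_cpu_init();                          // GCC/clang CPU-feature detection
        if (__builtin_cpu_supports("avx"))  return scale_avx;
        if (__builtin_cpu_supports("sse2")) return scale_sse2;
    #endif
        return scale_scalar;                           // safe baseline
    }

    // Detect once, then every call is just an indirect call: no per-call branching.
    void scale(float* p, std::size_t n) {
        static const scale_fn fn = pick_scale();
        fn(p, n);
    }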
Further SSE extensions are not expected: AVX introduced a new 3-operand version of all SSE instructions, as well as some new features (including dropping alignment requirements, except for explicitly-aligned moves like vmovdqa). Further vector extensions will be called AVX-something, until Intel comes up with something different enough to change the name again.

History

SSE, first introduced with the Pentium III in 1999, was Intel's reply to AMD's 3DNow!, which had appeared in 1998.

The original SSE added packed single-precision floating-point math. Integer instructions operating on xmm registers (instead of the 64bit MMX registers) didn't appear until SSE2.

The original SSE can be considered somewhat half-hearted: it covered only the most basic operations and suffered from severe limitations in both functionality and performance, making it mostly useful for a few select applications such as audio or raster image processing.

Most of SSE's limitations were addressed by the SSE2 instruction set; the most notable remaining gap is the lack of an efficient, widely available horizontal add or dot product. SSE3 and SSE4.1 did add horizontal-add and dot-product instructions, but they're usually slower than a manual shuffle+add, so they're best kept for the end of a loop rather than the hot inner part.
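
For reference, a manual shuffle+add horizontal sum might look like this (a sketch using the SSE3 MOVSHDUP idiom; an SSE2-only version would use a shuffle in place of the first step):

    #include <immintrin.h>

    // Sum the 4 floats in v: two shuffle+add steps instead of HADDPS.
    float hsum_ps(__m128 v) {
        __m128 shuf = _mm_movehdup_ps(v);         // [v1, v1, v3, v3]  (SSE3)
        __m128 sums = _mm_add_ps(v, shuf);        // [v0+v1, ., v2+v3, .]
        shuf        = _mm_movehl_ps(shuf, sums);  // bring v2+v3 down to element 0
        sums        = _mm_add_ss(sums, shuf);     // (v0+v1) + (v2+v3) in element 0
        return _mm_cvtss_f32(sums);               // extract the scalar result
    }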

The lack of cross-manufacturer support made software development with SSE a challenge during the initial years. With AMD's adoption of SSE2 into its 64bit processors during 2003/2004, this problem gradually disappeared. As of today, there exist virtually no processors without SSE/SSE2 support. SSE2 is part of x86-64 baseline, with twice as many vector registers available in 64bit mode.

2314 questions
32 votes, 5 answers

best cross-platform method to get aligned memory

Here is the code I normally use to get aligned memory with Visual Studio and GCC inline void* aligned_malloc(size_t size, size_t align) { void *result; #ifdef _MSC_VER result = _aligned_malloc(size, align); #else …
user2088790
31 votes, 4 answers

print a __m128i variable

I'm trying to learn to code using intrinsics and below is a code which does addition compiler used: icc #include #include int main() { __m128i a = _mm_set_epi32(1,2,3,4); __m128i b = _mm_set_epi32(1,2,3,4); …
arunmoezhi
30 votes, 15 answers

Using SSE instructions

I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of the integer and then find the min and max values. I heard that if I use SSE instructions for these operations it will run…
Naveen
30 votes, 2 answers

How to implement atoi using SIMD?

I'd like to try writing an atoi implementation using SIMD instructions, to be included in RapidJSON (a C++ JSON reader/writer library). It currently has some SSE2 and SSE4.2 optimizations in other places. If it's a speed gain, multiple atoi results…
the_drow
30 votes, 1 answer

Why is my hand-tuned, SSE-enabled code so slow?

Long story short: I'm developing a computing-intensive image processing application in C++. It needs to calculate many variants of image warps on small blocks of pixels extracted from larger images. The program doesn't run as fast as I would like.…
neuviemeporte
29 votes, 1 answer

Is my understanding of AoS vs SoA advantages/disadvantages correct?

I've recently been reading about AoS vs SoA structure design and data-oriented design. It's oddly difficult to find information about either, and what I have found seems to assume greater understanding of processor functionality than I possess. That…
P...
29 votes, 5 answers

Get member of __m128 by index?

I've got some code, originally given to me by someone working with MSVC, and I'm trying to get it to work on Clang. Here's the function that I'm having trouble with: float vectorGetByIndex( __m128 V, unsigned int i ) { assert( i <= 3 ); …
benwad
28 votes, 5 answers

Benefits of x87 over SSE

I know that x87 has higher internal precision, which is probably the biggest difference that people see between it and SSE operations. But I have to wonder, is there any other benefit to using x87? I have a habit of typing -mfpmath=sse…
Tom
28 votes, 2 answers

Newton Raphson with SSE2 - can someone explain me these 3 lines

I'm reading this document: http://software.intel.com/en-us/articles/interactive-ray-tracing and I stumbled upon these three lines of code: The SIMD version is already quite a bit faster, but we can do better. Intel has added a fast 1/sqrt(x)…
Marco A.
28 votes, 3 answers

SSE, intrinsics, and alignment

I've written a 3D vector class using a lot of SSE compiler intrinsics. Everything worked fine until I started to instatiate classes having the 3D vector as a member with new. I experienced odd crashes in release mode but not in debug mode and the…
Jan Deinhard
27 votes, 3 answers

How to efficiently perform double/int64 conversions with SSE/AVX?

SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers. _mm_cvtps_epi32() _mm_cvtepi32_ps() But there are no equivalents for double-precision and 64-bit integers. In other words, they are…
plasmacel
27 votes, 2 answers

SIMD and difference between packed and scalar double precision

I am reading Intel's intrinsics guide while implementing SIMD support. I have a few confusions and my questions are as below. __m128 _mm_cmpeq_ps (__m128 a, __m128 b) documentation says it is used to compare packed single precision floating points.…
user1461001
25 votes, 3 answers

SSE (SIMD): multiply vector by scalar

A common operation I do in my program is scaling vectors by a scalar (V*s, e.g. [1,2,3,4]*2 == [2,4,6,8]). Is there a SSE (or AVX) instruction to do this, other than first loading the scalar in every position in a vector (e.g. _mm_set_ps(2,2,2,2))…
Hallgeir
24 votes, 1 answer

Can long integer routines benefit from SSE?

I'm still working on routines for arbitrary long integers in C++. So far, I have implemented addition/subtraction and multiplication for 64-bit Intel CPUs. Everything works fine, but I wondered if I can speed it a bit by using SSE. I browsed through…
cxxl
24 votes, 4 answers

How to move 128-bit immediates to XMM registers

There already is a question on this, but it was closed as "ambiguous" so I'm opening a new one - I've found the answer, maybe it will help others too. The question is: how do you write a sequence of assembly code to initialize an XMM register with a…
Virgil