Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data generally needs to be in structure-of-arrays form and should occur in long streams. Naively "SIMD-optimized" code often turns out to run slower than the original.
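As a rough illustration of the structure-of-arrays point, here is a minimal sketch in plain C (no intrinsics). The type and function names (`point_aos`, `points_soa`, `length_sq_aos`, `length_sq_soa`) are hypothetical and chosen only for this example. Both functions compute the same result; the difference is memory layout. In the SoA version each input stream is contiguous, which is the access pattern vector loads are designed for, whereas the AoS version reads each component with a stride of three floats. Whether a given compiler actually vectorizes either loop depends on the compiler and its flags.

```c
#include <stddef.h>

/* Array-of-structures: the x, y, z of one point sit next to each other. */
struct point_aos {
    float x, y, z;
};

/* Structure-of-arrays: all x values are contiguous, likewise y and z. */
struct points_soa {
    const float *x;
    const float *y;
    const float *z;
    size_t n;
};

/* AoS: each component is read with a stride of three floats. */
void length_sq_aos(const struct point_aos *p, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = p[i].x * p[i].x + p[i].y * p[i].y + p[i].z * p[i].z;
}

/* SoA: three unit-stride input streams and one unit-stride output stream. */
void length_sq_soa(const struct points_soa *p, float *out)
{
    for (size_t i = 0; i < p->n; i++)
        out[i] = p->x[i] * p->x[i] + p->y[i] * p->y[i] + p->z[i] * p->z[i];
}
```

Comparing the assembly a compiler emits for the two loops (for example with `-O3`) is one way to see how the layout affects the generated code.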
Questions tagged [simd]
2540 questions
16 votes
6 answers
Fastest Implementation of the Natural Exponential Function Using SSE
I'm looking for an approximation of the natural exponential function operating on an SSE element, namely __m128 exp(__m128 x).
I have an implementation which is quick but seems to be very low in accuracy:
static inline __m128 FastExpSse(__m128…

Royi
16 votes
4 answers
How do I perform a bitwise NOT in SSE/AVX?
Is it my imagination, or is a PNOT instruction missing from SSE and AVX? That is, an instruction which flips every bit in the vector.
If yes, is there a better way of emulating it than PXOR with a vector of all 1s? Quite annoying since I need to set…

SODIMM
16 votes
4 answers
How to Calculate single-vector Dot Product using SSE intrinsic functions in C
I am trying to multiply two vectors together where each element of one vector is multiplied by the element in the same index at the other vector. I then want to sum all the elements of the resulting vector to obtain one number. For instance, the…

Sam
16 votes
1 answer
Under what conditions does the .NET JIT compiler perform automatic vectorization?
Does the new RyuJIT compiler ever generate vector (SIMD) CPU instructions, and when?
Side note: The System.Numerics namespace contains types that allow explicit use of Vector operations which may or may not generate SIMD instructions depending on…

redcalx
16 votes
2 answers
How can I try out SIMD instructions in Chrome?
I would like to experiment with SIMD (single instruction multiple data). From what I can glean from Google Group postings, people have been working to add this to Google Chrome, but when I try to call SIMD.Float32x4 in Chrome 46, I get that SIMD is…

bruceceng
16 votes
4 answers
GCC C vector extension: How to check if result of ANY element-wise comparison is true, and which?
I am new to GCC's C vector extensions. According to the manual, the result of comparing one vector to another in the form (test = vec1 > vec2;) is that "test" contains a 0 in each element that is false and a -1 in each element that is true.
But how…

user1649948
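The comparison semantics quoted in the question above (0 for false, -1 for true in each element) can be seen in a small sketch using the GCC/Clang vector extension. The helper name `greater_mask` is hypothetical, and the scalar loop over lanes is just one straightforward way to collapse the result into a bitmask; it is not presented as the accepted answer or as the fastest approach.

```c
/* Four 32-bit ints packed in one 16-byte vector (GCC/Clang vector extension). */
typedef int v4si __attribute__((vector_size(16)));

/*
 * Hypothetical helper: returns a bitmask with bit i set when a[i] > b[i].
 * A result of 0 means no element compared true; otherwise the set bits
 * say which elements did.
 */
static int greater_mask(v4si a, v4si b)
{
    v4si test = a > b;              /* each lane is -1 (true) or 0 (false) */
    int mask = 0;
    for (int i = 0; i < 4; i++)
        mask |= (test[i] & 1) << i; /* -1 & 1 == 1, 0 & 1 == 0 */
    return mask;
}
```

With `a = {1, 5, 3, 7}` and `b = {2, 4, 3, 6}`, elements 1 and 3 compare true, so the mask has bits 1 and 3 set.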
16 votes
1 answer
Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision
Suppose that it is necessary to compute the reciprocal or reciprocal square root of packed floating-point data. Both can easily be done by:
__m128 recip_float4_ieee(__m128 x) { return _mm_div_ps(_mm_set1_ps(1.0f), x); }
__m128 rsqrt_float4_ieee(__m128…

stgatilov
16 votes
1 answer
Branch and predicated instructions
Section 5.4.2 of the CUDA C Programming Guide states that branch divergence is handled either by "branch instructions" or, under certain conditions, "predicated instructions". I don't understand the difference between the two, and why one leads to…

lodhb
16 votes
1 answer
How do I gain measurable benefit from prefetch intrinsics?
Using gcc 4.4.5 (yeah... I know it's old) on x86_64. Limited to SSE2 (or earlier) instructions for compatibility reasons.
I have what I think should be a textbook case for gaining big benefits from prefetching. I have an array ("A") of 32-bit…

Marty
16 votes
2 answers
Common SIMD techniques
Where can I find information about common SIMD tricks? I have an instruction set and know how to write non-tricky SIMD code, but I know SIMD is now much more powerful. It can express complex conditional code without branches.
For example (ARMv6), the…

zxcat
16 votes
4 answers
Methods to vectorise histogram in SIMD?
I am trying to implement a histogram in NEON. Is it possible to vectorise it?

Rugger
15 votes
2 answers
Number of Compute Units corresponding to the number of work groups
I need some clarification. I'm developing OpenCL on my laptop, which has a small NVIDIA GPU (310M). When I query the device for CL_DEVICE_MAX_COMPUTE_UNITS, the result is 2. I read the number of work groups for running a kernel should correspond to the…

rdoubleui
15 votes
1 answer
Get index of first element that is not zero in a __m256 variable
__m256 dst = _mm256_cmp_ps(value1, value2, _CMP_LE_OQ);
If dst is [0,0,0,-nan, 0,0,0,-nan];
I want to be able to know the first -nan index, in this case 3 without doing a for loop with 8 iterations.
Is this possible?

hidayat
15 votes
4 answers
Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`?
Why does _mm_extract_ps return an int instead of a float?
What's the proper way to read a single float from an XMM register in C?
Or rather, a different way to ask it is: What's the opposite of the _mm_set_ps instruction?

user541686
15 votes
5 answers
Does rewriting memcpy/memcmp/... with SIMD instructions make sense?
Does rewriting memcpy/memcmp/... with SIMD instructions make sense in large-scale software?
If so, why doesn't GCC generate SIMD instructions for these library functions by default?
Also, are there any other functions that can possibly be improved by…

limi