Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data needs to be in structure-of-arrays form and should be processed in long streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.

2540 questions
16 votes, 6 answers

Fastest Implementation of the Natural Exponential Function Using SSE

I'm looking for an approximation of the natural exponential function operating on an SSE element, namely __m128 exp( __m128 x ). I have an implementation which is quick but seems to be very low in accuracy: static inline __m128 FastExpSse(__m128…
Royi
  • 4,640
  • 6
  • 46
  • 64
16 votes, 4 answers

How do I perform a bitwise NOT in SSE/AVX?

Is it my imagination, or is a PNOT instruction missing from SSE and AVX? That is, an instruction which flips every bit in the vector. If yes, is there a better way of emulating it than PXOR with a vector of all 1s? Quite annoying since I need to set…
SODIMM
  • 303
  • 2
  • 12
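
As a rough illustration of the emulation this question mentions (XOR with a vector of all ones), here is a minimal sketch in C with SSE2 intrinsics. The helper names `not_si128` and `not_lane0` are hypothetical and exist only for this example; the all-ones vector is produced by comparing a register with itself, so no constant has to be loaded from memory.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Hypothetical helper: flip every bit of a 128-bit integer vector. */
static inline __m128i not_si128(__m128i v)
{
    /* Comparing a register with itself sets every bit of every lane. */
    __m128i all_ones = _mm_cmpeq_epi32(v, v);
    return _mm_xor_si128(v, all_ones);
}

/* Hypothetical test helper: broadcast x to all lanes, apply NOT,
   and return lane 0 so the result can be checked as a scalar. */
static inline uint32_t not_lane0(uint32_t x)
{
    __m128i v = _mm_set1_epi32((int)x);
    return (uint32_t)_mm_cvtsi128_si32(not_si128(v));
}
```

When the NOT feeds directly into an AND, `_mm_andnot_si128(a, b)` computes `(~a) & b` in one instruction, which may make the separate XOR unnecessary.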
16 votes, 4 answers

How to Calculate single-vector Dot Product using SSE intrinsic functions in C

I am trying to multiply two vectors together, where each element of one vector is multiplied by the element at the same index in the other vector. I then want to sum all the elements of the resulting vector to obtain one number. For instance, the…
Sam
  • 417
  • 1
  • 6
  • 13
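
The operation described in this question (element-wise multiply, then sum everything to one number) could be sketched as follows, using only baseline SSE. The function names `hsum_ps` and `dot_product` are assumptions made for this example, and the unaligned loads and scalar tail loop are one possible design, not necessarily what any answer recommends.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: add the four float lanes of v together. */
static inline float hsum_ps(__m128 v)
{
    /* Swap neighbouring lanes: [v1, v0, v3, v2]. */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* Pairwise sums: [v0+v1, v0+v1, v2+v3, v2+v3]. */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Move the upper pair sum (v2+v3) into lane 0. */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Lane 0 now holds v0+v1+v2+v3. */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

/* Hypothetical helper: dot product of two float arrays of length n. */
static inline float dot_product(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;

    /* Multiply and accumulate four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned loads for safety */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }

    /* Reduce the vector accumulator once, outside the loop. */
    float sum = hsum_ps(acc);

    /* Scalar tail for lengths that are not a multiple of four. */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Keeping the horizontal sum outside the loop matters: the shuffles are comparatively expensive, so they are paid once per call rather than once per four elements. On CPUs with SSE4.1, `_mm_dp_ps` is an alternative for a single four-element dot product.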
16 votes, 1 answer

Under what conditions does the .NET JIT compiler perform automatic vectorization?

Does the new RyuJIT compiler ever generate vector (SIMD) CPU instructions, and when? Side note: The System.Numerics namespace contains types that allow explicit use of Vector operations which may or may not generate SIMD instructions depending on…
redcalx
  • 8,177
  • 4
  • 56
  • 105
16 votes, 2 answers

How can I try out SIMD instructions in Chrome?

I would like to experiment with SIMD (single instruction multiple data). From what I can glean from Google Group postings, people have been working to add this to Google Chrome, but when I try to call SIMD.Float32x4 in Chrome 46, I get that SIMD is…
bruceceng
  • 1,844
  • 18
  • 23
16 votes, 4 answers

GCC C vector extension: How to check if result of ANY element-wise comparison is true, and which?

I am new to GCC's C vector extensions. According to the manual, the result of comparing one vector to another in the form (test = vec1 > vec2;) is that "test" contains a 0 in each element that is false and a -1 in each element that is true. But how…
user1649948
  • 651
  • 4
  • 12
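
To make the comparison semantics in this question concrete (each lane of the result is -1 where the test is true and 0 where it is false), here is a small portable sketch using GCC's vector extensions. The type `v4si` and the helpers `gt_mask` and `first_true_lane` are hypothetical names for this example; the lane-by-lane loop is the simplest approach, not necessarily the fastest one.

```c
#include <stdint.h>

/* Four 32-bit signed integers in one 16-byte vector (GCC/Clang extension). */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Hypothetical helper: bit i of the result is set when a[i] > b[i].
   A result of 0 means no lane passed the comparison. */
static inline unsigned gt_mask(v4si a, v4si b)
{
    v4si test = a > b;  /* each lane is -1 (true) or 0 (false) */
    unsigned mask = 0;
    for (int i = 0; i < 4; ++i)
        mask |= (unsigned)(test[i] & 1) << i;
    return mask;
}

/* Hypothetical helper: index of the first true lane, or -1 if none.
   __builtin_ctz is a GCC/Clang builtin and is undefined for 0,
   hence the explicit check. */
static inline int first_true_lane(unsigned mask)
{
    return mask ? __builtin_ctz(mask) : -1;
}
```

`gt_mask(a, b) != 0` answers "is ANY comparison true", and the set bits answer "which". On x86, a compiler may or may not turn the loop into a single movemask instruction; calling the target-specific intrinsic directly is the usual way to guarantee it.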
16 votes, 1 answer

Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision

Suppose that it is necessary to compute reciprocal or reciprocal square root for packed floating point data. Both can easily be done by: __m128 recip_float4_ieee(__m128 x) { return _mm_div_ps(_mm_set1_ps(1.0f), x); } __m128 rsqrt_float4_ieee(__m128…
stgatilov
  • 5,333
  • 31
  • 54
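
One commonly used middle ground between the exact division shown in this excerpt and the raw hardware estimate is the `_mm_rsqrt_ps` approximation refined by a single Newton-Raphson step. The sketch below is an illustration under that assumption; the names `rsqrt_float4_nr1` and `rsqrt_nr1_scalar` are hypothetical and do not come from the question.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: approximate 1/sqrt(x) per lane.
   Starts from the hardware estimate y0 and applies one Newton-Raphson step:
       y1 = y0 * (1.5 - 0.5 * x * y0 * y0)                                   */
static inline __m128 rsqrt_float4_nr1(__m128 x)
{
    __m128 y0     = _mm_rsqrt_ps(x);                     /* roughly 12-bit estimate */
    __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), x);
    __m128 y0_sq  = _mm_mul_ps(y0, y0);
    __m128 corr   = _mm_sub_ps(_mm_set1_ps(1.5f),
                               _mm_mul_ps(half_x, y0_sq));
    return _mm_mul_ps(y0, corr);
}

/* Hypothetical test helper: run the vector routine on one value. */
static inline float rsqrt_nr1_scalar(float x)
{
    return _mm_cvtss_f32(rsqrt_float4_nr1(_mm_set1_ps(x)));
}
```

The refinement step does not handle special inputs: for x = 0 the estimate is infinity and the correction multiplies 0 by infinity, so the result is NaN rather than infinity. Whether this is faster than the exact division and square root depends on the CPU.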
16 votes, 1 answer

Branch and predicated instructions

Section 5.4.2 of the CUDA C Programming Guide states that branch divergence is handled either by "branch instructions" or, under certain conditions, "predicated instructions". I don't understand the difference between the two, and why one leads to…
lodhb
  • 929
  • 2
  • 12
  • 29
16 votes, 1 answer

How do I gain measurable benefit from prefetch intrinsics?

Using gcc 4.4.5 (yeah... I know it's old) on x86_64. Limited to SSE2 (or earlier) instructions for compatibility reasons. I have what I think should be a textbook case for gaining big benefits from prefetching. I have an array ("A") of 32-bit…
Marty
  • 435
  • 5
  • 16
16 votes, 2 answers

Common SIMD techniques

Where can I find information about common SIMD tricks? I have an instruction set and know how to write non-tricky SIMD code, but I know SIMD is now much more powerful. It can express complex conditional code without branches. For example (ARMv6), the…
zxcat
  • 2,054
  • 3
  • 26
  • 40
16 votes, 4 answers

Methods to vectorise histogram in SIMD?

I am trying to implement a histogram in NEON. Is it possible to vectorise it?
Rugger
  • 373
  • 3
  • 10
15 votes, 2 answers

Number of Compute Units corresponding to the number of work groups

I need some clarification. I'm developing OpenCL on my laptop running a small NVIDIA GPU (310M). When I query the device for CL_DEVICE_MAX_COMPUTE_UNITS, the result is 2. I read that the number of work groups for running a kernel should correspond to the…
rdoubleui
  • 3,554
  • 4
  • 30
  • 51
15 votes, 1 answer

Get index of first element that is not zero in a __m256 variable

__m256 dst = _mm256_cmp_ps(value1, value2, _CMP_LE_OQ); If dst is [0,0,0,-nan, 0,0,0,-nan], I want to be able to find the index of the first -nan (in this case 3) without doing a for loop with 8 iterations. Is this possible?
hidayat
  • 9,493
  • 13
  • 51
  • 66
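
A loop-free way to find the first set lane is to collapse the comparison result into a bitmask and count trailing zeros. The sketch below shows the idea on a 128-bit `__m128` so that it builds with baseline SSE; for the `__m256` in the question, `_mm256_movemask_ps` would yield an 8-bit mask that can be handled the same way. The name `first_le_index` is hypothetical, `__builtin_ctz` is a GCC/Clang builtin, and `_mm_cmple_ps` stands in for the AVX `_CMP_LE_OQ` predicate (the two differ only in how they signal on NaN inputs).

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: index of the first lane where a[i] <= b[i],
   or -1 if the comparison is false in every lane. */
static inline int first_le_index(__m128 a, __m128 b)
{
    /* True lanes are all-ones, which print as -nan when viewed as float. */
    __m128 cmp = _mm_cmple_ps(a, b);

    /* Gather the sign bit of each lane into the low 4 bits of an int. */
    int mask = _mm_movemask_ps(cmp);

    /* Count trailing zeros to find the lowest set bit; ctz(0) is
       undefined, so the empty mask is handled separately. */
    return mask ? __builtin_ctz((unsigned)mask) : -1;
}
```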
15 votes, 4 answers

Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`?

Why does _mm_extract_ps return an int instead of a float? What's the proper way to read a single float from an XMM register in C? Or rather, a different way to ask it is: What's the opposite of the _mm_set_ps instruction?
user541686
  • 205,094
  • 128
  • 528
  • 886
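
For context on this question: the underlying EXTRACTPS instruction writes the raw 32-bit pattern of the selected lane to a general-purpose register or to memory, which is why the intrinsic is typed as `int`. A minimal sketch of reading a lane as an actual `float` uses `_mm_cvtss_f32`, which returns lane 0, combined with a shuffle for the other lanes. The names `get_lane0` and `GET_LANE` are hypothetical; the second one is a macro because the shuffle selector must be a compile-time constant.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: read lane 0 of v as a float. */
static inline float get_lane0(__m128 v)
{
    return _mm_cvtss_f32(v);
}

/* Hypothetical macro: read lane I (a constant 0..3) of v as a float.
   The shuffle broadcasts lane I to every position, then lane 0 is read.
   Note that v is evaluated twice. */
#define GET_LANE(v, I) \
    _mm_cvtss_f32(_mm_shuffle_ps((v), (v), _MM_SHUFFLE((I), (I), (I), (I))))
```

For example, `GET_LANE(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f), 2)` reads the third element. Storing the whole vector to a `float[4]` with `_mm_storeu_ps` is another option when several lanes are needed.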
15 votes, 5 answers

Does rewriting memcpy/memcmp/... with SIMD instructions make sense?

Does rewriting memcpy/memcmp/... with SIMD instructions make sense in large-scale software? If so, why doesn't GCC generate SIMD instructions for these library functions by default? Also, are there any other functions that could possibly be improved by…
limi
  • 695
  • 1
  • 8
  • 18