Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk, or vector, of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data should generally be in structure-of-arrays form and arrive in longer streams. Naively "SIMD-optimized" code frequently surprises by running slower than the original.

2540 questions

6 answers

Fastest Implementation of the Natural Exponential Function Using SSE

I'm looking for an approximation of the natural exponential function operating on SSE elements, namely __m128 exp( __m128 x ). I have an implementation which is quick but seems to be very low in accuracy: static inline __m128 FastExpSse(__m128…

asked Oct 30 '17 at 22:48

Royi

4 answers

How do I perform a bitwise NOT in SSE/AVX?

Is it my imagination, or is a PNOT instruction missing from SSE and AVX? That is, an instruction which flips every bit in the vector. If yes, is there a better way of emulating it than PXOR with a vector of all 1s? Quite annoying since I need to set…

x86 bit-manipulation simd sse avx

asked Mar 05 '17 at 20:50

SODIMM
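
For the bitwise NOT question above, a minimal sketch in C, assuming SSE2 is available. The helper name `not_si128` is made up for illustration; the approach is the XOR-with-all-ones emulation the question itself mentions, not a dedicated instruction.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Hypothetical helper: flip every bit of a 128-bit integer vector.
 * There is no single NOT intrinsic in SSE2, so XOR with all-ones. */
static inline __m128i not_si128(__m128i v)
{
    __m128i all_ones = _mm_set1_epi32(-1);  /* every bit set */
    return _mm_xor_si128(v, all_ones);
}
```

When the inverted value is immediately ANDed with something else, `_mm_andnot_si128(a, b)` computes `(~a) & b` in one step, so the separate NOT can sometimes be avoided.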

4 answers

How to Calculate single-vector Dot Product using SSE intrinsic functions in C

I am trying to multiply two vectors together where each element of one vector is multiplied by the element in the same index at the other vector. I then want to sum all the elements of the resulting vector to obtain one number. For instance, the…

c optimization vectorization sse simd

asked Nov 08 '10 at 01:26

Sam
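
The multiply-then-sum described in the dot product question above could be sketched as follows, assuming plain SSE and unaligned float arrays. The names `hsum_ps` and `dot_sse` are hypothetical, and this is one possible layout rather than the accepted answer.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE float intrinsics */

/* Hypothetical helper: add the four lanes of v into one float. */
static inline float hsum_ps(__m128 v)
{
    /* [a1, a0, a3, a2]: swap neighbouring lanes */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* [a0+a1, a0+a1, a2+a3, a2+a3] */
    __m128 sums = _mm_add_ps(v, shuf);
    /* move the upper pair (a2+a3) down to lane 0 */
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

/* Hypothetical helper: dot product of two float arrays of length n. */
static inline float dot_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;

    /* Four products per iteration, accumulated lane by lane. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }

    float sum = hsum_ps(acc);

    /* Scalar tail for lengths that are not a multiple of 4. */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

The horizontal sum is done once, after the loop, so the loop body stays purely vertical. Note that the lane-wise accumulation changes the order of the floating-point additions compared with a scalar loop, so results can differ in the last bits.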

1 answer

Under what conditions does the .NET JIT compiler perform automatic vectorization?

Does the new RyuJIT compiler ever generate vector (SIMD) CPU instructions, and when? Side note: The System.Numerics namespace contains types that allow explicit use of Vector operations which may or may not generate SIMD instructions depending on…

.net vectorization simd auto-vectorization ryujit

asked Feb 20 '16 at 16:00

redcalx

2 answers

How can I try out SIMD instructions in Chrome?

I would like to experiment with SIMD (single instruction multiple data). From what I can glean from Google Group postings, people have been working to add this to Google Chrome, but when I try to call SIMD.Float32x4 in Chrome 46, I get that SIMD is…

javascript google-chrome 32bit-64bit simd

asked Oct 28 '15 at 21:48

bruceceng

4 answers

GCC C vector extension: How to check if result of ANY element-wise comparison is true, and which?

I am new to GCC's C vector extensions. According to the manual, the result of comparing one vector to another in the form (test = vec1 > vec2;) is that "test" contains a 0 in each element that is false and a -1 in each element that is true. But how…

c gcc comparison vectorization simd

asked Jul 23 '15 at 20:20

user1649948
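
For the GCC vector extension question above, one possible sketch, assuming an x86 target with SSE2 and 4 x 32-bit integer vectors. The names `v4si`, `greater_mask` and `first_greater` are made up here. It collapses the 0 / -1 comparison result into a 4-bit mask with `_mm_movemask_ps`, which is an x86-specific step rather than part of the vector extension itself.

```c
#include <emmintrin.h>  /* SSE2; also provides _mm_movemask_ps */

/* Hypothetical typedef: four 32-bit ints via GCC's vector extension. */
typedef int v4si __attribute__((vector_size(16)));

/* Bit i of the result is set when a[i] > b[i]. */
static inline int greater_mask(v4si a, v4si b)
{
    v4si test = a > b;  /* each lane: -1 if true, 0 if false */
    /* movemask gathers the sign bit of every 32-bit lane */
    return _mm_movemask_ps(_mm_castsi128_ps((__m128i)test));
}

/* Index of the first lane where a[i] > b[i], or -1 if there is none. */
static inline int first_greater(v4si a, v4si b)
{
    int mask = greater_mask(a, b);
    return mask ? __builtin_ctz(mask) : -1;
}
```

A non-zero mask answers "is ANY comparison true", and the set bits answer "which". `__builtin_ctz` is a GCC/Clang builtin and is undefined for 0, hence the explicit check.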

1 answer

Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision

Suppose that it is necessary to compute reciprocal or reciprocal square root for packed floating point data. Both can easily be done by: __m128 recip_float4_ieee(__m128 x) { return _mm_div_ps(_mm_set1_ps(1.0f), x); } __m128 rsqrt_float4_ieee(__m128…

performance sse simd avx

asked Jul 22 '15 at 06:15

stgatilov
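
As a lower-precision counterpart to the IEEE division and square root versions in the question above, a common approach is the hardware estimate `_mm_rsqrt_ps` (roughly 12 bits of precision) refined by one Newton-Raphson step. The sketch below assumes plain SSE; the name `rsqrt_float4_nr` is hypothetical and only mirrors the naming in the excerpt.

```c
#include <xmmintrin.h>  /* SSE float intrinsics */

/* Hypothetical helper: approximate 1/sqrt(x) for four floats.
 * One Newton-Raphson step: y' = 0.5 * y * (3 - x * y * y). */
static inline __m128 rsqrt_float4_nr(__m128 x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);

    __m128 y   = _mm_rsqrt_ps(x);                  /* fast estimate */
    __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y, y));  /* x * y * y     */
    return _mm_mul_ps(_mm_mul_ps(half, y), _mm_sub_ps(three, xyy));
}
```

The refinement step is not safe for every input: for x = 0 the estimate is infinity and `x * y * y` becomes NaN, so inputs like that need separate handling if they can occur. Whether this beats the exact division and square root depends on the CPU and the precision required.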

1 answer

Branch and predicated instructions

Section 5.4.2 of the CUDA C Programming Guide states that branch divergence is handled either by "branch instructions" or, under certain conditions, "predicated instructions". I don't understand the difference between the two, and why one leads to…

cuda simd

asked May 17 '15 at 15:37

lodhb

1 answer

How do I gain measurable benefit from prefetch intrinsics?

Using gcc 4.4.5 (yeah... I know it's old) on x86_64. Limited to SSE2 (or earlier) instructions for compatibility reasons. I have what I think should be a textbook case for gaining big benefits from prefetching. I have an array ("A") of 32-bit…

performance x86-64 sse simd prefetch

asked Sep 20 '14 at 08:57

Marty

2 answers

Common SIMD techniques

Where can I find information about common SIMD tricks? I have an instruction set and know how to write non-tricky SIMD code, but I know SIMD is now much more powerful. It can express complex conditional code without branches. For example (ARMv6), the…

arm sse simd neon mmx

asked Jan 28 '10 at 17:04

zxcat

4 answers

Methods to vectorise histogram in SIMD?

I am trying to implement a histogram in NEON. Is it possible to vectorise it?

image-processing arm histogram simd neon

asked Oct 20 '12 at 06:38

Rugger

2 answers

Number of Compute Units corresponding to the number of work groups

I need some clarification. I'm developing OpenCL on my laptop running a small NVIDIA GPU (310M). When I query the device for CL_DEVICE_MAX_COMPUTE_UNITS, the result is 2. I read that the number of work groups for running a kernel should correspond to the…

opencl nvidia simd

asked Feb 17 '12 at 10:17

rdoubleui

1 answer

Get index of first element that is not zero in a __m256 variable

__m256 dst = _mm256_cmp_ps(value1, value2, _CMP_LE_OQ); If dst is [0,0,0,-nan, 0,0,0,-nan], I want to know the index of the first -nan (3 in this case) without doing a for loop with 8 iterations. Is this possible?

c++ c sse simd avx

asked Mar 31 '19 at 09:40

hidayat

4 answers

Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`?

Why does _mm_extract_ps return an int instead of a float? What's the proper way to read a single float from an XMM register in C? Or rather, a different way to ask it is: What's the opposite of the _mm_set_ps instruction?

c sse simd

asked Apr 02 '11 at 23:52

user541686
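
On the `_mm_extract_ps` question above: the underlying EXTRACTPS instruction writes the raw 32-bit pattern to a general-purpose register or memory, which is why the intrinsic is typed as `int`. A minimal sketch of the usual ways to read floats back out, assuming plain SSE; `first_lane`, `GET_LANE` and `unpack_ps` are hypothetical names.

```c
#include <xmmintrin.h>  /* SSE float intrinsics */

/* Lowest element as a float; no integer round trip involved. */
static inline float first_lane(__m128 v)
{
    return _mm_cvtss_f32(v);
}

/* Any element: shuffle lane i down to position 0 first.
 * The shuffle immediate must be a compile-time constant, hence a macro. */
#define GET_LANE(v, i) \
    _mm_cvtss_f32(_mm_shuffle_ps((v), (v), _MM_SHUFFLE(0, 0, 0, (i))))

/* All four elements at once: roughly the inverse of _mm_set_ps.
 * out[0] receives the lowest element, i.e. the LAST argument of _mm_set_ps. */
static inline void unpack_ps(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}
```

For example, after `__m128 v = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);`, `first_lane(v)` yields 1.0f and `GET_LANE(v, 3)` yields 4.0f, because `_mm_set_ps` takes its arguments from the highest element to the lowest.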

5 answers

Does rewriting memcpy/memcmp/... with SIMD instructions make sense?

Does rewriting memcpy/memcmp/... with SIMD instructions make sense in large-scale software? If so, why doesn't GCC generate SIMD instructions for these library functions by default? Also, are there any other functions that could possibly be improved by…

performance sse simd

asked Mar 16 '11 at 05:21

limi
