Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include: x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To efficiently use SIMD instructions, data needs to be in structure-of-arrays form and should occur in longer streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.
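To illustrate the structure-of-arrays point in the tag description, here is a minimal C sketch. The type and function names (`PointAoS`, `PointsSoA`, `scale_x_aos`, `scale_x_soa`) are made up for this example. It only shows the two data layouts; whether a given compiler actually vectorizes either loop is an assumption that has to be checked on the target.

```c
#include <stddef.h>

/* Array-of-structures: x, y and z of one point sit next to each other,
   so consecutive x values are 12 bytes apart (strided access). */
struct PointAoS { float x, y, z; };

/* Structure-of-arrays: every field is its own contiguous stream,
   which is the layout the tag description recommends for SIMD. */
struct PointsSoA { float *x; float *y; float *z; };

/* Scales the x field in AoS form: one useful float per 12 bytes touched. */
void scale_x_aos(struct PointAoS *p, size_t n, float k) {
    for (size_t i = 0; i < n; ++i)
        p[i].x *= k;
}

/* Scales the x field in SoA form: a plain contiguous float loop. */
void scale_x_soa(struct PointsSoA *p, size_t n, float k) {
    for (size_t i = 0; i < n; ++i)
        p->x[i] *= k;
}
```

Both functions compute the same result; only the memory layout differs, and the contiguous SoA loop is the form that maps directly onto vector loads and stores.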

2540 questions
15
votes
1 answer

What is the difference between SPMD and SIMD?

I just can't understand the difference between them. Is SPMD at the programming level and SIMD at the hardware level? An example would be good. Thanks.
RanZilber
  • 1,840
  • 4
  • 31
  • 42
15
votes
3 answers

Why is this SIMD multiplication not faster than non-SIMD multiplication?

Let's assume that we have a function that multiplies two arrays of 1000000 doubles each. In C/C++ the function looks like this: void mul_c(double* a, double* b) { for (int i = 0; i != 1000000; ++i) { a[i] = a[i] * b[i]; } } The…
15
votes
1 answer

Is it possible to use SIMD instructions in Rust?

In C/C++, you can use intrinsics for SIMD (such as AVX and AVX2) instructions. Is there a way to use SIMD in Rust?
pythonic
  • 20,589
  • 43
  • 136
  • 219
15
votes
3 answers

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

AVX2 doesn't have any integer multiplications with sources larger than 32-bit. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies, but nothing with 64-bit sources. Let's say I need an unsigned multiply with inputs larger…
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
15
votes
2 answers

SIMD instructions for floating point equality comparison (with NaN == NaN)

Which instructions would be used for comparing two 128-bit vectors, each consisting of 4 * 32-bit floating point values? Is there an instruction that considers a NaN value on both sides as equal? If not, how big would the performance impact of a…
CodesInChaos
  • 106,488
  • 23
  • 218
  • 262
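For the NaN-equality question above, the intended semantics can be written down as a scalar model in C. This is a sketch of the behaviour being asked for, not a claim about which instruction sequence is best. The names `eq_or_both_nan` and `cmpeq_nan4` are hypothetical.

```c
#include <math.h>
#include <stdbool.h>

/* True when a and b compare equal, or when both are NaN.
   Note that +0.0f and -0.0f still compare equal here. */
static bool eq_or_both_nan(float a, float b) {
    return (a == b) || (isnan(a) && isnan(b));
}

/* Lane-wise model over 4 floats. Each output lane is all-ones or zero,
   mimicking the mask a SIMD compare instruction would produce. */
void cmpeq_nan4(const float a[4], const float b[4], unsigned int mask[4]) {
    for (int i = 0; i < 4; ++i)
        mask[i] = eq_or_both_nan(a[i], b[i]) ? 0xFFFFFFFFu : 0u;
}
```

In vector form the same logic would be an ordinary equality compare OR-ed with "both lanes unordered", which is why it usually costs a few extra instructions rather than a single compare.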
15
votes
1 answer

RyuJIT not making full use of SIMD intrinsics

I'm running some C# code that uses System.Numerics.Vector but as far as I can tell I'm not getting the full benefit of SIMD intrinsics. I'm using Visual Studio Community 2015 with Update 1, and my clrjit.dll is v4.6.1063.1. I'm running on an…
eoinmullan
  • 1,157
  • 1
  • 9
  • 32
15
votes
3 answers

How to use the multiply and accumulate intrinsics in ARM Cortex-a8?

How do I use the Multiply-Accumulate intrinsics provided by GCC? float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t); Can anyone explain which three parameters I have to pass to this function? I mean the source and destination registers…
HaggarTheHorrible
  • 7,083
  • 20
  • 70
  • 81
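As a reference for the `vmlaq_f32` question above: the intrinsic takes the accumulator first and returns `acc + b * c` per lane; there is no separate destination argument. The portable C sketch below models that behaviour so it can run on any machine. `f32x4_model` and `vmlaq_f32_model` are hypothetical stand-ins, not part of `arm_neon.h`.

```c
/* Hypothetical stand-in for float32x4_t: four float lanes. */
typedef struct { float lane[4]; } f32x4_model;

/* Scalar model of vmlaq_f32(acc, b, c): result lane i = acc[i] + b[i] * c[i].
   The result is returned by value, as with the real intrinsic. */
f32x4_model vmlaq_f32_model(f32x4_model acc, f32x4_model b, f32x4_model c) {
    f32x4_model r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = acc.lane[i] + b.lane[i] * c.lane[i];
    return r;
}
```

With the real intrinsic, a running sum is typically written as `acc = vmlaq_f32(acc, b, c);`, so the first argument and the assignment target are the same variable.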
15
votes
5 answers

Taking advantage of SSE and other CPU extensions

There are a couple of places in my code base where the same operation is repeated a very large number of times over a large data set. In some cases it's taking a considerable time to process these. I believe that using SSE to implement these loops…
Fire Lancer
  • 29,364
  • 31
  • 116
  • 182
15
votes
1 answer

Beating or meeting OS X memset (and memset_pattern4)

My question is based on another SO question: Why does _mm_stream_ps produce L1/LL cache misses? After reading it and being intrigued by it, I tried to replicate the results and see for myself which was faster: naive loop, unrolled naive loop,…
Aktau
  • 1,847
  • 21
  • 30
15
votes
1 answer

Shift a __m128i by n bits

I have a __m128i variable and I need to shift its 128-bit value by n bits, i.e. the way _mm_srli_si128 and _mm_slli_si128 work, but by bits instead of bytes. What is the most efficient way of doing this?
Filippo Bistaffa
  • 551
  • 3
  • 16
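The core difficulty in the 128-bit shift question above is carrying bits across the boundary between the two 64-bit halves. The portable C sketch below models a left shift with that carry. `u128_model` and `shl128` are hypothetical names, and this is a scalar model of the logic, not an SSE implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit value held as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128_model;

/* Shifts v left by n bits (0 <= n < 128), carrying bits from lo into hi. */
u128_model shl128(u128_model v, unsigned n) {
    u128_model r = v;
    assert(n < 128);
    if (n == 0)
        return r;                      /* avoid an undefined shift by 64 below */
    if (n >= 64) {
        r.hi = v.lo << (n - 64);       /* low half moves entirely into high */
        r.lo = 0;
    } else {
        r.hi = (v.hi << n) | (v.lo >> (64 - n));  /* carry the top n bits of lo */
        r.lo = v.lo << n;
    }
    return r;
}
```

An SSE2 version would usually express the same three pieces (shift each half, shift the carry the other way, move the carry across with a byte shift) using per-lane 64-bit shifts; that mapping is the usual approach, but it is not shown or measured here.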
15
votes
3 answers

Load address calculation when using AVX2 gather instructions

Looking at the AVX2 intrinsics documentation there are gathered load instructions such as VPGATHERDD: __m128i _mm_i32gather_epi32 (int const * base, __m128i index, const int scale); What isn't clear to me from the documentation is whether the…
Paul R
  • 208,748
  • 37
  • 389
  • 560
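For the gather question above, the address computation can be modelled in plain C. As far as I understand the documented behaviour of `_mm_i32gather_epi32`, each lane loads 32 bits from `base + index[i] * scale` bytes, with the index treated as a signed 32-bit value and `scale` being 1, 2, 4 or 8. The function name `gather_epi32_model` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a 4-lane 32-bit gather: lane i is loaded from the byte
   address base + (sign-extended index[i]) * scale. */
void gather_epi32_model(int32_t out[4], const int *base,
                        const int32_t index[4], int scale) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    for (int i = 0; i < 4; ++i) {
        const char *addr = (const char *)base + (int64_t)index[i] * scale;
        memcpy(&out[i], addr, sizeof out[i]);   /* load without alignment assumptions */
    }
}
```

With `scale` set to 4 the indices behave like ordinary `int` array subscripts; with `scale` set to 1 they are raw byte offsets from `base`.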
15
votes
2 answers

SIMD math libraries for SSE and AVX

I am looking for SIMD math libraries (preferably open source) for SSE and AVX. I mean, for example, if I have an AVX register v with 8 float values, I want sin(v) to return the sine of all eight values at once. AMD has a proprietary library, LibM…
user2088790
15
votes
3 answers

Sum reduction of unsigned bytes without overflow, using SSE2 on Intel

I am trying to compute the sum reduction of 32 elements (each 1 byte of data) on an Intel i3 processor. I did this: s=0; for (i=0; i<32; i++) { s = s + a[i]; } However, it's taking more time, since my application is a real-time application requiring…
gpuguy
  • 4,607
  • 17
  • 67
  • 125
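One commonly used SSE2 idiom for the byte-sum question above is `_mm_sad_epu8` against a zero vector, which adds up each group of 8 unsigned bytes into a 64-bit lane, so no 8-bit overflow can occur. The sketch below assumes an x86 compiler that defines `__SSE2__` and falls back to the scalar loop elsewhere. `sum32_bytes` is a hypothetical name, and no timing claim is made.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Sums 32 unsigned bytes; the maximum result is 32 * 255 = 8160. */
unsigned sum32_bytes(const uint8_t a[32]) {
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    __m128i v0 = _mm_loadu_si128((const __m128i *)a);         /* bytes 0..15  */
    __m128i v1 = _mm_loadu_si128((const __m128i *)(a + 16));  /* bytes 16..31 */
    /* Sum of absolute differences against zero = sum of each 8-byte group. */
    __m128i s = _mm_add_epi64(_mm_sad_epu8(v0, zero), _mm_sad_epu8(v1, zero));
    /* Fold the high 64-bit lane onto the low one. */
    s = _mm_add_epi64(s, _mm_srli_si128(s, 8));
    return (unsigned)_mm_cvtsi128_si32(s);
#else
    unsigned s = 0;                    /* portable fallback: widening scalar sum */
    for (int i = 0; i < 32; ++i)
        s += a[i];
    return s;
#endif
}
```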
14
votes
7 answers

Fastest way to compute distance squared

My code relies heavily on computing distances between two points in 3D space. To avoid the expensive square root I use the squared distance throughout. But it still takes up a major fraction of the computing time and I would like to replace my…
Pim Schellart
  • 715
  • 1
  • 6
  • 18
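For the squared-distance question above, one layout that tends to suit SIMD is to keep the coordinates in separate arrays and process many points per call. The sketch below is an assumption about how such a batch could look, not the asker's code. `dist_sq_batch` and its parameters are hypothetical, and whether the loop is auto-vectorized depends on the compiler and flags.

```c
#include <stddef.h>

/* Writes the squared distance from each point (xs[i], ys[i], zs[i])
   to the query point (qx, qy, qz) into out[i]. No square root is taken. */
void dist_sq_batch(const float *xs, const float *ys, const float *zs,
                   float qx, float qy, float qz,
                   float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        float dx = xs[i] - qx;
        float dy = ys[i] - qy;
        float dz = zs[i] - qz;
        out[i] = dx * dx + dy * dy + dz * dz;
    }
}
```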
14
votes
9 answers

How to quickly count bits into separate bins in a series of ints on Sandy Bridge?

Update: Please read the code; it is NOT about counting bits in one int. Is it possible to improve performance of the following code with some clever assembler? uint bit_counter[64]; void Count(uint64 bits) { bit_counter[0] += (bits >> 0) & 1; …
Łukasz Lew
  • 48,526
  • 41
  • 139
  • 208
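The visible first line of the bit-binning code above suggests one counter per bit position. Assuming the elided lines repeat that pattern for positions 1 through 63, a loop form in standard C would look like the sketch below. `bin_counts` and `count_bits_into_bins` are hypothetical names, and this is a baseline for comparison rather than an optimized answer.

```c
#include <stdint.h>

/* One counter per bit position (hypothetical name). */
uint32_t bin_counts[64];

/* Adds bit i of the input word to counter i, for all 64 positions. */
void count_bits_into_bins(uint64_t bits) {
    for (int i = 0; i < 64; ++i)
        bin_counts[i] += (uint32_t)((bits >> i) & 1u);
}
```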