Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data generally needs to be in structure-of-arrays form and should occur in longer streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.
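As a rough illustration of the structure-of-arrays point above, here is a minimal sketch in plain C (no intrinsics). The type and function names (`point_aos`, `points_soa`, `scale_x_aos`, `scale_x_soa`) are made up for this example and do not come from any library. In the array-of-structures version, successive `x` values are three floats apart in memory. In the structure-of-arrays version they are contiguous, which is the layout vector loads and stores usually want.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structures (AoS): x, y, z of one point sit next to each other. */
struct point_aos { float x, y, z; };

/* Structure-of-arrays (SoA): all x values are contiguous, likewise y and z.
   These hypothetical types exist only for this illustration. */
struct points_soa {
    float *x;
    float *y;
    float *z;
    size_t n;   /* number of points */
};

/* AoS: successive x values are 3 floats apart (strided access). */
void scale_x_aos(struct point_aos *p, size_t n, float k)
{
    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        p[i].x *= k;
}

/* SoA: unit-stride loop over x, a shape compilers can typically
   auto-vectorize; whether they do depends on compiler and flags. */
void scale_x_soa(struct points_soa *p, float k)
{
    assert(p != NULL);
    for (size_t i = 0; i < p->n; ++i)
        p->x[i] *= k;
}
```

Both functions compute the same result. Whether the SoA version is actually faster depends on the compiler, the target, and the data size, so it is worth measuring rather than assuming.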
Questions tagged [simd]
2540 questions
15
votes
1 answer
What is the difference between SPMD and SIMD?
I just can't understand what the difference between them is...
Is SPMD at the programming level and SIMD at the hardware level?
An example would be good!
thanks

RanZilber
- 1,840
- 4
- 31
- 42
15
votes
3 answers
Why is this SIMD multiplication not faster than non-SIMD multiplication?
Let's assume that we have a function that multiplies two arrays of 1000000 doubles each. In C/C++ the function looks like this:
void mul_c(double* a, double* b)
{
    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];
    }
}
The…

fighting_falcon93
- 405
- 5
- 14
15
votes
1 answer
Is it possible to use SIMD instructions in Rust?
In C/C++, you can use intrinsics for SIMD (such as AVX and AVX2) instructions. Is there a way to use SIMD in Rust?

pythonic
- 20,589
- 43
- 136
- 219
15
votes
3 answers
Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?
AVX2 doesn't have any integer multiplications with sources larger than 32-bit. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies, but nothing with 64-bit sources.
Let's say I need an unsigned multiply with inputs larger…

BeeOnRope
- 60,350
- 16
- 207
- 386
15
votes
2 answers
SIMD instructions for floating point equality comparison (with NaN == NaN)
Which instructions would be used for comparing two 128 bit vectors consisting of 4 * 32-bit floating point values?
Is there an instruction that considers a NaN value on both sides as equal? If not, how big would the performance impact of a…

CodesInChaos
- 106,488
- 23
- 218
- 262
15
votes
1 answer
RyuJIT not making full use of SIMD intrinsics
I'm running some C# code that uses System.Numerics.Vector but as far as I can tell I'm not getting the full benefit of SIMD intrinsics. I'm using Visual Studio Community 2015 with Update 1, and my clrjit.dll is v4.6.1063.1.
I'm running on an…

eoinmullan
- 1,157
- 1
- 9
- 32
15
votes
3 answers
How to use the multiply and accumulate intrinsics in ARM Cortex-a8?
How do I use the multiply-accumulate intrinsics provided by GCC?
float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);
Can anyone explain what three parameters I have to pass to this function? I mean the source and destination registers…

HaggarTheHorrible
- 7,083
- 20
- 70
- 81
15
votes
5 answers
Taking advantage of SSE and other CPU extensions
There are a couple of places in my code base where the same operation is repeated a very large number of times for a large data set. In some cases it's taking a considerable time to process these.
I believe that using SSE to implement these loops…

Fire Lancer
- 29,364
- 31
- 116
- 182
15
votes
1 answer
Beating or meeting OS X memset (and memset_pattern4)
My question is based on another SO question: Why does _mm_stream_ps produce L1/LL cache misses?
After reading it and being intrigued by it, I tried to replicate the results and see for myself which was faster: naive loop, unrolled naive loop,…

Aktau
- 1,847
- 21
- 30
15
votes
1 answer
Shift a __m128i of n bits
I have a __m128i variable and I need to shift its 128-bit value by n bits, i.e. the way _mm_srli_si128 and _mm_slli_si128 work, but on bits instead of bytes. What is the most efficient way of doing this?

Filippo Bistaffa
- 551
- 3
- 16
15
votes
3 answers
Load address calculation when using AVX2 gather instructions
Looking at the AVX2 intrinsics documentation there are gathered load instructions such as VPGATHERDD:
__m128i _mm_i32gather_epi32 (int const * base, __m128i index, const int scale);
What isn't clear to me from the documentation is whether the…

Paul R
- 208,748
- 37
- 389
- 560
15
votes
2 answers
SIMD math libraries for SSE and AVX
I am looking for SIMD math libraries (preferably open source) for SSE and AVX. For example, if I have an AVX register v with 8 float values, I want sin(v) to return the sine of all eight values at once.
AMD has a proprietary library, LibM…
user2088790
15
votes
3 answers
Sum reduction of unsigned bytes without overflow, using SSE2 on Intel
I am trying to find sum reduction of 32 elements (each 1 byte data) on an Intel i3 processor. I did this:
s = 0;
for (i = 0; i < 32; i++)
{
    s = s + a[i];
}
However, it's taking more time, since my application is a real-time application requiring…

gpuguy
- 4,607
- 17
- 67
- 125
14
votes
7 answers
Fastest way to compute distance squared
My code relies heavily on computing distances between two points in 3D space.
To avoid the expensive square root I use the squared distance throughout.
But still it takes up a major fraction of the computing time and I would like to replace my…

Pim Schellart
- 715
- 1
- 6
- 18
14
votes
9 answers
How to quickly count bits into separate bins in a series of ints on Sandy Bridge?
Update: Please read the code, it is NOT about counting bits in one int
Is it possible to improve performance of the following code with some clever assembler?
uint bit_counter[64];
void Count(uint64 bits) {
    bit_counter[0] += (bits >> 0) & 1;
…

Łukasz Lew
- 48,526
- 41
- 139
- 208