Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk, or vector, of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data should be laid out in structure-of-arrays form and processed in long, contiguous streams. Naively "SIMD-optimized" code frequently surprises by running slower than the original.
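
As a rough illustration of the structure-of-arrays point, the sketch below contrasts the two layouts in plain C. The type and function names (`PointAoS`, `PointsSoA`, `scale_x_aos`, `scale_x_soa`) are made up for this example and do not come from any library. Whether a compiler actually vectorizes either loop depends on the compiler, target, and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structures (hypothetical type): the x, y, z of one point
   sit next to each other, so consecutive x values are 3 floats apart. */
typedef struct { float x, y, z; } PointAoS;

/* Structure-of-arrays (hypothetical type): all x values are contiguous,
   so one vector load can fetch several consecutive x values. */
typedef struct { float *x, *y, *z; size_t n; } PointsSoA;

/* Scales only the x component; the accesses are strided in memory. */
void scale_x_aos(PointAoS *p, size_t n, float k)
{
    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        p[i].x *= k;
}

/* Scales only the x component; the accesses are unit-stride, which is
   the pattern vector loads and stores handle best. */
void scale_x_soa(PointsSoA *p, float k)
{
    assert(p != NULL && (p->x != NULL || p->n == 0));
    for (size_t i = 0; i < p->n; ++i)
        p->x[i] *= k;
}
```

Both functions compute the same result; the difference is only the memory layout the loop walks over, which is what tends to decide whether vector instructions pay off.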

2540 questions

1 answer

What is the difference between SPMD and SIMD?

I just can't understand what the difference between them is. Is SPMD at the programming level and SIMD at the hardware level? An example would be good. Thanks.

terminology parallel-processing simd

asked Feb 16 '11 at 08:47

RanZilber

1,840
4
31
42

3 answers

Why is this SIMD multiplication not faster than non-SIMD multiplication?

Let's assume that we have a function that multiplies two arrays of 1000000 doubles each. In C/C++ the function looks like this: void mul_c(double* a, double* b) { for (int i = 0; i != 1000000; ++i) { a[i] = a[i] * b[i]; } } The…

c++ performance assembly x86 simd

asked Mar 22 '17 at 23:56

fighting_falcon93

1 answer

Is it possible to use SIMD instructions in Rust?

In C/C++, you can use intrinsics for SIMD (such as AVX and AVX2) instructions. Is there a way to use SIMD in Rust?

rust simd avx avx2

asked Mar 21 '17 at 21:58

pythonic

20,589
43
136
219

3 answers

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

AVX2 doesn't have any integer multiplications with sources larger than 32-bit. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies, but nothing with 64-bit sources. Let's say I need an unsigned multiply with inputs larger…

floating-point x86 simd avx2 fma

asked Dec 30 '16 at 22:54

BeeOnRope

60,350
16
207
386

2 answers

SIMD instructions for floating point equality comparison (with NaN == NaN)

Which instructions would be used for comparing two 128-bit vectors consisting of 4 * 32-bit floating-point values? Is there an instruction that considers a NaN value on both sides as equal? If not, how big would the performance impact of a…

assembly floating-point x86 x86-64 simd

asked Jan 22 '16 at 16:41

CodesInChaos

106,488
23
218
262

1 answer

RyuJIT not making full use of SIMD intrinsics

I'm running some C# code that uses System.Numerics.Vector but as far as I can tell I'm not getting the full benefit of SIMD intrinsics. I'm using Visual Studio Community 2015 with Update 1, and my clrjit.dll is v4.6.1063.1. I'm running on an…

c# sse simd avx ryujit

asked Jan 20 '16 at 10:14

eoinmullan

1,157
1
9
32

3 answers

How to use the multiply and accumulate intrinsics in ARM Cortex-a8?

How do I use the multiply-accumulate intrinsics provided by GCC? float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t); Can anyone explain which three parameters I have to pass to this function? I mean the source and destination registers…

c arm simd intrinsics neon

asked Jul 13 '10 at 18:56

HaggarTheHorrible

7,083
20
70
81

5 answers

Taking advantage of SSE and other CPU extensions

There are a couple of places in my code base where the same operation is repeated a very large number of times for a large data set. In some cases it's taking a considerable time to process these. I believe that using SSE to implement these loops…

c++ gcc cross-platform visual-c++ simd

asked Dec 12 '09 at 19:30

Fire Lancer

29,364
31
116
182

1 answer

Beating or meeting OS X memset (and memset_pattern4)

My question is based on another SO question: Why does _mm_stream_ps produce L1/LL cache misses? After reading it and being intrigued by it, I tried to replicate the results and see for myself which was faster: naive loop, unrolled naive loop,…

c performance optimization assembly simd

asked Sep 16 '13 at 08:27

Aktau

1,847
21
30

1 answer

Shift a __m128i by n bits

I have a __m128i variable and I need to shift its 128-bit value by n bits, i.e. the way _mm_srli_si128 and _mm_slli_si128 work, but on bits instead of bytes. What is the most efficient way of doing this?

c x86 sse simd sse2

asked Jul 12 '13 at 08:29

Filippo Bistaffa

3 answers

Load address calculation when using AVX2 gather instructions

Looking at the AVX2 intrinsics documentation there are gathered load instructions such as VPGATHERDD: __m128i _mm_i32gather_epi32 (int const * base, __m128i index, const int scale); What isn't clear to me from the documentation is whether the…

x86 sse simd avx2

asked Apr 24 '13 at 13:34

Paul R

208,748
37
389
560

2 answers

SIMD math libraries for SSE and AVX

I am looking for SIMD math libraries (preferably open source) for SSE and AVX. For example, if I have an AVX register v with 8 float values, I want sin(v) to return the sine of all eight values at once. AMD has a proprietary library, LibM…

sse simd avx math.h

asked Mar 30 '13 at 22:04

user2088790

3 answers

Sum reduction of unsigned bytes without overflow, using SSE2 on Intel

I am trying to compute the sum reduction of 32 elements (each 1 byte of data) on an Intel i3 processor. I did this: s=0; for (i=0; i<32; i++) { s = s + a[i]; } However, it's taking more time, since my application is a real-time application requiring…

x86 sse simd sse2 sse3

asked Jun 07 '12 at 13:13

gpuguy

4,607
17
67
125

7 answers

Fastest way to compute distance squared

My code relies heavily on computing distances between two points in 3D space. To avoid the expensive square root, I use the squared distance throughout. But it still takes up a major fraction of the computing time, and I would like to replace my…

c optimization simd

asked Nov 10 '11 at 10:19

Pim Schellart

9 answers

How to quickly count bits into separate bins in a series of ints on Sandy Bridge?

Update: Please read the code; it is NOT about counting bits in one int. Is it possible to improve the performance of the following code with some clever assembler? uint bit_counter[64]; void Count(uint64 bits) { bit_counter[0] += (bits >> 0) & 1; …

c++ assembly x86 simd avx

asked Oct 17 '11 at 12:51

Łukasz Lew

48,526
41
139
208
