Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk, or vector, of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data needs to be laid out in structure-of-arrays form and processed in long streams. Naively "SIMD-optimized" code frequently surprises by running slower than the original.
Questions tagged [simd]
2540 questions
14
votes
3 answers
How can I disable vectorization while using GCC?
I am compiling my code with the following command:
gcc -O3 -ftree-vectorizer-verbose=6 -msse4.1 -ffast-math
With this, all optimizations are enabled, but I want to disable vectorization while keeping the other optimizations.

PhantomM
- 825
- 6
- 17
- 34
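A sketch of the usual answer, assuming the file being compiled is a placeholder `foo.c`: GCC's `-fno-tree-vectorize` turns off the loop vectorizer and `-fno-tree-slp-vectorize` the straight-line (SLP) vectorizer, while the rest of `-O3` stays in effect.

```shell
# Keep -O3 but switch off both vectorizers (foo.c is a placeholder):
gcc -O3 -fno-tree-vectorize -fno-tree-slp-vectorize -msse4.1 -ffast-math -c foo.c
```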
14
votes
5 answers
Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2
(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probably better.
Also, an AVX2 version of a similar…

pktCoder
- 1,105
- 2
- 15
- 32
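As a reference point for what the question is asking, here is a minimal scalar sketch (hypothetical helper name `count_bit_positions`): for each bit position k, count how many of the 64-bit masks have bit k set. SIMD answers effectively transpose these loops, but must match this contract.

```c
#include <stdint.h>
#include <stddef.h>

/* For each bit position k in 0..63, count how many of the n masks
 * have bit k set. Scalar reference; SIMD versions must agree. */
static void count_bit_positions(const uint64_t *masks, size_t n,
                                uint32_t counts[64])
{
    for (int k = 0; k < 64; k++)
        counts[k] = 0;
    for (size_t i = 0; i < n; i++)
        for (int k = 0; k < 64; k++)
            counts[k] += (uint32_t)((masks[i] >> k) & 1);
}
```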
14
votes
2 answers
Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?
I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, hence I would use the intrinsic _mm256_loadu_pd; the code I've written is:
__m256d d1 = _mm256_loadu_pd(vInOut + i*4);
I've…

Emanuele
- 1,408
- 1
- 15
- 39
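The unaligned-load idiom itself can be sketched with SSE2's `_mm_loadu_pd`, which compiles without extra flags on x86-64 (`_mm256_loadu_pd` is the four-double AVX analogue); `sum2` is a hypothetical helper. Note that compilers may legally fold an unaligned load into a memory operand of a later instruction instead of emitting a standalone `vmovupd`, which is one reason the question's expectation can fail.

```c
#include <emmintrin.h>  /* SSE2 */

/* Load two doubles from memory with no alignment requirement and
 * return their sum. */
static double sum2(const double *p)
{
    __m128d v  = _mm_loadu_pd(p);          /* unaligned load        */
    __m128d hi = _mm_unpackhi_pd(v, v);    /* broadcast high lane   */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));
}
```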
14
votes
2 answers
Constexpr and SSE intrinsics
Most C++ compilers support SIMD (SSE/AVX) instructions with intrinsics like
_mm_cmpeq_epi32
My problem with this is that this function is not marked as constexpr, although "semantically" there is no reason for this function to not be constexpr since…

NoSenseEtAl
- 28,205
- 28
- 128
- 277
14
votes
5 answers
Fastest Implementation of Exponential Function Using AVX
I'm looking for an efficient (Fast) approximation of the exponential function operating on AVX elements (Single Precision Floating Point). Namely - __m256 _mm256_exp_ps( __m256 x ) without SVML.
Relative Accuracy should be something like ~1e-6, or…

Royi
- 4,640
- 6
- 46
- 64
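A scalar sketch of the standard vectorizable recipe (hypothetical helper `exp_approx`; no overflow/underflow handling): range-reduce x = n·ln 2 + r with |r| ≤ ln 2 / 2, evaluate a short polynomial for exp(r), then scale by 2^n by constructing the float exponent bits directly, which is exactly how AVX versions build 2^n without SVML.

```c
#include <stdint.h>
#include <string.h>

/* exp(x) ~ 2^n * P(r), where x = n*ln2 + r and P is a 6th-order
 * Taylor polynomial. Accurate to roughly 1e-6 relative error for
 * moderate x; no special-case handling for overflow/underflow. */
static float exp_approx(float x)
{
    const float LOG2E = 1.44269504f;   /* 1/ln(2) */
    const float LN2   = 0.69314718f;
    int   n = (int)(x * LOG2E + (x >= 0.0f ? 0.5f : -0.5f)); /* round */
    float r = x - (float)n * LN2;      /* |r| <= ln(2)/2 */
    /* Taylor polynomial for exp(r), Horner form */
    float p = 1.0f + r * (1.0f + r * (0.5f + r * (1.0f/6 + r * (1.0f/24
              + r * (1.0f/120 + r * (1.0f/720))))));
    uint32_t bits = (uint32_t)(n + 127) << 23;  /* 2^n as float bits */
    float scale;
    memcpy(&scale, &bits, sizeof scale);
    return p * scale;
}
```

The same exponent-bit construction maps directly to `_mm256_cvtps_epi32`, a shift, and a reinterpret cast in the AVX version.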
14
votes
2 answers
Why does the FMA _mm256_fmadd_pd() intrinsic have 3 asm mnemonics, "vfmadd132pd", "231" and "213"?
Could someone explain to me why there are 3 variants of the fused multiply-accumulate instruction: vfmadd132pd, vfmadd231pd and vfmadd213pd, while there is only one C intrinsics _mm256_fmadd_pd?
To make things simple, what is the difference between…

Zheyuan Li
- 71,365
- 17
- 180
- 248
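The short version of the usual answer: all three forms compute a*b + c with a single rounding; the digits only say which source operand the destination register overwrites (132: dst = dst*src3 + src2, 213: dst = src2*dst + src3, 231: dst = src2*src3 + dst), so the compiler can keep whichever value must survive, typically the accumulator, in its register. A sketch of the accumulator pattern where 231 gets picked:

```c
/* In a dot-product loop the accumulator must survive each
 * iteration, so the compiler emits vfmadd231 with acc as the
 * destination. With FP contraction enabled, the expression below
 * compiles to a single fused multiply-add per element. */
static double dot(const double *x, const double *y, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc = x[i] * y[i] + acc;
    return acc;
}
```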
14
votes
4 answers
Fast 24-bit array -> 32-bit array conversion?
Quick Summary:
I have an array of 24-bit values. Any suggestion on how to quickly expand the individual 24-bit array elements into 32-bit elements?
Details:
I'm processing incoming video frames in realtime using Pixel Shaders in DirectX 10. A…

Clippy
- 354
- 3
- 10
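A scalar reference for the task (hypothetical helper `expand24to32`; little-endian byte order assumed): each packed 24-bit element becomes a zero-extended 32-bit one. SIMD versions typically achieve the same with byte shuffles over 16-byte chunks.

```c
#include <stdint.h>
#include <stddef.h>

/* Expand `count` packed little-endian 24-bit values at src into
 * zero-extended 32-bit values at dst. */
static void expand24to32(const uint8_t *src, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = (uint32_t)src[3*i]
               | ((uint32_t)src[3*i + 1] << 8)
               | ((uint32_t)src[3*i + 2] << 16);
    }
}
```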
14
votes
2 answers
SIMD latency throughput
On the Intel Intrinsics Guide, most instructions also list a value for both latency and throughput. Example:
__m128i _mm_min_epi32
Performance
Architecture Latency Throughput
Haswell 1 0.5
Ivy Bridge 1 0.5
Sandy Bridge 1 …

Alexandros
- 2,160
- 4
- 27
- 52
14
votes
2 answers
Are GPU/CUDA cores SIMD ones?
Let's take the nVidia Fermi Compute Architecture. It says:
The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The…

Marc Andreson
- 3,405
- 5
- 35
- 51
14
votes
3 answers
Find index of maximum element in x86 SIMD vector
I'm thinking of implementing 8-ary heapsort for uint32_t's. To do this I need a function that selects the index of the maximum element in an 8-element vector so that I can compare it with the parent element and conditionally perform a swap and further siftDown…

Wibowit
- 371
- 2
- 11
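The scalar contract such a function must satisfy, as a sketch (hypothetical `argmax8_u32`; ties resolve to the lowest index, which a compare/movemask/count-trailing-zeros SIMD version reproduces naturally):

```c
#include <stdint.h>

/* Index of the maximum of 8 unsigned 32-bit values; on ties the
 * lowest index wins. */
static int argmax8_u32(const uint32_t v[8])
{
    int best = 0;
    for (int i = 1; i < 8; i++)
        if (v[i] > v[best])
            best = i;
    return best;
}
```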
14
votes
6 answers
SIMD or not SIMD - cross platform
I need ideas on how to write a C++ cross-platform implementation of a few parallelizable problems so I can take advantage of SIMD (SSE, SPU, etc.) if available. I also want to be able to switch at run time between SIMD and not…

Aleks
- 1,177
- 10
- 21
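One common shape for the run-time switch is a function pointer resolved on first call. This sketch uses a placeholder probe `cpu_has_simd` and a scalar stand-in for the vector body; a real build would substitute an actual detection call such as GCC/Clang's `__builtin_cpu_supports("sse4.1")` on x86, and a genuinely vectorized implementation.

```c
#include <stddef.h>

typedef int (*sum_fn)(const int *, size_t);

static int sum_scalar(const int *a, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Stand-in: a real build provides a vectorized body here. */
static int sum_simd(const int *a, size_t n) { return sum_scalar(a, n); }

/* Placeholder probe: replace with a real CPU-feature check, e.g.
 * __builtin_cpu_supports("sse4.1") with GCC/Clang on x86. */
static int cpu_has_simd(void) { return 0; }

static int sum_dispatch(const int *a, size_t n)
{
    static sum_fn fn;                 /* resolved once, reused after */
    if (!fn)
        fn = cpu_has_simd() ? sum_simd : sum_scalar;
    return fn(a, n);
}
```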
14
votes
1 answer
Using __m256d registers
How do you use __m256d?
Say I want to use the Intel AVX instruction _mm256_add_pd on a simple Vector3 class with three 64-bit double-precision components (x, y, and z). What is the correct way to use this?
Since x, y and z are members of the Vector3…

bobobobo
- 64,917
- 62
- 258
- 363
14
votes
1 answer
SIMD the following code
How do I SIMDize the following code in C (using SIMD intrinsics, of course)? I am having trouble understanding SIMD intrinsics and this would help a lot:
int sum_naive( int n, int *a )
{
int sum = 0;
for( int i = 0; i < n; i++ )
sum…

user1585869
- 301
- 3
- 11
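A sketch of how sum_naive vectorizes with SSE2 (baseline on x86-64; `sum_sse2` is a hypothetical name): accumulate four ints per iteration in a vector register, reduce the vector at the end, and handle leftover elements with a scalar tail loop.

```c
#include <emmintrin.h>  /* SSE2 */

static int sum_sse2(int n, const int *a)
{
    __m128i acc = _mm_setzero_si128();
    int i = 0;
    /* Four lanes of partial sums. */
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_epi32(acc,
                            _mm_loadu_si128((const __m128i *)(a + i)));
    /* Horizontal reduction of the accumulator. */
    int tmp[4];
    _mm_storeu_si128((__m128i *)tmp, acc);
    int sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    /* Scalar tail for n not divisible by 4. */
    for (; i < n; i++)
        sum += a[i];
    return sum;
}
```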
14
votes
1 answer
Look-Up Table using SIMD
I have a big pixel processing function which I am currently trying to optimize using intrinsic functions.
Being an SSE novice, I am not sure how to tackle the part of the code which involves lookup tables.
Basically, I am trying to vectorize the…

Rotem
- 21,452
- 6
- 62
- 109
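For tables of up to 16 bytes, the usual tool is SSSE3's `_mm_shuffle_epi8`, which acts as 16 parallel byte lookups: each output byte is table[index & 0x0F], or 0 if the index byte's high bit is set. A sketch (hypothetical `lut16`; the GCC/Clang `target` attribute lets it compile without `-mssse3`, and callers are assumed to run on an SSSE3-capable CPU):

```c
#include <tmmintrin.h>  /* SSSE3 */
#include <stdint.h>

/* 16 table lookups at once: out[j] = table[idx[j]] for idx in 0..15. */
__attribute__((target("ssse3")))
static void lut16(const uint8_t table[16], const uint8_t idx[16],
                  uint8_t out[16])
{
    __m128i t = _mm_loadu_si128((const __m128i *)table);
    __m128i i = _mm_loadu_si128((const __m128i *)idx);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(t, i));
}
```

Larger tables need several shuffles with blending, or AVX2 gathers.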
13
votes
3 answers
SSE multiplication 16 x uint8_t
I want to multiply with SSE4 a __m128i object holding 16 unsigned 8-bit integers, but I could only find an intrinsic for multiplying 16-bit integers. Is there nothing such as _mm_mult_epi8?

Roby
- 2,011
- 4
- 28
- 55
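There is indeed no 8-bit multiply in SSE. The standard workaround, sketched here as a hypothetical `mul_u8`, widens each half of the vector to 16 bits, multiplies with SSE2's `_mm_mullo_epi16`, and packs the low bytes back, i.e. multiplication modulo 256:

```c
#include <emmintrin.h>  /* SSE2 */

/* Lane-wise product of 16 unsigned bytes, keeping the low 8 bits
 * of each result (product mod 256). */
static __m128i mul_u8(__m128i a, __m128i b)
{
    __m128i zero = _mm_setzero_si128();
    /* Zero-extend each half to 16-bit lanes. */
    __m128i a_lo = _mm_unpacklo_epi8(a, zero);
    __m128i b_lo = _mm_unpacklo_epi8(b, zero);
    __m128i a_hi = _mm_unpackhi_epi8(a, zero);
    __m128i b_hi = _mm_unpackhi_epi8(b, zero);
    __m128i lo = _mm_mullo_epi16(a_lo, b_lo);
    __m128i hi = _mm_mullo_epi16(a_hi, b_hi);
    /* Keep only the low byte of every product, then repack. */
    __m128i mask = _mm_set1_epi16(0x00FF);
    return _mm_packus_epi16(_mm_and_si128(lo, mask),
                            _mm_and_si128(hi, mask));
}
```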