Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk, or vector, of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data generally needs to be in structure-of-arrays form and should be processed in long streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.
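As a rough illustration of the structure-of-arrays point above, here is a minimal sketch in plain C. The type and function names (`point_aos`, `points_soa`, `scale_x_aos`, `scale_x_soa`) are hypothetical and chosen only for this example. It uses no intrinsics; the assumption is that a unit-stride loop over one contiguous array is the shape a vectorizing compiler, or hand-written intrinsics, can usually handle most easily.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structures (AoS): the x, y, z of one point sit next to each other. */
struct point_aos {
    float x, y, z;
};

/* Structure-of-arrays (SoA): all x values are contiguous, as are y and z. */
struct points_soa {
    float *x;
    float *y;
    float *z;
    size_t n;
};

/* Scale every x in AoS form: consecutive x values are three floats apart,
   so a vector load would also pull in y and z values it does not need. */
void scale_x_aos(struct point_aos *p, size_t n, float k)
{
    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        p[i].x *= k;
}

/* Scale every x in SoA form: a unit-stride loop over one contiguous array,
   where each vector load would contain only x values. */
void scale_x_soa(struct points_soa *p, float k)
{
    assert(p != NULL);
    for (size_t i = 0; i < p->n; i++)
        p->x[i] *= k;
}
```

Both functions compute the same result; only the memory layout differs. Whether the SoA version is actually faster depends on the compiler, the target, and the data size, so it is worth measuring rather than assuming.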
Questions tagged [simd]
2540 questions
24
votes
2 answers
inlining failed in call to always_inline ‘_mm_mullo_epi32’: target specific option mismatch
I am trying to use CMake to compile a C program that uses SIMD intrinsics. When I try to compile it, I get two errors:
/usr/lib/gcc/x86_64-linux-gnu/5/include/smmintrin.h:326:1: error: inlining failed in call to always_inline ‘_mm_mullo_epi32’:…

Lawan subba
- 610
- 3
- 7
- 19
24
votes
2 answers
Do all 64-bit Intel architectures support SSSE3/SSE4.1/SSE4.2 instructions?
I searched the web and the Intel software manual, but I am unable to confirm whether all Intel 64 architectures support up to SSSE3, SSE4.1, SSE4.2, AVX, etc. I want to know this so that I can target the minimum supported SIMD instruction set in my program.…

Vikram Dattu
- 801
- 3
- 8
- 24
23
votes
3 answers
Fastest way to do horizontal vector sum with AVX instructions
I have a packed vector of four 64-bit floating-point values.
I would like to get the sum of the vector's elements.
With SSE (and using 32-bit floats) I could just do the following:
v_sum = _mm_hadd_ps(v_sum, v_sum);
v_sum = _mm_hadd_ps(v_sum,…

Luigi Castelli
- 676
- 2
- 6
- 13
23
votes
2 answers
Does browser JavaScript allow for SIMD or Vectorized operations?
I want to write applications in JavaScript that require a large amount of numerical computation. However, I'm very confused about the state of efficient linear-algebra-like computation in client-side JavaScript. There seem to be many approaches,…

Seanny123
- 8,776
- 13
- 68
- 124
22
votes
5 answers
Optimizing Array Compaction
Let's say I have an array
k = [1 2 0 0 5 4 0]
I can compute a mask as follows
m = k > 0 = [1 1 0 0 1 1 0]
Using only the mask m and the following operations
Shift left / right
And/Or
Add/Subtract/Multiply
I can compact k into the following
[1 2 5…

jameszhao00
- 7,213
- 15
- 62
- 112
22
votes
8 answers
C++ SSE SIMD framework
Does anyone know an open-source C++ x86 SIMD intrinsics library?
Intel supplies exactly what I need in their integrated performance primitives library, but I can't use that because of the copyrights all over the place.
EDIT
I already know the…
user283145
22
votes
1 answer
Fastest way to compute absolute value using SSE
I am aware of 3 methods, but as far as I know, only the first 2 are generally used:
Mask off the sign bit using andps or andnotps.
Pros: One fast instruction if the mask is already in a register, which makes it perfect for doing this many times in…

Kumputer
- 588
- 1
- 6
- 22
22
votes
6 answers
How to use the Intel AVX in Java?
How do I use the Intel AVX vector instruction set from Java? It's a simple question but the answer seems to be hard to find.

Albert Hendriks
- 1,979
- 3
- 25
- 45
22
votes
5 answers
Transpose an 8x8 float using AVX/AVX2
Transposing an 8x8 matrix can be achieved by making four 4x4 matrices and transposing each of them.
This is not what I'm going for.
In another question, one answer gave a solution that would only require 24 instructions for an 8x8 matrix. However,…

DavidS
- 1,660
- 1
- 12
- 26
22
votes
5 answers
How to combine two __m128 values to __m256?
I would like to combine two __m128 values to one __m256.
Something like this:
__m128 a = _mm_set_ps(1, 2, 3, 4);
__m128 b = _mm_set_ps(5, 6, 7, 8);
to something like:
__m256 c = { 1, 2, 3, 4, 5, 6, 7, 8 };
Are there any intrinsics that I can…

user1468756
- 331
- 2
- 8
22
votes
5 answers
SIMD prefix sum on Intel CPU
I need to implement a prefix sum algorithm and would need it to be as fast as possible.
Ex:
[3, 1, 7, 0, 4, 1, 6, 3]
should give:
[3, 4, 11, 11, 15, 16, 22, 25]
Is there a way to do this using SSE SIMD CPU instructions?
My first idea is to…

skyde
- 2,816
- 4
- 34
- 53
21
votes
1 answer
IntStream leads to array elements being wrongly set to 0 (JVM Bug, Java 11)
In the class P below, the method test seems to return identically false:
import java.util.function.IntPredicate;
import java.util.stream.IntStream;
public class P implements IntPredicate {
private final static int SIZE = 33;
@Override
…

p_i
- 313
- 1
- 4
21
votes
2 answers
Modern approach to making std::vector allocate aligned memory
The following question is related; however, the answers are old, and a comment from user Marc Glisse suggests that since C++17 there are new approaches to this problem that might not be adequately discussed.
I'm trying to get aligned memory working properly for…

Prunus Persica
- 1,173
- 9
- 27
21
votes
2 answers
Choice between aligned vs. unaligned x86 SIMD instructions
There are generally two types of SIMD instructions:
A. Ones that work with aligned memory addresses, which raise a general-protection (#GP) exception if the address is not aligned on the operand-size boundary:
movaps xmm0, xmmword ptr…

MikeF
- 1,021
- 9
- 29
21
votes
2 answers
How to vectorize with gcc?
The v4 series of the gcc compiler can automatically vectorize loops using the SIMD processor on some modern CPUs, such as the AMD Athlon or Intel Pentium/Core chips. How is this done?

casualcoder
- 4,770
- 6
- 29
- 35