Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include: x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To efficiently use SIMD instructions, data needs to be in structure-of-arrays form and should occur in longer streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.

2540 questions
24
votes
2 answers

inlining failed in call to always_inline ‘_mm_mullo_epi32’: target specific option mismatch

I am trying to compile a C program that uses SIMD intrinsics with CMake. When I try to compile it, I get two errors: /usr/lib/gcc/x86_64-linux-gnu/5/include/smmintrin.h:326:1: error: inlining failed in call to always_inline ‘_mm_mullo_epi32’:…
Lawan subba
  • 610
  • 3
  • 7
  • 19
24
votes
2 answers

Do all 64-bit Intel architectures support SSSE3/SSE4.1/SSE4.2 instructions?

I searched the web and the Intel software manual, but I am unable to confirm whether all Intel 64 architectures support up to SSSE3, SSE4.1, SSE4.2, AVX, etc. I want to know this so that I can use the minimum supported SIMD instruction set in my program.…
Vikram Dattu
  • 801
  • 3
  • 8
  • 24
23
votes
3 answers

Fastest way to do horizontal vector sum with AVX instructions

I have a packed vector of four 64-bit floating-point values. I would like to get the sum of the vector's elements. With SSE (and using 32-bit floats) I could just do the following: v_sum = _mm_hadd_ps(v_sum, v_sum); v_sum = _mm_hadd_ps(v_sum,…
Luigi Castelli
  • 676
  • 2
  • 6
  • 13
23
votes
2 answers

Does browser JavaScript allow for SIMD or Vectorized operations?

I want to write applications in JavaScript that require a large amount of numerical computation. However, I'm very confused about the state of efficient linear-algebra-like computation in client-side JavaScript. There seem to be many approaches,…
Seanny123
  • 8,776
  • 13
  • 68
  • 124
22
votes
5 answers

Optimizing Array Compaction

Let's say I have an array k = [1 2 0 0 5 4 0]. I can compute a mask as follows: m = k > 0 = [1 1 0 0 1 1 0]. Using only the mask m and the following operations (shift left/right, and/or, add/subtract/multiply), I can compact k into the following: [1 2 5…
jameszhao00
  • 7,213
  • 15
  • 62
  • 112
22
votes
8 answers

c++ SSE SIMD framework

Does anyone know an open-source C++ x86 SIMD intrinsics library? Intel supplies exactly what I need in their integrated performance primitives library, but I can't use that because of the copyrights all over the place. EDIT I already know the…
user283145
22
votes
1 answer

Fastest way to compute absolute value using SSE

I am aware of 3 methods, but as far as I know, only the first 2 are generally used: Mask off the sign bit using andps or andnotps. Pros: One fast instruction if the mask is already in a register, which makes it perfect for doing this many times in…
Kumputer
  • 588
  • 1
  • 6
  • 22
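The first method named in the excerpt above (masking off the sign bit) can be illustrated without any intrinsics. The following is only a scalar sketch of what `andps` with a `0x7FFFFFFF` mask does to each lane; the function name `abs_by_sign_mask` is hypothetical, and the code assumes `float` is a 32-bit IEEE-754 value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is IEEE-754 binary32, so bit 31 is the sign bit. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper: clears the sign bit of one float, which is what
 * an andps with a 0x7FFFFFFF mask does to every lane of a vector. */
float abs_by_sign_mask(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret without aliasing issues */
    bits &= UINT32_C(0x7FFFFFFF);     /* keep exponent and mantissa, drop sign */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

In a real SSE version the mask would sit in a register and a single instruction would handle four floats at once; this sketch only shows the bit-level idea.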
22
votes
6 answers

How to use the Intel AVX in Java?

How do I use the Intel AVX vector instruction set from Java? It's a simple question but the answer seems to be hard to find.
Albert Hendriks
  • 1,979
  • 3
  • 25
  • 45
22
votes
5 answers

Transpose an 8x8 float using AVX/AVX2

Transposing an 8x8 matrix can be achieved by making four 4x4 matrices, and transposing each of them. This is not what I'm going for. In another question, one answer gave a solution that would only require 24 instructions for an 8x8 matrix. However,…
DavidS
  • 1,660
  • 1
  • 12
  • 26
22
votes
5 answers

How to combine two __m128 values to __m256?

I would like to combine two __m128 values to one __m256. Something like this: __m128 a = _mm_set_ps(1, 2, 3, 4); __m128 b = _mm_set_ps(5, 6, 7, 8); to something like: __m256 c = { 1, 2, 3, 4, 5, 6, 7, 8 }; Are there any intrinsics that I can…
user1468756
  • 331
  • 2
  • 8
22
votes
5 answers

SIMD prefix sum on Intel cpu

I need to implement a prefix sum algorithm and need it to be as fast as possible. Example: [3, 1, 7, 0, 4, 1, 6, 3] should give: [3, 4, 11, 11, 15, 16, 22, 25]. Is there a way to do this using SSE SIMD CPU instructions? My first idea is to…
skyde
  • 2,816
  • 4
  • 34
  • 53
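The prefix sum in the excerpt above can be pinned down with a small portable C sketch. It is not taken from the question or its answers. `prefix_sum_scalar` is a plain reference loop, and `prefix_sum_block8` mimics, lane by lane, the shift-and-add (log-step) scheme that SIMD implementations commonly build on. Both function names are hypothetical, and the fixed block width of 8 is an assumption chosen to match the example input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scalar reference: out[i] = in[0] + ... + in[i]. */
void prefix_sum_scalar(const int *in, int *out, size_t n)
{
    assert(n == 0 || (in != NULL && out != NULL));
    int running = 0;
    for (size_t i = 0; i < n; ++i) {
        running += in[i];
        out[i] = running;
    }
}

/* Log-step inclusive scan over one 8-element block.
 * Each round adds a copy of the block shifted up by 1, 2, then 4 lanes
 * (zero-filled at the bottom), so 3 rounds cover all 8 lanes. The inner
 * loops stand in for one vector shift and one vector add each. */
void prefix_sum_block8(const int in[8], int out[8])
{
    int cur[8], shifted[8];
    memcpy(cur, in, sizeof cur);
    for (size_t shift = 1; shift < 8; shift *= 2) {
        for (size_t i = 0; i < 8; ++i)
            shifted[i] = (i >= shift) ? cur[i - shift] : 0;
        for (size_t i = 0; i < 8; ++i)
            cur[i] += shifted[i];
    }
    memcpy(out, cur, sizeof cur);
}
```

For the input [3, 1, 7, 0, 4, 1, 6, 3], both functions produce [3, 4, 11, 11, 15, 16, 22, 25], matching the expected result quoted in the question. The log-step form does more additions in total than the scalar loop, which is one reason a vectorized scan is not automatically faster.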
21
votes
1 answer

IntStream leads to array elements being wrongly set to 0 (JVM Bug, Java 11)

In the class P below, the method test seems to return identically false: import java.util.function.IntPredicate; import java.util.stream.IntStream; public class P implements IntPredicate { private final static int SIZE = 33; @Override …
p_i
  • 313
  • 1
  • 4
21
votes
2 answers

Modern approach to making std::vector allocate aligned memory

The following question is related; however, the answers are old, and a comment from user Marc Glisse suggests there are new approaches to this problem since C++17 that might not be adequately discussed. I'm trying to get aligned memory working properly for…
Prunus Persica
  • 1,173
  • 9
  • 27
21
votes
2 answers

Choice between aligned vs. unaligned x86 SIMD instructions

There are generally two types of SIMD instructions: A. Ones that work with aligned memory addresses and raise a general-protection (#GP) exception if the address is not aligned on the operand size boundary: movaps xmm0, xmmword ptr…
MikeF
  • 1,021
  • 9
  • 29
21
votes
2 answers

How to vectorize with gcc?

The v4 series of the gcc compiler can automatically vectorize loops using the SIMD processor on some modern CPUs, such as the AMD Athlon or Intel Pentium/Core chips. How is this done?