Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include x86 SSE/AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data usually needs to be laid out in structure-of-arrays form and processed in long, contiguous streams. Naively "SIMD-optimized" code often turns out to run slower than the original.
Questions tagged [simd]
2540 questions
1 vote · 0 answers
NumPy matrix multiplication is 20X slower than OpenCV's cvtColor
OpenCV converts BGR images to grayscale using the linear transformation Y = 0.299R + 0.587G + 0.114B, according to their documentation.
I tried to mimic it using NumPy, by multiplying the HxWx3 BGR matrix by the 3x1 vector of coefficients [0.114,…

SomethingSomething
- 11,491
- 17
- 68
- 126
1 vote · 0 answers
Measurement phenomena when sequentially running GPR, AVX2, and AVX-512 code
I'm measuring C/C++/intrinsics code execution on an Intel Core CPU (Rocket Lake) and observing non-obvious shifts in the measured values.
Two functions, f_gpr() (GPR-only instructions) and f_avx512() (AVX-512 instructions), run sequentially and are measured…

Akon
- 335
- 1
- 11
1 vote · 0 answers
Efficient way to expand a packed 32-bit array to 32 bytes
I've got a packed bit array stored as a 32-bit word. I'd like to expand it into an array of bytes, where each byte corresponds to one of the bits of the array. Here's an example to illustrate what I mean (showing only 8 bits for brevity):
int…

multitaskPro
- 569
- 4
- 14
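For the bit-expansion question above, one common AVX2 approach is broadcast, per-byte shuffle, AND with per-bit masks, then compare; the sketch below follows that pattern (the helper name and the 0/1-per-byte output convention are my own choices, and on AVX-512BW/VL a single masked move such as _mm256_maskz_mov_epi8 could replace it):

```cpp
#include <immintrin.h>
#include <cstdint>

// Expand each bit of a 32-bit word into one output byte (0x01 if the bit
// is set, 0x00 otherwise). Requires AVX2.
static inline __m256i expand_bits_to_bytes(uint32_t bits) {
    // Broadcast the word, then route the source byte that holds each output
    // lane's bit into that lane (byte 0 feeds output bytes 0..7, and so on).
    __m256i v = _mm256_set1_epi32((int)bits);
    const __m256i byte_select = _mm256_setr_epi8(
        0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
        2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3);
    v = _mm256_shuffle_epi8(v, byte_select);   // per-128-bit-lane shuffle

    // Isolate the bit each output byte represents; compare maps "set" to 0xFF.
    const __m256i bit_of_lane = _mm256_setr_epi8(
        1, 2, 4, 8, 16, 32, 64, (char)0x80, 1, 2, 4, 8, 16, 32, 64, (char)0x80,
        1, 2, 4, 8, 16, 32, 64, (char)0x80, 1, 2, 4, 8, 16, 32, 64, (char)0x80);
    v = _mm256_and_si256(v, bit_of_lane);
    v = _mm256_cmpeq_epi8(v, bit_of_lane);
    return _mm256_and_si256(v, _mm256_set1_epi8(1));   // 0xFF -> 0x01
}
```

Dropping the final AND leaves 0x00/0xFF bytes instead, which is often what a following blend or mask operation wants anyway.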
1 vote · 0 answers
C to VMIPS equivalent translation
I am wondering if the VMIPS code that I have written is equivalent to the C code snippet below. The target vector machine has length 64, and all variables are double-precision.
for (i = 0; i < 35; i=i+1) {
X[i] = A[i]*B[i]-C[i]*D[i];
Y[i] =…

maffffff
- 11
- 1
1 vote · 0 answers
SIMD (AVX2, AVX512) big integer library
I'm looking for a good SIMD (AVX2, AVX512) library with a C/C++ interface (C preferred) to process large arrays of signed and unsigned big integers (mainly 128, 256, and 512 bits wide).
SIMD parallelization, obviously, must work on the array level, not on a…

Akon
- 335
- 1
- 11
1 vote · 0 answers
How to cast __m128 to a union when returning
I want to return the result of _mm_add_ps(), but the return type should be a custom union that has a __m128 member inside.
I tested the performance of returning __m128 and a custom union. It seems that on MSVC this:
return _mm_add_ps(V1, V2);
is…

Zer0day
- 89
- 5
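For the __m128-in-a-union question above, a minimal sketch of the pattern being asked about (the union layout is my own guess at the asker's type): rather than casting, the union is built around its __m128 member and returned by value; whether that round-trips through memory or stays in a register is up to the compiler and calling convention, which is one plausible source of the MSVC difference.

```cpp
#include <xmmintrin.h>

// Hypothetical union; the question's actual type isn't shown in the excerpt.
union Vec4 {
    __m128 m;
    float  f[4];
};

// Construct the union from the intrinsic result and return it by value.
inline Vec4 add(__m128 a, __m128 b) {
    Vec4 r;
    r.m = _mm_add_ps(a, b);
    return r;
}

// Returning the raw __m128 instead lets the value stay in XMM0 under the
// usual x64 conventions, which may be why it benchmarks faster on MSVC.
inline __m128 add_raw(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);
}
```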
1 vote · 1 answer
Fastest way to search an array on an M1 Mac
I am trying to load an array of u16s from memory and find the first element that is less than some number, as fast as possible on an M1 Mac. I have been looking through the NEON instructions, but I wasn't able to find a good way to do it. There are…

Basic Block
- 729
- 9
- 17
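For the M1 search question above, a sketch of one workable NEON approach: compare a block of eight u16 lanes, narrow the 16-bit mask to bytes, and use count-trailing-zeros on the resulting 64-bit value to locate the first hit (the function name and the multiple-of-8 length assumption are mine):

```cpp
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Return the index of the first element < limit, or n if there is none.
// Assumes n is a multiple of 8; a scalar tail loop is needed otherwise.
size_t find_first_less(const uint16_t* data, size_t n, uint16_t limit) {
    const uint16x8_t vlimit = vdupq_n_u16(limit);
    for (size_t i = 0; i < n; i += 8) {
        uint16x8_t v   = vld1q_u16(data + i);
        uint16x8_t cmp = vcltq_u16(v, vlimit);            // 0xFFFF per hit
        uint8x8_t  nar = vmovn_u16(cmp);                  // 0xFF per hit
        uint64_t  mask = vget_lane_u64(vreinterpret_u64_u8(nar), 0);
        if (mask)
            return i + (__builtin_ctzll(mask) >> 3);      // first 0xFF byte
    }
    return n;
}
```

Unrolling to 32 or 64 elements per iteration and only doing the narrow/ctz work after a cheap vmaxvq_u16 "any hit?" check is a common next step on Apple CPUs.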
1 vote · 0 answers
Is there any difference when using AVX2 on the stack and the heap?
In this example I'm adding two arrays using AVX2. If I declare the arrays on the stack, it all works as expected. However, when the memory is allocated on the heap, it compiles but crashes with a segmentation fault at runtime.
Compilation succeeds but…

orientnab
- 68
- 6
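For the stack-vs-heap question above, the usual cause of this exact symptom is alignment: _mm256_load_ps/_mm256_load_si256 require 32-byte alignment, which stack arrays often happen to satisfy while plain new/malloc storage may not. A sketch assuming float arrays (the array names are mine):

```cpp
#include <immintrin.h>
#include <cstdlib>

int main() {
    const std::size_t n = 1024;

    // Plain new/malloc typically guarantees only 16-byte alignment, so the
    // aligned 256-bit loads can fault. Request 32-byte aligned storage ...
    float* a = static_cast<float*>(std::aligned_alloc(32, n * sizeof(float)));
    float* b = static_cast<float*>(std::aligned_alloc(32, n * sizeof(float)));
    float* c = static_cast<float*>(std::aligned_alloc(32, n * sizeof(float)));
    for (std::size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);   // fine: 32-byte aligned pointers
        __m256 vb = _mm256_load_ps(b + i);
        _mm256_store_ps(c + i, _mm256_add_ps(va, vb));
    }
    // ... or keep ordinary allocation and switch to _mm256_loadu_ps /
    // _mm256_storeu_ps, which tolerate any alignment.

    std::free(a); std::free(b); std::free(c);
    return 0;
}
```

std::aligned_alloc is C++17 and not available on MSVC, where _aligned_malloc or _mm_malloc serve the same purpose.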
1 vote · 1 answer
Is `-ftree-loop-vectorize` not enabled by `-O2` in GCC v12?
Example: https://www.godbolt.org/z/ahfcaj7W8
From https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Optimize-Options.html
It says
-ftree-loop-vectorize
Perform loop vectorization on trees. This flag is enabled by default at -O2 and by…

colinfang
- 20,909
- 19
- 90
- 173
1 vote · 1 answer
ARM SVE: svld1(mask, ptr) vs svldff1(svptrue<>, ptr)
In ARM SVE there are masked load instructions, svld1, and there are also first-faulting loads,
svldff1(svptrue<>).
Questions:
Does it make sense to do svld1 with a mask as opposed to svldff1?
The behaviour of the mask in svldff1 seems confusing. Is there a…

Denis Yaroshevskiy
- 1,218
- 11
- 24
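For the SVE question above, a small sketch contrasting the two load styles (ACLE intrinsics, uint8_t data; the helper names are mine): svld1 takes a governing predicate that is known before the load, while svldff1 discovers how far it could safely read and reports that through the FFR register.

```cpp
#include <arm_sve.h>
#include <cstdint>

// svld1: the mask is decided up front (e.g. a loop tail); inactive lanes
// never access memory.
svuint8_t load_tail(const uint8_t* ptr, uint64_t i, uint64_t n) {
    svbool_t pg = svwhilelt_b8_u64(i, n);    // lanes i..n-1 active
    return svld1_u8(pg, ptr + i);
}

// svldff1: load speculatively with an all-true predicate; lanes after the
// first faulting element are simply not loaded, and svrdffr() tells you
// which lanes are valid.
svuint8_t load_speculative(const uint8_t* ptr, svbool_t* valid) {
    svsetffr();                              // reset the first-fault register
    svuint8_t v = svldff1_u8(svptrue_b8(), ptr);
    *valid = svrdffr();
    return v;
}
```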
1 vote · 1 answer
In SIMD/SSE2, many instructions are named "_mm_set_epi8", "_mm_cmpgt_epi8", and so on; what do "mm" and "epi" mean?
I see many instructions with shorthand such as "_mm_and_si128". I want to know what the "mm" means.

dongwang
- 13
- 2
1 vote · 1 answer
Is there any performance difference between AVX-512 `_mm512_load_epi64` and `_mm512_loadu_epi64`?
The motivation for this question
The unaligned load is generally the more common choice. The developer should use the aligned SIMD load when the address is already aligned. So I started to wonder whether there are performance differences between these…

Jigao Luo
- 127
- 1
- 6
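For the aligned/unaligned load question above, the semantic difference is easy to show even before measuring anything (the buffer name is mine): the aligned form faults on a misaligned address, the unaligned form accepts any address, and when the data really is 64-byte aligned the two typically perform the same.

```cpp
#include <immintrin.h>
#include <cstdint>

alignas(64) static int64_t buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};

int main() {
    __m512i a = _mm512_load_epi64(buf);    // requires a 64-byte aligned address
    __m512i b = _mm512_loadu_epi64(buf);   // accepts any address

    __m512i s = _mm512_add_epi64(a, b);
    _mm512_storeu_epi64(buf, s);
    return 0;
}
```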
1 vote · 1 answer
Convert vector compare mask into bit mask in AArch64 SIMD or ARM NEON?
Let's take the example of "ABAA". I can use result = vceqq_u8(input, vdupq_n_u8('A')) to get FF 00 FF FF (or 0xFFFF00FF).
Sometimes I only need to know the first match, other times I want to know all. From the result register is there a way I can get…

Stan
- 161
- 8
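For the mask-extraction question above, a sketch of the narrowing-shift trick commonly used on AArch64 (helper names are mine): vshrn_n_u16 by 4 packs the 16 compare bytes into a 64-bit value with 4 bits per lane, which a single count-trailing-zeros then turns into the index of the first match.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Collapse a 16-byte compare result (0xFF/0x00 per lane) into a 64-bit
// value carrying 4 bits per lane.
inline uint64_t neon_movemask_nibbles(uint8x16_t cmp) {
    uint8x8_t nib = vshrn_n_u16(vreinterpretq_u16_u8(cmp), 4);
    return vget_lane_u64(vreinterpret_u64_u8(nib), 0);
}

// Index of the first byte equal to `needle`, or -1 if none matches.
inline int first_match_index(uint8x16_t input, uint8_t needle) {
    uint8x16_t cmp  = vceqq_u8(input, vdupq_n_u8(needle));
    uint64_t   mask = neon_movemask_nibbles(cmp);
    return mask ? (int)(__builtin_ctzll(mask) >> 2) : -1;
}
```

For the "all matches" case, the same 64-bit value can be walked 4 bits at a time, clearing each nibble after it is consumed.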
1 vote · 0 answers
Eigen3: How to verify whether AVX2 or AVX512F code is being generated?
I am developing a program that involves a lot of low-latency, hard real-time matrix operations. I am using the Eigen 3 library for this.
I wish to use AVX-512F SIMD vectorization in production for performance acceleration.
I am currently…

Dark Sorrow
- 1,681
- 14
- 37
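For the Eigen question above, two checks that don't require reading disassembly, sketched below (assuming Eigen 3.4): Eigen::SimdInstructionSetsInUse() reports the packet types compiled in, and the EIGEN_VECTORIZE_* macros can be tested at compile time. Grepping the generated assembly for zmm/ymm registers remains the definitive check.

```cpp
#include <Eigen/Core>
#include <cstdio>

int main() {
    // Runtime report of the instruction sets this translation unit was
    // built to use (depends on -march/-mavx512f etc. at compile time).
    std::printf("Eigen SIMD in use: %s\n", Eigen::SimdInstructionSetsInUse());

#if defined(EIGEN_VECTORIZE_AVX512)
    std::printf("AVX-512 packets enabled\n");
#elif defined(EIGEN_VECTORIZE_AVX2)
    std::printf("AVX2 packets enabled\n");
#endif
    return 0;
}
```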
1 vote · 0 answers
Why are alternate elements of the output vector of the _mm256_mul_epi32 AVX intrinsic zero?
I am learning SIMD instructions. I tried to implement a vector dot product using AVX intrinsics, but to my astonishment I found that alternate elements of the 256-bit result vector are zeros.
I tried to write a short piece of code reproducing the issue. I…

Abhishek Ghosh
- 597
- 7
- 18
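For the _mm256_mul_epi32 question above, the zeros are expected: that intrinsic reads only the even 32-bit elements (0, 2, 4, 6) of each operand and widens each product to 64 bits, so reinterpreting the result as eight 32-bit lanes shows the upper halves of small products as zero. A sketch contrasting it with _mm256_mullo_epi32, which is usually what a 32-bit dot product wants (variable names are mine):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    __m256i a = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
    __m256i b = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);

    // Four 64-bit products of elements 0,2,4,6 -> 1, 9, 25, 49.
    __m256i wide = _mm256_mul_epi32(a, b);
    // Eight 32-bit products, low halves kept -> 1, 4, 9, ..., 64.
    __m256i lo   = _mm256_mullo_epi32(a, b);

    alignas(32) int32_t w[8], l[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(w), wide);
    _mm256_store_si256(reinterpret_cast<__m256i*>(l), lo);
    for (int i = 0; i < 8; ++i)
        std::printf("mul_epi32 lane %d = %d, mullo_epi32 lane %d = %d\n",
                    i, w[i], i, l[i]);
    return 0;
}
```

Viewed as 32-bit lanes, the mul_epi32 result prints 1, 0, 9, 0, 25, 0, 49, 0, which is exactly the "alternate elements are zero" pattern described in the question.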