Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include x86 SSE/AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data usually needs to be laid out in structure-of-arrays form and processed in long, contiguous streams. Naively "SIMD-optimized" code often turns out to run slower than the original.
Questions tagged [simd]
2540 questions
1 vote · 0 answers
NumPy matrix multiplication is 20X slower than OpenCV's cvtColor
OpenCV converts BGR images to grayscale using the linear transformation Y = 0.299R + 0.587G + 0.114B, according to their documentation.
I tried to mimic it using NumPy, by multiplying the HxWx3 BGR matrix by the 3x1 vector of coefficients [0.114,…

SomethingSomething
- 11,491
- 17
- 68
- 126
1 vote · 0 answers
Measurement phenomena when sequentially running GPR, AVX2, and AVX-512 code
I'm measuring C/C++/intrinsics code execution on an Intel Core CPU (Rocket Lake) and observing non-obvious shifts in the measured values.
Two functions, f_gpr() (GPR-only instructions) and f_avx512() (AVX-512 instructions), run sequentially and are measured…

Akon
- 335
- 1
- 11
1 vote · 0 answers
Efficient way to expand a packed 32-bit array to 32 bytes
I've got a packed bit array stored as a 32-bit word. I'd like to expand it into an array of bytes, where each byte corresponds to one of the bits of the array. Here's an example to illustrate what I mean (showing only 8 bits for brevity):
int…

multitaskPro
- 569
- 4
- 14
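For the bit-expansion question above, one common AVX2 approach is broadcast, per-byte shuffle, AND with per-bit masks, then compare; the sketch below follows that pattern (the helper name and the 0/1-per-byte output convention are my own choices, and on AVX-512BW/VL a single masked move such as _mm256_maskz_mov_epi8 could replace it):

```cpp
#include <immintrin.h>
#include <cstdint>

// Expand each bit of a 32-bit word into one output byte (0x01 if the bit
// is set, 0x00 otherwise). Requires AVX2.
static inline __m256i expand_bits_to_bytes(uint32_t bits) {
    // Broadcast the word, then route the source byte that holds each output
    // lane's bit into that lane (byte 0 feeds output bytes 0..7, and so on).
    __m256i v = _mm256_set1_epi32((int)bits);
    const __m256i byte_select = _mm256_setr_epi8(
        0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
        2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3);
    v = _mm256_shuffle_epi8(v, byte_select);   // per-128-bit-lane shuffle

    // Isolate the bit each output byte represents; compare maps "set" to 0xFF.
    const __m256i bit_of_lane = _mm256_setr_epi8(
        1, 2, 4, 8, 16, 32, 64, (char)0x80, 1, 2, 4, 8, 16, 32, 64, (char)0x80,
        1, 2, 4, 8, 16, 32, 64, (char)0x80, 1, 2, 4, 8, 16, 32, 64, (char)0x80);
    v = _mm256_and_si256(v, bit_of_lane);
    v = _mm256_cmpeq_epi8(v, bit_of_lane);
    return _mm256_and_si256(v, _mm256_set1_epi8(1));   // 0xFF -> 0x01
}
```

Dropping the final AND leaves 0x00/0xFF bytes instead, which is often what a following blend or mask operation wants anyway.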
1 vote · 0 answers
C to VMIPS equivalent translation
I am wondering if the VMIPS code that I have written is equivalent to the C code snippet below. The target vector machine has length 64, and all variables are double-precision.
for (i = 0; i < 35; i=i+1) {
X[i] = A[i]*B[i]-C[i]*D[i];
Y[i] =…

maffffff
- 11
- 1
1 vote · 0 answers
SIMD (AVX2, AVX512) big integer library
I'm looking for a good SIMD (AVX2, AVX512) library with a C/C++ interface (C preferred) to process large arrays of signed and unsigned big integers (mainly 128, 256, and 512 bits wide).
SIMD parallelization, obviously, must work on the array level, not on a…

Akon
- 335
- 1
- 11
1 vote · 0 answers
How to cast __m128 to a union when returning
I want to return the result of _mm_add_ps(), but the return type should be a custom union that has a __m128 member inside.
I tested the performance of returning __m128 and a custom union. It seems that on MSVC this:
return _mm_add_ps(V1, V2);
is…

Zer0day
- 89
- 5
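For the __m128-in-a-union question above, a minimal sketch of the pattern being asked about (the union layout is my own guess at the asker's type): rather than casting, the union is built around its __m128 member and returned by value; whether that round-trips through memory or stays in a register is up to the compiler and calling convention, which is one plausible source of the MSVC difference.

```cpp
#include <xmmintrin.h>

// Hypothetical union; the question's actual type isn't shown in the excerpt.
union Vec4 {
    __m128 m;
    float  f[4];
};

// Construct the union from the intrinsic result and return it by value.
inline Vec4 add(__m128 a, __m128 b) {
    Vec4 r;
    r.m = _mm_add_ps(a, b);
    return r;
}

// Returning the raw __m128 instead lets the value stay in XMM0 under the
// usual x64 conventions, which may be why it benchmarks faster on MSVC.
inline __m128 add_raw(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);
}
```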
1 vote · 1 answer
Fastest way to search an array on an M1 Mac
I am trying to load an array of u16s from memory and find the first element that is less than some number, as fast as possible on an M1 Mac. I have been looking through the NEON instructions, but I wasn't able to find a good way to do it. There are…

Basic Block
- 729
- 9
- 17
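For the M1 search question above, a sketch of one workable NEON approach: compare a block of eight u16 lanes, narrow the 16-bit mask to bytes, and use count-trailing-zeros on the resulting 64-bit value to locate the first hit (the function name and the multiple-of-8 length assumption are mine):

```cpp
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Return the index of the first element < limit, or n if there is none.
// Assumes n is a multiple of 8; a scalar tail loop is needed otherwise.
size_t find_first_less(const uint16_t* data, size_t n, uint16_t limit) {
    const uint16x8_t vlimit = vdupq_n_u16(limit);
    for (size_t i = 0; i < n; i += 8) {
        uint16x8_t v   = vld1q_u16(data + i);
        uint16x8_t cmp = vcltq_u16(v, vlimit);            // 0xFFFF per hit
        uint8x8_t  nar = vmovn_u16(cmp);                  // 0xFF per hit
        uint64_t  mask = vget_lane_u64(vreinterpret_u64_u8(nar), 0);
        if (mask)
            return i + (__builtin_ctzll(mask) >> 3);      // first 0xFF byte
    }
    return n;
}
```

Unrolling to 32 or 64 elements per iteration and only doing the narrow/ctz work after a cheap vmaxvq_u16 "any hit?" check is a common next step on Apple CPUs.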
1 vote · 0 answers
Is there any difference when using AVX2 on the stack and the heap?
In this example I'm adding two arrays using AVX2. If I declare the arrays on the stack, it all works as expected. However, when the memory is allocated on the heap, it compiles but crashes with a segmentation fault at runtime.
Compilation succeeds but…

orientnab
- 68
- 6
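For the stack-vs-heap question above, the usual cause of this exact symptom is alignment: _mm256_load_ps/_mm256_load_si256 require 32-byte alignment, which stack arrays often happen to satisfy while plain new/malloc storage may not. A sketch assuming float arrays (the array names are mine):

```cpp
#include <immintrin.h>
#include <cstdlib>

int main() {
    const std::size_t n = 1024;

    // Plain new/malloc typically guarantees only 16-byte alignment, so the
    // aligned 256-bit loads can fault. Request 32-byte aligned storage ...
    float* a = static_cast<float*>(std::aligned_alloc(32, n * sizeof(float)));
    float* b = static_cast<float*>(std::aligned_alloc(32, n * sizeof(float)));
    float* c = static_cast<float*>(std::aligned_alloc(32, n * sizeof(float)));
    for (std::size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);   // fine: 32-byte aligned pointers
        __m256 vb = _mm256_load_ps(b + i);
        _mm256_store_ps(c + i, _mm256_add_ps(va, vb));
    }
    // ... or keep ordinary allocation and switch to _mm256_loadu_ps /
    // _mm256_storeu_ps, which tolerate any alignment.

    std::free(a); std::free(b); std::free(c);
    return 0;
}
```

std::aligned_alloc is C++17 and not available on MSVC, where _aligned_malloc or _mm_malloc serve the same purpose.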
1 vote · 1 answer
Is `-ftree-loop-vectorize` not enabled by `-O2` in GCC v12?
Example: https://www.godbolt.org/z/ahfcaj7W8
From https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Optimize-Options.html
It says
-ftree-loop-vectorize
Perform loop vectorization on trees. This flag is enabled by default at -O2 and by…

colinfang
- 20,909
- 19
- 90
- 173
1 vote · 1 answer
ARM SVE: svld1(mask, ptr) vs svldff1(svptrue<>, ptr)
In ARM SVE there are masked load instructions, svld1, and there are also first-faulting loads,
svldff1(svptrue<>).
Questions:
Does it make sense to do svld1 with a mask as opposed to svldff1?
The behaviour of the mask in svldff1 seems confusing. Is there a…

Denis Yaroshevskiy
- 1,218
- 11
- 24
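For the SVE question above, a small sketch contrasting the two load styles (ACLE intrinsics, uint8_t data; the helper names are mine): svld1 takes a governing predicate that is known before the load, while svldff1 discovers how far it could safely read and reports that through the FFR register.

```cpp
#include <arm_sve.h>
#include <cstdint>

// svld1: the mask is decided up front (e.g. a loop tail); inactive lanes
// never access memory.
svuint8_t load_tail(const uint8_t* ptr, uint64_t i, uint64_t n) {
    svbool_t pg = svwhilelt_b8_u64(i, n);    // lanes i..n-1 active
    return svld1_u8(pg, ptr + i);
}

// svldff1: load speculatively with an all-true predicate; lanes after the
// first faulting element are simply not loaded, and svrdffr() tells you
// which lanes are valid.
svuint8_t load_speculative(const uint8_t* ptr, svbool_t* valid) {
    svsetffr();                              // reset the first-fault register
    svuint8_t v = svldff1_u8(svptrue_b8(), ptr);
    *valid = svrdffr();
    return v;
}
```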
1 vote · 1 answer
In SIMD/SSE2, many instructions are named "_mm_set_epi8", "_mm_cmpgt_epi8", and so on; what do "mm" and "epi" mean?
I see many instructions with shorthand such as "_mm_and_si128". I want to know what the "mm" means.

dongwang
- 13
- 2
1 vote · 1 answer
Is there any performance difference between AVX-512 `_mm512_load_epi64` and `_mm512_loadu_epi64`?
The motivation for this question
The unaligned load is generally the more common choice. The developer should use the aligned SIMD load when the address is already aligned. So I started to wonder whether there are performance differences between these…

Jigao Luo
- 127
- 1
- 6
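For the aligned/unaligned load question above, the semantic difference is easy to show even before measuring anything (the buffer name is mine): the aligned form faults on a misaligned address, the unaligned form accepts any address, and when the data really is 64-byte aligned the two typically perform the same.

```cpp
#include <immintrin.h>
#include <cstdint>

alignas(64) static int64_t buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};

int main() {
    __m512i a = _mm512_load_epi64(buf);    // requires a 64-byte aligned address
    __m512i b = _mm512_loadu_epi64(buf);   // accepts any address

    __m512i s = _mm512_add_epi64(a, b);
    _mm512_storeu_epi64(buf, s);
    return 0;
}
```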
1 vote · 1 answer
Convert vector compare mask into bit mask in AArch64 SIMD or ARM NEON?
Let's take the example of "ABAA". I can use result = vceqq_u8(input, vdupq_n_u8('A')) to get FF 00 FF FF (or 0xFFFF00FF).
Sometimes I only need to know the first match, other times I want to know all. From the result register is there a way I can get…

Stan
- 161
- 8
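For the mask-extraction question above, a sketch of the narrowing-shift trick commonly used on AArch64 (helper names are mine): vshrn_n_u16 by 4 packs the 16 compare bytes into a 64-bit value with 4 bits per lane, which a single count-trailing-zeros then turns into the index of the first match.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Collapse a 16-byte compare result (0xFF/0x00 per lane) into a 64-bit
// value carrying 4 bits per lane.
inline uint64_t neon_movemask_nibbles(uint8x16_t cmp) {
    uint8x8_t nib = vshrn_n_u16(vreinterpretq_u16_u8(cmp), 4);
    return vget_lane_u64(vreinterpret_u64_u8(nib), 0);
}

// Index of the first byte equal to `needle`, or -1 if none matches.
inline int first_match_index(uint8x16_t input, uint8_t needle) {
    uint8x16_t cmp  = vceqq_u8(input, vdupq_n_u8(needle));
    uint64_t   mask = neon_movemask_nibbles(cmp);
    return mask ? (int)(__builtin_ctzll(mask) >> 2) : -1;
}
```

For the "all matches" case, the same 64-bit value can be walked 4 bits at a time, clearing each nibble after it is consumed.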
1 vote · 0 answers
Eigen3: How to verify whether AVX2 or AVX512F code is being generated?
I am developing a program that involves a lot of low-latency, hard real-time matrix operations. I am using the Eigen 3 library for this.
I wish to use AVX-512F SIMD vectorization in production for performance acceleration.
I am currently…

Dark Sorrow
- 1,681
- 14
- 37
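For the Eigen question above, two checks that don't require reading disassembly, sketched below (assuming Eigen 3.4): Eigen::SimdInstructionSetsInUse() reports the packet types compiled in, and the EIGEN_VECTORIZE_* macros can be tested at compile time. Grepping the generated assembly for zmm/ymm registers remains the definitive check.

```cpp
#include <Eigen/Core>
#include <cstdio>

int main() {
    // Runtime report of the instruction sets this translation unit was
    // built to use (depends on -march/-mavx512f etc. at compile time).
    std::printf("Eigen SIMD in use: %s\n", Eigen::SimdInstructionSetsInUse());

#if defined(EIGEN_VECTORIZE_AVX512)
    std::printf("AVX-512 packets enabled\n");
#elif defined(EIGEN_VECTORIZE_AVX2)
    std::printf("AVX2 packets enabled\n");
#endif
    return 0;
}
```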
1 vote · 0 answers
Why are alternate elements of the output vector of the _mm256_mul_epi32 AVX intrinsic zero?
I am learning SIMD instructions. I tried to implement a vector dot product using AVX intrinsics, but to my astonishment I found that alternate elements of the 256-bit result vector are zeros.
I tried to write a short piece of code reproducing the issue. I…

Abhishek Ghosh
- 597
- 7
- 18
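For the _mm256_mul_epi32 question above, the zeros are expected: that intrinsic reads only the even 32-bit elements (0, 2, 4, 6) of each operand and widens each product to 64 bits, so reinterpreting the result as eight 32-bit lanes shows the upper halves of small products as zero. A sketch contrasting it with _mm256_mullo_epi32, which is usually what a 32-bit dot product wants (variable names are mine):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    __m256i a = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
    __m256i b = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);

    // Four 64-bit products of elements 0,2,4,6 -> 1, 9, 25, 49.
    __m256i wide = _mm256_mul_epi32(a, b);
    // Eight 32-bit products, low halves kept -> 1, 4, 9, ..., 64.
    __m256i lo   = _mm256_mullo_epi32(a, b);

    alignas(32) int32_t w[8], l[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(w), wide);
    _mm256_store_si256(reinterpret_cast<__m256i*>(l), lo);
    for (int i = 0; i < 8; ++i)
        std::printf("mul_epi32 lane %d = %d, mullo_epi32 lane %d = %d\n",
                    i, w[i], i, l[i]);
    return 0;
}
```

Viewed as 32-bit lanes, the mul_epi32 result prints 1, 0, 9, 0, 25, 0, 49, 0, which is exactly the "alternate elements are zero" pattern described in the question.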