Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small vector of data elements rather than on a single value. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data generally needs to be in structure-of-arrays form and should be processed in long, contiguous streams. Naively "SIMD-optimized" code often ends up running slower than the original.
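
As a rough illustration of the structure-of-arrays point in the tag description, here is a minimal sketch in plain C. The type and function names (`PointAoS`, `PointsSoA`, `scale_x_aos`, `scale_x_soa`) are hypothetical and chosen only for this example. In the SoA version the `x` values are contiguous, which is the layout vector loads and auto-vectorizers tend to handle best; whether a given compiler actually vectorizes either loop is not guaranteed.

```c
#include <stddef.h>

/* Array-of-structures: the x, y, z fields of one point sit next to each
   other, so consecutive x values are 12 bytes apart in memory. */
struct PointAoS { float x, y, z; };

/* Structure-of-arrays: all x values are contiguous, then all y, then all z.
   (Hypothetical layout for illustration.) */
struct PointsSoA { float *x; float *y; float *z; };

/* Scale every x coordinate. With AoS the loop touches strided data. */
static void scale_x_aos(struct PointAoS *p, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        p[i].x *= k;
}

/* Same operation on SoA data: a unit-stride stream of floats. */
static void scale_x_soa(struct PointsSoA *p, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        p->x[i] *= k;
}
```

Both functions compute the same result; only the memory layout differs.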

2540 questions

0 answers

GCC error for "vmull.u16 q7, d19, d8[0]" but not for "vmull.u16 q7, d19, d7[0]"

I am using Arm GNU Toolchain 12.2.Rel1 (Build arm-12.24)) 12.2.1 20221205 on Windows 11, and on compilation of a sequence of NEON instructions (vector multiplication by scalar): vmull.u16 q7, d19, d0[0] vmull.u16 q7, d19, d8[0] the first one…

asked Jul 30 '23 at 19:14

jcdmelo

0 answers

How should I choose between _mm_move_sd / _mm_shuffle_pd / _mm_blend_pd

If I am not mistaken, _mm_shuffle_pd(x, y, _MM_SHUFFLE2(0, 1)); and _mm_move_sd(x, y); And also _mm_blend_pd in a later instruction set should all do the same thing. But clang and gcc generate different instructions on sse2 godbolt. And they…

x86 simd

asked Jun 26 '23 at 00:09

Denis Yaroshevskiy

1,218
11
24

0 answers

Sum of bytes in an __m128 register

I am trying to find the sum of all bytes in an __m128 register using SSE and SSE2. So far what I have is __m128i sum = _mm_sad_epu8(bytes, _mm_setzero_si128()); return _mm_cvtsi128_si32(sum) + _mm_extract_epi16(sum, 4); where bytes is the __m128…

c simd sse sse2

asked Jun 08 '23 at 22:36

user17784058
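
The excerpt above sums bytes with `_mm_sad_epu8` against zero. A minimal sketch of that idea follows, wrapped in a hypothetical helper `sum16_bytes` that loads from memory. It assumes a compiler that defines `__SSE2__` (GCC and Clang do on x86-64) and falls back to a scalar loop otherwise; it is not taken from any answer to the question.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Sum 16 unsigned bytes starting at p (hypothetical helper name). */
static uint32_t sum16_bytes(const uint8_t *p)
{
#if defined(__SSE2__)
    __m128i bytes = _mm_loadu_si128((const __m128i *)p);
    /* psadbw against zero: each 64-bit half receives the sum of its
       8 bytes (at most 8 * 255 = 2040, so it fits in 16 bits). */
    __m128i sad = _mm_sad_epu8(bytes, _mm_setzero_si128());
    /* Low-half sum is in the lowest 32 bits; high-half sum is in
       16-bit element 4. */
    return (uint32_t)_mm_cvtsi128_si32(sad)
         + (uint32_t)_mm_extract_epi16(sad, 4);
#else
    /* Scalar fallback when SSE2 is not available at compile time. */
    uint32_t s = 0;
    for (int i = 0; i < 16; ++i)
        s += p[i];
    return s;
#endif
}
```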

1 answer

SIMD Intrinsics AVX. Tried to use _mm256_mullo_epi64. But got 0xC000001D: Illegal Instruction exception

I want to multiply two NxN matrices using SIMD. I want to do matrix multiplication for 64-bit integers, and multiply one element of a matrix with another element with the same index. For example: c[1][1] = a[1][1] * b[1][1] An error occurs when…

c++ exception simd avx avx2

asked Jun 08 '23 at 22:18

hellicop11

1 answer

The fastest way to convert a UInt64 hex string to a UInt32 value preserving as many leading digits as possible, i.e. truncation

I'm looking for the fastest way to parse a hex string representing a ulong into a uint keeping as many leading digits as a uint can handle and discarding the rest. For example, string hex = "0xab54a9a1df8a0edb"; // 12345678991234567899 Should…

c# parsing decimal simd truncation

asked Jun 01 '23 at 04:27

Vas

0 answers

How to use Java 17 vector instructions to optimize matrix multiplication?

I'm trying to optimize matrix multiplication implemented using Java with nested loops. I'm planning to use Java 17 vector API to optimize performance. I have read the documentation of the Vector API, but I am not sure how to apply them to my matrix…

java vectorization matrix-multiplication simd

asked May 26 '23 at 13:09

Isuru Perera

3 answers

Call libmvec functions manually on __m128 vectors?

According to this page https://sourceware.org/glibc/wiki/libmvec, I should be able to manually vectorize a few complicated instructions like cosine by using the libmvec functions. However, I don't know how to get gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1)…

c simd sse glibc intrinsics

asked May 25 '23 at 22:42

Simon Goater

1 answer

SIMD: how to find the minimum value among 4 __m256d registers along with its index

I have 4 __m256d registers. How can I find the minimum among all 16 values? How can I know which __m256d variable the minimum came from, and which element it is? Assume that some of the values are the same across different __m256d variables. I'm trying but it doesn't…

vectorization simd avx

asked May 10 '23 at 15:47

holmessh

2 answers

How to check whether an odd lane is in a given range when its previous even lane equals some value using SIMD?

This question is an extension of "How to check if even/odd lanes are in given ranges using SIMD?". Given a __m128i which stores 16 chars, an even lane is a lane at an even index (i.e., lanes 0, 2, 4, ..., 14), and an odd lane is a lane at an odd index…

x86 simd sse

asked May 05 '23 at 09:09

chenzhongpu

6,193
8
41
79

1 answer

Efficiently extract single double element from AVX-512 vector

I was wondering what the most efficient way is to extract a single double element from an AVX-512 vector without spilling it, using intrinsics. Currently I'm doing a masked reduce add: double extract(int idx, __m512d v){ __mmask8 mask =…

simd intrinsics avx512

asked Apr 12 '23 at 16:42

lulle2007200

1 answer

How to multiply-accumulate unsigned bytes into 32-bit elements without overflow with RISC-V extension "V" SIMD vectors?

I am writing vector code with RISC-V intrinsics for extension V vectors, but this question probably applies to vectorisation generally. I need to multiply and accumulate lots of uint8 values. To do this I want to fill the vector registers with…

c vectorization simd intrinsics riscv

asked Apr 03 '23 at 20:53

confusedandsad

1 answer

Split bit mask into sub-masks on set bits

I have a mask with a small number of set bits, just 3 or 4 of them. The mask can be up to 64 bits wide, but let's take a short example: 10100101. I'd like to generate masks that stop at the set bits but include the lower bits up to the previous stop…

assembly bit-manipulation x86-64 simd intrinsics

asked Mar 20 '23 at 23:38

BitWhistler

1,439
8
12

1 answer

How to check whether all bits of ByteVector are 0?

I am using the SIMD API in Java: // both `buffer` and `markVector` are ByteVector var result = buffer.and(markVector); My requirement is to check efficiently whether all bits in result are 0. One workaround is to convert it to byte[], and then…

java simd

asked Mar 06 '23 at 06:57

chenzhongpu

6,193
8
41
79

2 answers

Finding closest matching color from an array

I'm working on implementing a terminal renderer, and after quantizing the original image down to 256 colors, I need to find the closest representation for each pixel of the image. I was looking at doing this by comparing the squared distances, but…

c colors x86 simd avx2

asked Mar 05 '23 at 16:35

Cloud11665
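
The squared-distance comparison mentioned in the excerpt above can be sketched in portable C as follows. The `Palette` struct and `nearest_color` function are hypothetical names for this illustration, and the structure-of-arrays palette layout is an assumption made so that a compiler has a chance to vectorize the loop; the question's actual data layout is not shown in the excerpt.

```c
#include <stddef.h>
#include <stdint.h>

/* Palette stored as three separate channel arrays (assumed SoA layout). */
struct Palette {
    const uint8_t *r;
    const uint8_t *g;
    const uint8_t *b;
    size_t count;
};

/* Return the index of the palette entry with the smallest squared RGB
   distance to (r, g, b). On ties the lowest index wins. Returns 0 for an
   empty palette, so callers should ensure count > 0. */
static size_t nearest_color(const struct Palette *pal,
                            uint8_t r, uint8_t g, uint8_t b)
{
    size_t best = 0;
    uint32_t best_d = UINT32_MAX;
    for (size_t i = 0; i < pal->count; ++i) {
        int dr = (int)pal->r[i] - (int)r;
        int dg = (int)pal->g[i] - (int)g;
        int db = (int)pal->b[i] - (int)b;
        /* At most 3 * 255^2 = 195075, so no overflow. */
        uint32_t d = (uint32_t)(dr * dr + dg * dg + db * db);
        if (d < best_d) {
            best_d = d;
            best = i;
        }
    }
    return best;
}
```

Comparing squared distances avoids the square root, since the ordering of distances is unchanged by squaring.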

1 answer

How to get _mm256_rcp_pd in AVX2?

For some reason _mm256_rcp_pd is not in AVX or AVX2. In AVX512 we got _mm256_rcp14_pd. Is there a way to get a fast approximate reciprocal in double precision on AVX2? Are we supposed to convert to single precision and then back?

simd avx2

asked Mar 03 '23 at 14:35

Unlikus

1,419
10
24
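
The "convert to single precision and back" idea raised in the question above can be sketched as follows: take the single-precision `_mm_rcp_ps` estimate, widen it to double, and refine it with Newton-Raphson steps of the form x = x * (2 - a * x), where the relative error is roughly squared at each step. The helper name `approx_recip_pd` is hypothetical, the AVX path is only compiled when `__AVX__` is defined, and this sketch assumes inputs are nonzero and within single-precision range. Whether it is actually faster than a plain division has not been measured here.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Write an approximation of 1.0 / in[i] to out[i] for i in [0, n).
   Hypothetical helper; assumes nonzero inputs representable as float. */
static void approx_recip_pd(const double *in, double *out, size_t n)
{
    size_t i = 0;
#if defined(__AVX__)
    const __m256d two = _mm256_set1_pd(2.0);
    for (; i + 4 <= n; i += 4) {
        __m256d a = _mm256_loadu_pd(in + i);
        /* Narrow to float, take the ~12-bit estimate, widen back. */
        __m256d x = _mm256_cvtps_pd(_mm_rcp_ps(_mm256_cvtpd_ps(a)));
        /* Two Newton-Raphson refinements: x = x * (2 - a * x). */
        x = _mm256_mul_pd(x, _mm256_sub_pd(two, _mm256_mul_pd(a, x)));
        x = _mm256_mul_pd(x, _mm256_sub_pd(two, _mm256_mul_pd(a, x)));
        _mm256_storeu_pd(out + i, x);
    }
#endif
    /* Scalar tail (and full fallback without AVX): same refinement,
       starting from a single-precision division. */
    for (; i < n; ++i) {
        double a = in[i];
        double x = (double)(1.0f / (float)a);
        x = x * (2.0 - a * x);
        x = x * (2.0 - a * x);
        out[i] = x;
    }
}
```

Adding or removing a refinement step trades accuracy against instruction count.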
