Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data generally needs to be in structure-of-arrays form and should occur in longer streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.
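To illustrate the structure-of-arrays point in the description above, here is a minimal sketch in plain C. The type and function names (`PointAoS`, `PointsSoA`, `squared_lengths_soa`, `squared_lengths_aos`) are hypothetical and chosen only for this example. Both functions compute the same result; the difference is the memory layout they read from.

```c
#include <stddef.h>

/* Array-of-structures: the x, y, z of one point sit next to each other. */
struct PointAoS { float x, y, z; };

/* Structure-of-arrays: all x values are contiguous, likewise y and z.
   (Hypothetical layout for illustration.) */
struct PointsSoA {
    float *x;
    float *y;
    float *z;
};

/* Squared length of every point, SoA layout. Each iteration reads
   consecutive floats from each array, so several elements can be
   fetched with one contiguous vector load per array. */
void squared_lengths_soa(const struct PointsSoA *p, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = p->x[i] * p->x[i] + p->y[i] * p->y[i] + p->z[i] * p->z[i];
}

/* Same computation, AoS layout. The x values are 3 floats apart, so
   filling a vector with x values needs strided loads or shuffles. */
void squared_lengths_aos(const struct PointAoS *p, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = p[i].x * p[i].x + p[i].y * p[i].y + p[i].z * p[i].z;
}
```

Whether a compiler actually vectorizes either loop depends on the compiler, flags, and target, so the generated assembly is worth checking before assuming a speedup.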

2540 questions
1
vote
0 answers

GCC error for "vmull.u16 q7, d19, d8[0]" but not for "vmull.u16 q7, d19, d7[0]"

I am using Arm GNU Toolchain 12.2.Rel1 (Build arm-12.24) 12.2.1 20221205 on Windows 11, and when compiling a sequence of NEON instructions (vector multiplication by scalar): vmull.u16 q7, d19, d0[0] vmull.u16 q7, d19, d8[0] the first one…
jcdmelo
  • 11
  • 2
1
vote
0 answers

How should I choose between _mm_move_sd / _mm_shuffle_pd / _mm_blend_pd

If I am not mistaken, _mm_shuffle_pd(x, y, _MM_SHUFFLE2(0, 1)); and _mm_move_sd(x, y); and also _mm_blend_pd from a later instruction set should all do the same thing. But clang and gcc generate different instructions for SSE2 on godbolt. And they…
Denis Yaroshevskiy
  • 1,218
  • 11
  • 24
1
vote
0 answers

Sum of bytes in an __m128 register

I am trying to find the sum of all bytes in an __m128 register using SSE and SSE2. So far what I have is __m128i sum = _mm_sad_epu8(bytes, _mm_setzero_si128()); return _mm_cvtsi128_si32(sum) + _mm_extract_epi16(sum, 4); where bytes is the __m128…
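The two intrinsic calls quoted in the excerpt above can be wrapped into a small self-contained function, sketched below. The function name `sum_bytes16` and the scalar fallback are additions for illustration; the SSE2 path is used only when the compiler defines `__SSE2__`.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Sum of 16 unsigned bytes starting at p (hypothetical helper). */
uint32_t sum_bytes16(const uint8_t *p)
{
#if defined(__SSE2__)
    __m128i bytes = _mm_loadu_si128((const __m128i *)p);
    /* psadbw against zero: each 64-bit half receives the sum of its
       8 bytes in its low 16 bits. */
    __m128i sad = _mm_sad_epu8(bytes, _mm_setzero_si128());
    /* Low half sum is in 16-bit word 0, high half sum is in word 4. */
    return (uint32_t)_mm_cvtsi128_si32(sad)
         + (uint32_t)_mm_extract_epi16(sad, 4);
#else
    /* Portable fallback so the sketch builds on non-x86 targets. */
    uint32_t s = 0;
    for (int i = 0; i < 16; ++i)
        s += p[i];
    return s;
#endif
}
```

The maximum possible result is 16 * 255 = 4080, so the two partial sums cannot overflow their 16-bit fields.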
1
vote
1 answer

SIMD Intrinsics AVX. Tried to use _mm256_mullo_epi64. But got 0xC000001D: Illegal Instruction exception

I want to multiply two NxN matrices using SIMD. I want to do matrix multiplication for 64-bit integers, and multiply one element of a matrix with another element with the same index. For example: c[1][1] = a[1][1] * b[1][1] An error occurs when…
hellicop11
  • 13
  • 2
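For context on the question above: `_mm256_mullo_epi64` compiles to `vpmullq`, which requires AVX-512DQ together with AVX-512VL, so running it on a CPU without those extensions would be one plausible cause of an illegal-instruction fault. The sketch below is not taken from the question or its answer. It shows a common way to build a 64-bit low multiply from 32-bit multiplies using only AVX2, with a scalar fallback when AVX2 is not enabled at compile time (for example without `-mavx2`). The function name `mul_lo_u64x4` is hypothetical.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* out[i] = low 64 bits of a[i] * b[i], for 4 elements. */
void mul_lo_u64x4(const uint64_t a[4], const uint64_t b[4], uint64_t out[4])
{
#if defined(__AVX2__)
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    /* Move the high 32 bits of each lane into the low 32 bits. */
    __m256i a_hi = _mm256_srli_epi64(va, 32);
    __m256i b_hi = _mm256_srli_epi64(vb, 32);
    /* _mm256_mul_epu32 multiplies the low 32 bits of each 64-bit lane
       and yields a full 64-bit product. */
    __m256i lo_lo = _mm256_mul_epu32(va, vb);            /* a_lo * b_lo */
    __m256i cross = _mm256_add_epi64(
        _mm256_mul_epu32(a_hi, vb),                      /* a_hi * b_lo */
        _mm256_mul_epu32(va, b_hi));                     /* a_lo * b_hi */
    /* a*b mod 2^64 = a_lo*b_lo + ((a_hi*b_lo + a_lo*b_hi) << 32). */
    __m256i res = _mm256_add_epi64(lo_lo, _mm256_slli_epi64(cross, 32));
    _mm256_storeu_si256((__m256i *)out, res);
#else
    /* Scalar fallback: unsigned multiplication wraps modulo 2^64. */
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] * b[i];
#endif
}
```

The `a_hi * b_hi` term is omitted because it only affects bits 64 and above, which the low multiply discards.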
1
vote
1 answer

The fastest way to convert a UInt64 hex string to a UInt32 value preserving as many leading digits as possible, i.e. truncation

I'm looking for the fastest way to parse a hex string representing a ulong into a uint keeping as many leading digits as a uint can handle and discarding the rest. For example, string hex = "0xab54a9a1df8a0edb"; // 12345678991234567899 Should…
Vas
  • 747
  • 1
  • 12
  • 18
1
vote
0 answers

How to use Java 17 vector instructions to optimize matrix multiplication?

I'm trying to optimize matrix multiplication implemented using Java with nested loops. I'm planning to use the Java 17 Vector API to optimize performance. I have read the documentation of the Vector API, but I am not sure how to apply it to my matrix…
Isuru Perera
  • 351
  • 1
  • 3
  • 13
1
vote
3 answers

Call libmvec functions manually on __m128 vectors?

According to this page https://sourceware.org/glibc/wiki/libmvec, I should be able to manually vectorize a few complicated instructions like cosine by using the libmvec functions. However, I don't know how to get gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1)…
Simon Goater
  • 759
  • 1
  • 1
  • 7
1
vote
1 answer

SIMD: how to find minimum values among 4 __m256d registers with its index

I have four __m256d values; how can I find the minimum among all 16 elements? How can I tell which __m256d variable the minimum came from, and which element it is? Assume some values are repeated across different __m256d variables. I'm trying but it doesn't…
holmessh
  • 65
  • 5
1
vote
2 answers

How to check whether odd lane is in a given range when its previous even lane equals some value using SIMD?

This question is an extension of How to check if even/odd lanes are in given ranges using SIMD?. Given a __m128i which stores 16 chars, an even lane is a lane at an even index (i.e., lanes 0, 2, 4, ..., 14), and an odd lane…
chenzhongpu
  • 6,193
  • 8
  • 41
  • 79
1
vote
1 answer

Efficiently extract single double element from AVX-512 vector

I was wondering what the most efficient way is to extract a single double element from an AVX-512 vector without spilling it, using intrinsics. Currently I'm doing a masked reduce add: double extract(int idx, __m512d v){ __mmask8 mask =…
lulle2007200
  • 888
  • 9
  • 20
1
vote
1 answer

How to multiply-accumulate unsigned bytes into 32-bit elements without overflow with RISC-V extension "V" SIMD vectors?

I am writing vector code with RISC-V intrinsics for extension V vectors, but this question probably applies to vectorisation generally. I need to multiply and accumulate lots of uint8 values. To do this I want to fill the vector registers with…
1
vote
1 answer

Split bit mask into sub-masks on set bits

I have a mask with a small number of set bits, just 3 or 4 of them. The mask can be up to 64 bits, but let's take a short example: 10100101. I'd like to generate masks that stop at the set bits but include the lower bits up to the previous stop…
BitWhistler
  • 1,439
  • 8
  • 12
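The excerpt above is truncated, so the following is a sketch under an assumed reading: for each set bit, produce a sub-mask covering that bit and every lower bit down to (but excluding) the previous set bit. Under that assumption, 10100101 would split into 00000001, 00000110, 00111000, and 11000000. The function name `split_mask` is hypothetical, and this is plain scalar C rather than a SIMD solution.

```c
#include <stdint.h>

/* Writes one sub-mask per set bit of `mask` into `out` (capacity 64)
   and returns how many were written. Sub-mask k covers set bit k and
   all bits below it that lie above set bit k-1 (assumed semantics). */
int split_mask(uint64_t mask, uint64_t out[64])
{
    int n = 0;
    uint64_t below = 0;                         /* bits already assigned */
    while (mask) {
        uint64_t lowest = mask & (~mask + 1);   /* isolate lowest set bit */
        uint64_t upto = lowest | (lowest - 1);  /* that bit and all lower bits */
        out[n++] = upto & ~below;               /* drop bits already assigned */
        below = upto;
        mask &= mask - 1;                       /* clear lowest set bit */
    }
    return n;
}
```

With only 3 or 4 set bits the loop runs 3 or 4 times, so a branch-free or vectorized variant may or may not be worth the extra complexity.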
1
vote
1 answer

How to check whether all bits of ByteVector are 0?

I am using the SIMD API in Java: // both `buffer` and `markVector` are ByteVector var result = buffer.and(markVector); My requirement is to check whether all bits in result are 0 efficiently. A workaround is to convert it to byte[], and then…
chenzhongpu
  • 6,193
  • 8
  • 41
  • 79
1
vote
2 answers

Finding closest matching color from an array

I'm working on implementing a terminal renderer, and after quantizing the original image down to 256 colors, I need to find the closest representation for each pixel of the image. I was looking at doing this by comparing the squared distances, but…
Cloud11665
  • 94
  • 1
  • 6
1
vote
1 answer

How to get _mm256_rcp_pd in AVX2?

For some reason _mm256_rcp_pd is not in AVX or AVX2. In AVX512 we got _mm256_rcp14_pd. Is there a way to get a fast approximate reciprocal in double precision on AVX2? Are we supposed to convert to single precision and then back?
Unlikus
  • 1,419
  • 10
  • 24
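The approach the question above asks about (convert to single precision, take the approximate reciprocal, convert back) can be sketched as follows, with two Newton-Raphson steps added to recover precision. This is an illustrative sketch, not an answer from the thread. The function name `approx_recip4` is hypothetical, only AVX intrinsics are used, and a scalar fallback applies when AVX is not enabled at compile time (for example without `-mavx` or `-mavx2`).

```c
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* out[i] is approximately 1.0 / in[i], for 4 doubles.
   Assumes every input is nonzero and fits the float range. */
void approx_recip4(const double in[4], double out[4])
{
#if defined(__AVX__)
    __m256d x = _mm256_loadu_pd(in);
    /* Narrow to float, take the roughly 12-bit reciprocal estimate,
       then widen back to double. */
    __m256d r = _mm256_cvtps_pd(_mm_rcp_ps(_mm256_cvtpd_ps(x)));
    const __m256d two = _mm256_set1_pd(2.0);
    /* Newton-Raphson step: r = r * (2 - x * r). Each step roughly
       doubles the number of correct bits. */
    r = _mm256_mul_pd(r, _mm256_sub_pd(two, _mm256_mul_pd(x, r)));
    r = _mm256_mul_pd(r, _mm256_sub_pd(two, _mm256_mul_pd(x, r)));
    _mm256_storeu_pd(out, r);
#else
    /* Scalar fallback so the sketch builds without AVX. */
    for (int i = 0; i < 4; ++i)
        out[i] = 1.0 / in[i];
#endif
}
```

Inputs outside the float range overflow or flush toward zero during the narrowing step, so this only suits data known to be well scaled. Whether it beats a plain `_mm256_div_pd` depends on the CPU and should be measured.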