Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include: x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To efficiently use SIMD instructions, data needs to be in structure-of-arrays form and should occur in longer streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.
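As a rough illustration of the structure-of-arrays point above, here is a minimal, portable C++ sketch. The names `PointAoS`, `PointsSoA`, `products_aos`, and `products_soa` are hypothetical and chosen only for this example. It uses no intrinsics; the assumption is that the contiguous SoA loop is the shape a compiler's auto-vectorizer (or hand-written SIMD) can handle with plain vector loads, whereas the AoS loop needs strided or shuffled loads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures: the x and y of one point sit next to each other,
// so consecutive x values are strided in memory.
struct PointAoS {
    float x;
    float y;
};

// Structure-of-arrays: all x values are contiguous, and so are all y values.
struct PointsSoA {
    std::vector<float> x;
    std::vector<float> y;
};

// Element-wise x*y over AoS input (hypothetical helper).
std::vector<float> products_aos(const std::vector<PointAoS>& pts) {
    std::vector<float> out(pts.size());
    for (std::size_t i = 0; i < pts.size(); ++i) {
        out[i] = pts[i].x * pts[i].y;
    }
    return out;
}

// Element-wise x*y over SoA input (hypothetical helper).
// Each iteration reads x[i] and y[i] from two contiguous streams.
std::vector<float> products_soa(const PointsSoA& pts) {
    assert(pts.x.size() == pts.y.size());
    std::vector<float> out(pts.x.size());
    for (std::size_t i = 0; i < pts.x.size(); ++i) {
        out[i] = pts.x[i] * pts.y[i];
    }
    return out;
}
```

Both functions compute the same values; only the memory layout of the input differs. Whether the SoA version is actually faster depends on the compiler, flags, and data size, so it is worth measuring rather than assuming.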
Questions tagged [simd]
2540 questions
1
vote
1 answer
What is the best way to loop AVX for un-even non-aligned array?
If the array length is not divisible by 8 (for integers), what is the best way to write the loop for it? One possible way I have figured out so far is to split it into 2 separate loops: 1 main loop for almost all elements, and 1 tail loop with maskload/maskstore for…

Vladislav Kogan
- 561
- 6
- 15
1
vote
0 answers
how to use vpblendmd in avx-512 assembly
In the Intel® Intrinsics Guide it says Instruction: vpblendmd zmm {k}, zmm, zmm. I don't understand how to use the {k}. For example, I wrote the following code:
vpblendmd %zmm1,%zmm2,%zmm2{0xAAAA}
but it fails to compile, and the compiler…

anna
- 39
- 3
1
vote
0 answers
Accuracy of the ARM NEON vrsqrteq intrinsic
I am using the ARM NEON vrsqrteq intrinsic to calculate the approximate reciprocal square root of a vector of floats. I would like to know the accuracy of that approximation.
However, I can't find any documentation that provides this.
The Neon…

jonicho
- 55
- 1
- 5
1
vote
0 answers
SIMD intrinsics slower for cross products over an array of points than whatever GCC -O3 -march=native does on its own?
I am trying to use SIMD (x86 immintrin.h) to speed up my math; the code looks like this:
#include
#include
#include
class Point2 {
public:
Point2() = default;
Point2(double xx, double yy):x_(xx), y_(yy) {}
//…

komonzhang
- 67
- 5
1
vote
0 answers
Efficiently find indices of 1-bits in large array, using SIMD
If I have a very large array of bytes and want to find the indices of all 1-bits, with indices counting from the leftmost bit, how do I do this efficiently, probably using SIMD?
(For finding the first 1-bit, see an earlier question. This question produces an…

Arty
- 14,883
- 6
- 36
- 69
1
vote
0 answers
C++ std::countr_zero() in SIMD 128/256/512 (find position of least significant 1 bit in 128/256/512-bit number)
If I have a 128-, 256-, or 512-bit memory region, how can I find the number of consecutive zero bits starting from the least significant bit (left-most byte)? I can do:
#include
int CountRZero512(uint64_t const * ptr) {
for (int i = 0;…

Arty
- 14,883
- 6
- 36
- 69
1
vote
2 answers
How to deal with the lack of `simd_packed_float3` in Swift
There is no simd_packed_float3 type in Swift.
Why is it a problem?
Consider this Metal struct:
struct Test{
packed_float3 x;
float y;
};
First of all, you can't calculate a buffer pointer to address the memory of y, since you can't do…

Roman Gaditskii
- 135
- 6
1
vote
1 answer
Search over an array of 14 integers, build a mask and return the match on ARMv8a using NEON
For my open source project cachegrand we are implementing AARCH64 support, and although most of the port is completed, we are sorting out a feature to perform an accelerated array search using NEON instructions.
The logic we use is pretty simple:
in…

Daniele Salvatore Albano
- 1,263
- 2
- 13
- 29
1
vote
1 answer
SIMD transposition of 8x8 matrix of 32-bit values in Java
I found the following code in C++ for fast transposition of an 8x8 matrix of 32-bit values: https://stackoverflow.com/a/51887176/1915854
inline void Transpose8x8Shuff(unsigned long *in)
{
__m256 *inI = reinterpret_cast<__m256 *>(in);
…

Serge Rogatch
- 13,865
- 7
- 86
- 158
1
vote
1 answer
JUnit tests do not seem to get run with --add-modules=jdk.incubator.vector from Maven
I've added SIMD code to a Java application that uses Maven to build, and now I have to run it like this:
mvn exec:java -Dexec.mainClass="com.path.to.app.MainClass" -Dexec.classpathScope=runtime -Dexec.systemProperties="-da…

Serge Rogatch
- 13,865
- 7
- 86
- 158
1
vote
1 answer
Why on earth would I want to use PMULHRSW/VPMULHRSW?
I was looking for an appropriate AVX2 multiplication instruction to use in my code, and came across the vpmulhrsw (_mm256_mulhrs_epi16(__m256i a, __m256i b)) instruction.
The description on the Intel Intrinsics Guide says:
Multiply packed signed…

Bernard
- 5,209
- 1
- 34
- 64
1
vote
0 answers
_mm_load_si128 is NOT throwing on unaligned access
Intel's manual mentions that it may generate an exception; the wording seems a little bit interesting.
Load 128-bits of integer data from memory into dst. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be…

Hasan Emrah Süngü
- 3,488
- 1
- 15
- 33
1
vote
1 answer
How do you implement an efficient parallel SIMD compare and select in Cg?
How do you do parallel selection efficiently?
For example, given this scalar code, is there a way to write it so the Cg compiler will make the code execute in parallel / SIMD (and potentially using a branch-free selection as well)?
Out.x =…

Adisak
- 6,708
- 38
- 46
1
vote
1 answer
Vector overload of a function (provide a manually vectorized version of a function for auto-vectorization to use)
I am using C, and I want to have two versions of the same function, a scalar version and a vector version. The two functions have the same signature, and the compiler should pick the correct version depending on the context - if the context is…

Bogi
- 2,274
- 5
- 26
- 34
1
vote
2 answers
How to use AVX intrinsics to compare two vectors of packed double precision in C
I would like to use _mm512_mask_cmple_pd_mask to compare two packed double precision vectors. My issue is that the result comes as an __mmask8 type...
So I guess that my question is how I convert such a mask into packed integer vectors, so I can use the…

Jofre
- 31
- 3