Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include: x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To efficiently use SIMD instructions, data needs to be in structure-of-arrays form and should occur in longer streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.
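To illustrate the structure-of-arrays point, here is a minimal sketch in portable C. The type and function names (`PointAoS`, `scale_x_aos`, `scale_x_soa`) are made up for this example. In the array-of-structures version the `x` values are interleaved with `y`, so a vector load would pick up data it does not need. In the structure-of-arrays version the `x` values are contiguous and map directly onto vector lanes. Whether a given compiler actually vectorizes either loop depends on its flags and target.

```c
#include <stddef.h>

/* Array-of-structures: x and y of each point sit next to each other. */
typedef struct {
    float x;
    float y;
} PointAoS;

/* Scale only the x field; consecutive x values are 2 floats apart (strided). */
void scale_x_aos(PointAoS *pts, size_t n, float s) {
    for (size_t i = 0; i < n; ++i)
        pts[i].x *= s;
}

/* Structure-of-arrays: all x values live in one contiguous array,
   so N consecutive elements can fill one vector register. */
void scale_x_soa(float *xs, size_t n, float s) {
    for (size_t i = 0; i < n; ++i)
        xs[i] *= s;
}
```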

2540 questions

1 answer

What is the best way to loop AVX for un-even non-aligned array?

If the array length is not divisible by 8 (for integers), what is the best way to write a loop over it? One possible way I have figured out so far is to split it into 2 separate loops: 1 main loop for almost all elements, and 1 tail loop with maskload/maskstore for…

asked Nov 21 '22 at 08:10

Vladislav Kogan
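The main-loop-plus-tail split described in the excerpt above can be sketched in portable C. This is a scalar stand-in, not the asker's code: the function name `add_constant` and the `LANES` constant are assumptions for illustration. With AVX2, the inner lane loop would be one 256-bit operation, and the tail could use masked loads and stores as the question suggests instead of a scalar loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Eight 32-bit integers fill one 256-bit register. */
enum { LANES = 8 };

/* Add k to every element, split into a full-width main loop and a tail. */
void add_constant(int32_t *data, size_t n, int32_t k) {
    size_t main_end = n - (n % LANES);  /* largest multiple of LANES <= n */
    size_t i = 0;

    /* Main loop: each iteration covers one full group of LANES elements. */
    for (; i < main_end; i += LANES) {
        for (size_t lane = 0; lane < LANES; ++lane)
            data[i + lane] += k;
    }

    /* Tail loop: the remaining n % LANES elements, one at a time. */
    for (; i < n; ++i)
        data[i] += k;
}
```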

0 answers

how to use vpblendmd in avx-512 assembly

The Intel® Intrinsics Guide says Instruction: vpblendmd zmm {k}, zmm, zmm. I don't understand how to use the {k}. For example, I wrote the following code: vpblendmd %zmm1,%zmm2,%zmm2{0xAAAA} but it fails to compile and the compiler…

assembly x86 simd avx512

asked Nov 18 '22 at 03:28

anna

0 answers

Accuracy of the ARM NEON vrsqrteq intrinsic

I am using the ARM NEON vrsqrteq intrinsic to calculate the approximate reciprocal square root of a vector of floats. I would like to know the accuracy of that approximation. However, I can't find any documentation that provides this. The Neon…

floating-point arm simd intrinsics neon

asked Nov 14 '22 at 15:28

jonicho

0 answers

SIMD intrinsics slower for cross products over an array of points than whatever GCC -O3 -march=native does on its own?

I am trying to use SIMD (x86 immintrin.h) to speed up my math. The code looks like this: #include #include #include class Point2 { public: Point2() = default; Point2(double xx, double yy):x_(xx), y_(yy) {} //…

c++ gcc x86-64 compiler-optimization simd

asked Nov 14 '22 at 11:32

komonzhang

0 answers

Efficiently find indices of 1-bits in large array, using SIMD

If I have a very large array of bytes and want to find the indices of all 1-bits, counting indices from the leftmost bit, how do I do this efficiently, probably using SIMD? (For finding the first 1-bit, see an earlier question. This question produces an…

c++ c++20 simd avx512 sse2

asked Nov 08 '22 at 06:26

Arty

0 answers

C++ std::countr_zero() in SIMD 128/256/512 (find position of least significant 1 bit in 128/256/512-bit number)

If I have a 128-, 256-, or 512-bit memory region, how can I find the number of consecutive zero bits starting from the least significant bit (left-most byte)? I can do: #include int CountRZero512(uint64_t const * ptr) { for (int i = 0;…

c++ c++20 simd avx512 sse2

asked Nov 07 '22 at 19:27

Arty
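For reference, a scalar baseline for the trailing-zero count asked about above might look like the sketch below. It is not the asker's truncated code: the function name `count_trailing_zeros_words` is hypothetical, it assumes word 0 holds the least significant 64 bits, and it relies on the GCC/Clang builtin `__builtin_ctzll`. It returns the total bit width when every word is zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Count consecutive zero bits from the least significant end of a
   multi-word number. Assumption: words[0] is the least significant word. */
int count_trailing_zeros_words(const uint64_t *words, size_t nwords) {
    for (size_t i = 0; i < nwords; ++i) {
        if (words[i] != 0) {
            /* __builtin_ctzll is undefined for 0, hence the check above. */
            return (int)(i * 64) + __builtin_ctzll(words[i]);
        }
    }
    return (int)(nwords * 64);  /* all bits are zero */
}
```

A 512-bit region is then `count_trailing_zeros_words(ptr, 8)`. A SIMD version would typically compare whole vectors against zero first and only fall back to a scalar bit scan on the first non-zero word.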

2 answers

How to deal with the lack of `simd_packed_float3` in Swift

There is no simd_packed_float3 type in Swift. Why is that a problem? Consider this Metal struct: struct Test{ packed_float3 x; float y; }; First of all, you can't calculate a buffer pointer to address the memory of y, since you can't do…

swift simd metal

asked Nov 04 '22 at 09:39

Roman Gaditskii

1 answer

Search over an array of 14 integers, build a mask and return the match on ARMv8a using NEON

For my open source project cachegrand we are implementing AARCH64 support, and although most of the port is complete, we are still sorting out a feature to perform an accelerated array search using NEON instructions. The logic we use is pretty simple: in…

linux gcc arm simd neon

asked Oct 18 '22 at 22:10

Daniele Salvatore Albano

1 answer

SIMD transposition of 8x8 matrix of 32-bit values in Java

I found the following code in C++ for fast transposition of an 8x8 matrix of 32-bit values: https://stackoverflow.com/a/51887176/1915854 inline void Transpose8x8Shuff(unsigned long *in) { __m256 *inI = reinterpret_cast<__m256 *>(in); …

java c++ vectorization simd project-panama

asked Oct 06 '22 at 17:42

Serge Rogatch

1 answer

JUnit tests do not seem to get run with --add-modules=jdk.incubator.vector from Maven

I've added SIMD code to a Java application that uses Maven to build, and now I have to run it like this: mvn exec:java -Dexec.mainClass="com.path.to.app.MainClass" -Dexec.classpathScope=runtime -Dexec.systemProperties="-da…

java maven junit simd java-module

asked Oct 05 '22 at 09:08

Serge Rogatch

1 answer

Why on earth would I want to use PMULHRSW/VPMULHRSW?

I was looking for an appropriate AVX2 multiplication instruction to use in my code, and came across the vpmulhrsw (_mm256_mulhrs_epi16(__m256i a, __m256i b)) instruction. The description on the Intel Intrinsics Guide says: Multiply packed signed…

x86 multiplication simd micro-optimization avx2

asked Oct 04 '22 at 03:46

Bernard

0 answers

_mm_load_si128 is NOT throwing on unaligned access

Intel's manual mentions that it may generate an exception; the wording seems a little bit interesting. Load 128-bits of integer data from memory into dst. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be…

c++ visual-c++ simd sse memory-alignment

asked Sep 30 '22 at 17:55

Hasan Emrah Süngü

1 answer

How do you implement an efficient parallel SIMD compare and select in Cg?

How do you do parallel selection efficiently? For example, given this scalar code, is there a way to write it so the Cg compiler will make the code execute in parallel / SIMD (and potentially using a branch-free selection as well)? Out.x =…

selection parallel-processing shader simd cg

asked Sep 12 '11 at 16:18

Adisak

1 answer

Vector overload of a function (provide a manually vectorized version of a function for auto-vectorization to use)

I am using C, and I want to have two versions of the same function, a scalar version and a vector version. The two functions have the same signature, and the compiler should pick the correct version depending on the context - if the context is…

gcc x86-64 vectorization openmp simd

asked Sep 15 '22 at 11:13

Bogi

2 answers

How to use AVX intrinsics to compare two vectors of packed double precision in C

I would like to use _mm512_mask_cmple_pd_mask to compare two packed double precision vectors. My issue is that the result comes as an __mmask8 type... So I guess my question is how do I convert such a mask into packed integer vectors, so I can use the…

simd intrinsics avx avx512

asked Sep 13 '22 at 21:13

Jofre
