Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include: x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To efficiently use SIMD instructions, data needs to be in structure-of-arrays form and should occur in longer streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.
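
As a rough illustration of the structure-of-arrays point above, here is a minimal sketch in plain C. The type and function names (`PointAoS`, `PointsSoA`, `squared_lengths_soa`) are hypothetical and chosen only for this example. With the SoA layout each field is a contiguous, unit-stride stream, which is the access pattern vector loads are designed for.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structures: the x, y, z of one point sit next to each other,
   so a load of consecutive floats mixes different fields. */
typedef struct { float x, y, z; } PointAoS;

/* Structure-of-arrays: each field is its own contiguous stream. */
typedef struct {
    const float *x;
    const float *y;
    const float *z;
    size_t n;
} PointsSoA;

/* Element-wise squared length; every array is read with unit stride. */
void squared_lengths_soa(const PointsSoA *p, float *out) {
    assert(p != NULL && out != NULL);
    for (size_t i = 0; i < p->n; ++i) {
        out[i] = p->x[i] * p->x[i] + p->y[i] * p->y[i] + p->z[i] * p->z[i];
    }
}
```

Whether a compiler actually vectorizes this loop depends on the compiler, flags, and aliasing assumptions, so it is worth measuring rather than assuming a speedup.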

2540 questions

3 answers

How to solve the 32-byte-alignment issue for AVX load/store operations?

I am having an alignment issue while using ymm registers, with some snippets of code that seem fine to me. Here is a minimal working example: #include #include inline void ones(float *a) { __m256 out_aligned =…

asked Sep 16 '15 at 14:57

romeric

2,325
3
19
35
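
The question above is truncated, so the following is not the asker's code. It is only a sketch of one common way to obtain 32-byte-aligned storage in C11, using the standard `aligned_alloc`. The helper names `alloc_floats_aligned32` and `is_aligned32` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define AVX_ALIGN 32u

/* Allocate room for `count` floats on a 32-byte boundary.
   aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the result with free(). */
float *alloc_floats_aligned32(size_t count) {
    assert(count <= (SIZE_MAX - AVX_ALIGN) / sizeof(float)); /* no overflow */
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + AVX_ALIGN - 1) / AVX_ALIGN * AVX_ALIGN;
    if (rounded == 0) {
        rounded = AVX_ALIGN;
    }
    return (float *)aligned_alloc(AVX_ALIGN, rounded);
}

/* Non-zero if p is on a 32-byte boundary. */
int is_aligned32(const void *p) {
    return ((uintptr_t)p % AVX_ALIGN) == 0;
}
```

When the alignment of a pointer cannot be guaranteed, the unaligned intrinsics `_mm256_loadu_ps` and `_mm256_storeu_ps` are the usual alternative to the aligned forms.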

4 answers

SSE2 integer overflow checking

When using SSE2 instructions such as PADDD (i.e., the _mm_add_epi32 intrinsic), is there a way to check whether any of the operations overflowed? I thought that maybe a flag on the MXCSR control register may get set after an overflow, but I don't…

c++ x86 sse simd sse2

asked May 09 '12 at 06:44

Igor ostrovsky

7,282
2
29
28
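
PADDD does not report overflow through a flag, so a check has to be computed from the operands and the result. The sketch below shows one way to do that for signed lanes, assuming an x86-64 target where SSE2 is baseline. It uses the identity that a signed addition overflowed exactly when both inputs differ in sign from the sum. The function name `add_epi32_overflow_mask` is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

/* Adds four signed 32-bit lanes with wraparound and stores the sums in out.
   Returns a 4-bit mask: bit i is set when lane i overflowed. */
int add_epi32_overflow_mask(const int32_t a[4], const int32_t b[4],
                            int32_t out[4]) {
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i sum = _mm_add_epi32(va, vb);
    _mm_storeu_si128((__m128i *)out, sum);

    /* Sign bit of (a ^ sum) & (b ^ sum) is 1 only on signed overflow. */
    __m128i ovf = _mm_and_si128(_mm_xor_si128(va, sum),
                                _mm_xor_si128(vb, sum));

    /* movemask collects the sign bit of each 32-bit lane. */
    return _mm_movemask_ps(_mm_castsi128_ps(ovf));
}
```

A return value of zero means no lane overflowed, so the common case costs a few extra logical operations and one compare.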

3 answers

What's the difference between logical SSE intrinsics?

Is there any difference between logical SSE intrinsics for different types? For example, if we take the OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128, all of which do the same thing: compute the bitwise OR of their operands.…

c sse simd intrinsics sse2

asked May 10 '10 at 17:32

user283145

3 answers

Fast, branchless unsigned int absolute difference

I have a program which spends most of its time computing the Euclidean distance between RGB values (3-tuples of unsigned 8-bit Word8). I need a fast, branchless unsigned int absolute difference function such that unsigned_difference :: Word8 ->…

performance haskell bit-manipulation simd

asked Mar 17 '14 at 00:27

cdk

6,698
24
51
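
The question is about Haskell, but the underlying trick is easy to show with SSE2 intrinsics in C. This is a sketch under the assumption of an x86-64 target: unsigned saturating subtraction clamps negative results to zero, so OR-ing the two directions yields the absolute difference without a branch. The function name `absdiff_u8x16` is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

/* out[i] = |a[i] - b[i]| for 16 unsigned bytes, branch-free. */
void absdiff_u8x16(const uint8_t a[16], const uint8_t b[16],
                   uint8_t out[16]) {
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* One of the two saturating differences is always zero. */
    __m128i d = _mm_or_si128(_mm_subs_epu8(va, vb),
                             _mm_subs_epu8(vb, va));

    _mm_storeu_si128((__m128i *)out, d);
}
```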

5 answers

SSE-copy, AVX-copy and std::copy performance

I tried to improve the performance of a copy operation via SSE and AVX: #include const int sz = 1024; float *mas = (float *)_mm_malloc(sz*sizeof(float), 16); float *tar = (float *)_mm_malloc(sz*sizeof(float), 16); …

c++ performance sse simd avx

asked Aug 19 '13 at 13:04

gorill

1,623
3
20
29

4 answers

How to find the horizontal maximum in a 256-bit AVX vector

I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value. My attempts all ended up using a lot of shuffling of the…

x86 simd avx vector-processing avx2

asked Mar 20 '12 at 21:48

Luigi Castelli

1 answer

AVX2: Computing dot product of 512 float arrays

I will preface this by saying that I am a complete beginner at SIMD intrinsics. Essentially, I have a CPU which supports the AVX2 intrinsics (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like to know the fastest way to compute the dot product…

c++ simd avx2 dot-product fma

asked Dec 27 '19 at 00:23

cyrusbehr

1,100
1
12
32

3 answers

SSE: Difference between _mm_load/store vs. using direct pointer access

Suppose I want to add two buffers and store the result. Both buffers are already allocated 16-byte aligned. I found two examples of how to do that. The first one is using _mm_load to read the data from the buffer into an SSE register, does the add…

x86 sse simd

asked Jun 14 '12 at 13:36

Peter

2 answers

Reference manual/tutorial for x86 SIMD intrinsics?

I'm looking into using these to improve the performance of some code, but good documentation seems hard to find for the functions defined in the *mmintrin.h headers. Can anybody provide me with pointers to good info on these? EDIT: particularly…

simd sse intrinsics avx

asked Jul 28 '11 at 11:03

BD at Rivenhill

12,395
10
46
49

3 answers

How to convert a binary integer number to a hex string?

Given a number in a register (a binary integer), how to convert it to a string of hexadecimal ASCII digits? (i.e. serialize it into a text format.) Digits can be stored in memory or printed on the fly, but storing in memory and printing all at once…

assembly x86 hex simd avx512

asked Dec 17 '18 at 22:14

Peter Cordes

328,167
45
605
847
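
The question asks about assembly, and its excerpt is truncated, so the following is only a scalar C sketch of the basic idea: each hex digit corresponds to one 4-bit nibble, taken most significant first. The function name `u32_to_hex` and the choice of lowercase digits are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Writes the 8 hex digits of value into out, followed by a terminator. */
void u32_to_hex(uint32_t value, char out[9]) {
    static const char digits[] = "0123456789abcdef";
    assert(out != 0);
    for (int i = 0; i < 8; ++i) {
        /* Shift the i-th nibble (from the top) down to the low 4 bits. */
        unsigned nibble = (value >> (28 - 4 * i)) & 0xFu;
        out[i] = digits[nibble];
    }
    out[8] = '\0';
}
```

Building the whole string in memory first, as the question suggests, allows it to be printed with a single call.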

2 answers

How to transpose a 16x16 matrix using SIMD instructions?

I'm currently writing some code targeting Intel's forthcoming AVX-512 SIMD instructions, which support 512-bit operations. Now assuming there's a matrix represented by 16 SIMD registers, each holding 16 32-bit integers (corresponds to a row), how…

assembly matrix intel simd avx512

asked Apr 08 '15 at 15:40

lei_z

1,049
2
13
27

3 answers

Should I use SIMD or vector extensions or something else?

I'm currently developing an open-source 3D application framework in C++ (with C++11). My own math library is designed like the XNA math library, also with SIMD in mind. But currently it is not really fast, and it has problems with memory alignment, but…

c++ gcc sse simd

asked May 23 '12 at 11:10

pearcoding

1,149
1
9
28

2 answers

SSE multiplication of 4 32-bit integers

How do I multiply four 32-bit integers by another four 32-bit integers? I didn't find any instruction that can do it.

x86 sse simd multiplication sse2

asked May 08 '12 at 14:37

Yury

1,169
2
16
29
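
SSE4.1 provides `_mm_mullo_epi32` (PMULLD) for this. With only SSE2, a commonly used workaround combines two `_mm_mul_epu32` products, each of which multiplies the even-indexed 32-bit lanes into 64-bit results. The sketch below assumes an x86-64 target and keeps the low 32 bits of each product, which are the same for signed and unsigned inputs. The function name `mul_epi32x4` is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

/* out[i] = low 32 bits of a[i] * b[i], using SSE2 only. */
void mul_epi32x4(const int32_t a[4], const int32_t b[4], int32_t out[4]) {
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* 64-bit products of lanes 0 and 2. */
    __m128i even = _mm_mul_epu32(va, vb);
    /* Shift lanes 1 and 3 into positions 0 and 2, then multiply. */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(va, 4),
                                _mm_srli_si128(vb, 4));

    /* Move the low halves of each product into lanes 0 and 1 ... */
    __m128i even_lo = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    __m128i odd_lo = _mm_shuffle_epi32(odd, _MM_SHUFFLE(0, 0, 2, 0));

    /* ... and interleave them back into the order p0, p1, p2, p3. */
    _mm_storeu_si128((__m128i *)out, _mm_unpacklo_epi32(even_lo, odd_lo));
}
```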

2 answers

Push XMM register to the stack

Is there a way of pushing a packed doubleword integer from an XMM register to the stack, and then later popping it back when needed? Ideally I am looking for something like PUSH or POP for general purpose registers, I have checked Intel manuals but I…

assembly x86 simd sse

asked Apr 15 '12 at 12:13

Daniel Gruszczyk

5,379
8
47
86

3 answers

Adding the components of an SSE register

I want to add the four components of an SSE register to get a single float. This is how I do it now: float a[4]; _mm_storeu_ps(a, foo128); float x = a[0] + a[1] + a[2] + a[3]; Is there an SSE instruction that directly achieves this?

c++ floating-point sse simd addition

asked Dec 16 '11 at 15:06

fredoverflow

256,549
94
388
662
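
No single SSE1/SSE2 instruction produces this sum, but it can stay in registers with two shuffles and two adds instead of a store and three scalar adds. The following is a sketch of that pattern, assuming an x86-64 target. The wrapper name `hsum4` is hypothetical, and the addition order is (v0 + v1) + (v2 + v3), which can round differently from the left-to-right scalar sum in the question.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Returns p[0] + p[1] + p[2] + p[3] using SSE shuffles. */
float hsum4(const float *p) {
    assert(p != 0);
    __m128 v = _mm_loadu_ps(p);

    /* Swap neighbours: [v1, v0, v3, v2]. */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* Pairwise sums: lane 0 = v0 + v1, lane 2 = v2 + v3. */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Bring lane 2 of sums down to lane 0. */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Lane 0 = (v0 + v1) + (v2 + v3). */
    sums = _mm_add_ss(sums, shuf);

    return _mm_cvtss_f32(sums);
}
```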
