Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include: x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To efficiently use SIMD instructions, data needs to be in structure-of-arrays form and should occur in longer streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.
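As a rough illustration of the structure-of-arrays point above, here is a minimal sketch in portable C that contrasts the two layouts. The type names, function names, and the `N_POINTS` batch size are hypothetical and chosen only for this example. No intrinsics are used; the assumption is that an optimizing compiler can auto-vectorize the unit-stride SoA loop more easily than the strided AoS loop, which is not guaranteed on every compiler or target.

```c
#include <stddef.h>

#define N_POINTS 8  /* hypothetical batch size, chosen for illustration */

/* Array-of-structures (AoS): x, y, z of one point sit next to each other,
   so consecutive x values are 12 bytes apart in memory. */
struct point_aos {
    float x, y, z;
};

/* Structure-of-arrays (SoA): each component is its own contiguous stream. */
struct points_soa {
    float x[N_POINTS];
    float y[N_POINTS];
    float z[N_POINTS];
};

/* Squared length of each point, AoS input: strided access per component. */
void squared_lengths_aos(const struct point_aos *p, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = p[i].x * p[i].x + p[i].y * p[i].y + p[i].z * p[i].z;
}

/* Same computation, SoA input: unit-stride loads from three streams.
   n must not exceed N_POINTS. */
void squared_lengths_soa(const struct points_soa *p, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = p->x[i] * p->x[i] + p->y[i] * p->y[i] + p->z[i] * p->z[i];
}
```

Both functions compute the same result. Only the memory layout differs, and that layout determines whether one vector load can fetch several useful elements at once.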

2540 questions
21
votes
3 answers

How to solve the 32-byte-alignment issue for AVX load/store operations?

I am having an alignment issue while using ymm registers, with some snippets of code that seem fine to me. Here is a minimal working example: #include #include inline void ones(float *a) { __m256 out_aligned =…
romeric
21
votes
4 answers

SSE2 integer overflow checking

When using SSE2 instructions such as PADDD (i.e., the _mm_add_epi32 intrinsic), is there a way to check whether any of the operations overflowed? I thought that maybe a flag on the MXCSR control register may get set after an overflow, but I don't…
Igor ostrovsky
20
votes
3 answers

What's the difference between logical SSE intrinsics?

Is there any difference between logical SSE intrinsics for different types? For example if we take OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128 all of which do the same thing: compute bitwise OR of their operands.…
user283145
20
votes
3 answers

Fast, branchless unsigned int absolute difference

I have a program which spends most of its time computing the Euclidean distance between RGB values (3-tuples of unsigned 8-bit Word8). I need a fast, branchless unsigned int absolute difference function such that unsigned_difference :: Word8 ->…
cdk
20
votes
5 answers

SSE-copy, AVX-copy and std::copy performance

I tried to improve the performance of a copy operation via SSE and AVX: #include const int sz = 1024; float *mas = (float *)_mm_malloc(sz*sizeof(float), 16); float *tar = (float *)_mm_malloc(sz*sizeof(float), 16); …
gorill
19
votes
4 answers

How to find the horizontal maximum in a 256-bit AVX vector

I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value. My attempts all ended up using a lot of shuffling of the…
Luigi Castelli
19
votes
1 answer

AVX2: Computing dot product of 512 float arrays

I will preface this by saying that I am a complete beginner at SIMD intrinsics. Essentially, I have a CPU which supports the AVX2 intrinsics (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like to know the fastest way to compute the dot product…
cyrusbehr
19
votes
3 answers

SSE: Difference between _mm_load/store vs. using direct pointer access

Suppose I want to add two buffers and store the result. Both buffers are already allocated 16-byte aligned. I found two examples of how to do that. The first one is using _mm_load to read the data from the buffer into an SSE register, does the add…
Peter
18
votes
2 answers

Reference manual/tutorial for x86 SIMD intrinsics?

I'm looking into using these to improve the performance of some code, but good documentation seems hard to find for the functions defined in the *mmintrin.h headers. Can anybody provide me with pointers to good info on these? EDIT: particularly…
BD at Rivenhill
18
votes
3 answers

How to convert a binary integer number to a hex string?

Given a number in a register (a binary integer), how to convert it to a string of hexadecimal ASCII digits? (i.e. serialize it into a text format.) Digits can be stored in memory or printed on the fly, but storing in memory and printing all at once…
Peter Cordes
18
votes
2 answers

How to transpose a 16x16 matrix using SIMD instructions?

I'm currently writing some code targeting Intel's forthcoming AVX-512 SIMD instructions, which support 512-bit operations. Now assuming there's a matrix represented by 16 SIMD registers, each holding 16 32-bit integers (corresponds to a row), how…
lei_z
18
votes
3 answers

Should I use SIMD or vector extensions or something else?

I'm currently developing an open source 3D application framework in C++ (with C++11). My own math library is designed like the XNA math library, also with SIMD in mind. But currently it is not really fast, and it has problems with memory alignment, but…
pearcoding
18
votes
2 answers

SSE multiplication of 4 32-bit integers

How do I multiply four 32-bit integers by another four 32-bit integers? I couldn't find any instruction that can do it.
Yury
18
votes
2 answers

Push XMM register to the stack

Is there a way of pushing a packed doubleword integer from an XMM register to the stack, and then popping it back later when needed? Ideally I am looking for something like PUSH or POP for general purpose registers. I have checked Intel manuals but I…
Daniel Gruszczyk
17
votes
3 answers

adding the components of an SSE register

I want to add the four components of an SSE register to get a single float. This is how I do it now: float a[4]; _mm_storeu_ps(a, foo128); float x = a[0] + a[1] + a[2] + a[3]; Is there an SSE instruction that directly achieves this?
fredoverflow