Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data needs to be in structure-of-arrays form and should be processed in long, contiguous streams. Naively "SIMD-optimized" code frequently surprises by running slower than the original.
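As a minimal illustration of the idea, here is a sketch using x86 SSE2 intrinsics (`add4` is an invented name for this example): one instruction adds four 32-bit integers at once.

```cpp
#include <emmintrin.h>  // SSE2

// One _mm_add_epi32 adds four packed 32-bit ints in a single instruction.
void add4(const int* a, const int* b, int* out)
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
}
```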
Questions tagged [simd]
2540 questions
30 votes · 2 answers
How to implement atoi using SIMD?
I'd like to try writing an atoi implementation using SIMD instructions, to be included in RapidJSON (a C++ JSON reader/writer library). It currently has some SSE2 and SSE4.2 optimizations in other places.
If it's a speed gain, multiple atoi results…

asked by the_drow (18,571)
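For the fixed-width case, the standard trick looks roughly like this (a sketch, not RapidJSON's actual code; `atoi8_sse2` is an invented name, and it assumes exactly 8 valid ASCII digits with no sign, validation, or overflow handling):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

uint32_t atoi8_sse2(const char* p)
{
    __m128i zero = _mm_setzero_si128();
    // Load 8 bytes and zero-extend each digit byte to a 16-bit lane.
    __m128i chunk  = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p));
    __m128i digits = _mm_sub_epi16(_mm_unpacklo_epi8(chunk, zero),
                                   _mm_set1_epi16('0'));
    // Combine adjacent digits: d0*10+d1, d2*10+d3, d4*10+d5, d6*10+d7.
    __m128i pairs = _mm_madd_epi16(digits,
                                   _mm_setr_epi16(10, 1, 10, 1, 10, 1, 10, 1));
    // Narrow the four pair values to 16 bits and combine again:
    // (d0d1)*100 + (d2d3)  and  (d4d5)*100 + (d6d7).
    __m128i quads = _mm_madd_epi16(_mm_packs_epi32(pairs, zero),
                                   _mm_setr_epi16(100, 1, 100, 1, 0, 0, 0, 0));
    uint32_t hi = static_cast<uint32_t>(_mm_cvtsi128_si32(quads));
    uint32_t lo = static_cast<uint32_t>(
        _mm_cvtsi128_si32(_mm_srli_si128(quads, 4)));
    return hi * 10000 + lo;
}
```

The `_mm_madd_epi16` multiply-accumulate does the digit combining two levels at a time, which is what makes the SIMD version competitive with the scalar loop.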
29 votes · 5 answers
Good portable SIMD library
Can anyone recommend a portable SIMD library that provides a C/C++ API, works with Intel and AMD vector extensions, and is compatible with Visual Studio and GCC? I'm looking to speed up things like scaling a 512x512 array of doubles. Vector dot products, matrix…

asked by Budric (3,599)
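The scaling operation mentioned in the question is simple enough to write directly with SSE2 intrinsics, which every x86-64 compiler supports (a sketch; `scale` is an invented name, and a portable library would dispatch something similar per target):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>

// a[i] *= s for all i: two doubles per iteration, scalar tail for odd n.
void scale(double* a, std::size_t n, double s)
{
    __m128d vs = _mm_set1_pd(s);
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2)
        _mm_storeu_pd(a + i, _mm_mul_pd(_mm_loadu_pd(a + i), vs));
    for (; i < n; ++i)
        a[i] *= s;
}
```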
29 votes · 1 answer
Is my understanding of AoS vs SoA advantages/disadvantages correct?
I've recently been reading about AoS vs SoA structure design and data-oriented design. It's oddly difficult to find information about either, and what I have found seems to assume greater understanding of processor functionality than I possess. That…

asked by P... (655)
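The two layouts being contrasted can be sketched like this (type and function names are invented for illustration): in SoA, each field is a dense, unit-stride stream that a SIMD loop can load directly, while AoS interleaves fields the loop may never touch.

```cpp
#include <cstddef>

// AoS: fields interleaved, one point per struct.
struct PointAoS { float x, y, z; };

// SoA: one contiguous array per field.
struct PointsSoA {
    float* x;
    float* y;
    float* z;
};

// Summing x over SoA reads one dense stream; the AoS equivalent
// strides past y and z on every element.
float sum_x_soa(const PointsSoA& p, std::size_t n)
{
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += p.x[i];
    return s;
}
```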
29 votes · 5 answers
Get member of __m128 by index?
I've got some code, originally given to me by someone working with MSVC, and I'm trying to get it to work on Clang. Here's the function that I'm having trouble with:
float vectorGetByIndex( __m128 V, unsigned int i )
{
    assert( i <= 3 );
    …

asked by benwad (6,414)
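One portable way to complete a function like this is to spill the register to memory and index it (a sketch under that approach; compilers typically turn the store-and-reload into a shuffle or extract anyway):

```cpp
#include <xmmintrin.h>  // SSE
#include <cassert>

float vectorGetByIndex(__m128 V, unsigned int i)
{
    assert(i <= 3);
    alignas(16) float tmp[4];
    _mm_store_ps(tmp, V);  // spill all four lanes to the stack
    return tmp[i];
}
```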
28 votes · 3 answers
How can I exchange the low 128 bits and high 128 bits in a 256 bit AVX (YMM) register
I am porting SSE SIMD code to use the 256 bit AVX extensions and cannot seem to find any instruction that will blend/shuffle/move the high 128 bits and the low 128 bits.
The backing story:
What I really want is VHADDPS/_mm256_hadd_ps to act like…

asked by Mark Borgerding (8,117)
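`_mm256_permute2f128_ps` with immediate `0x01` does exactly this lane swap (a sketch; `swap_halves` is an invented wrapper, and running it requires an AVX-capable CPU):

```cpp
#include <immintrin.h>

// Swap the low and high 128-bit halves of a YMM value.
// imm 0x01: result low lane = source high half, high lane = source low half.
__attribute__((target("avx")))
void swap_halves(const float in[8], float out[8])
{
    __m256 v = _mm256_loadu_ps(in);
    _mm256_storeu_ps(out, _mm256_permute2f128_ps(v, v, 0x01));
}
```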
28 votes · 19 answers
How fast can you make linear search?
I'm looking to optimize this linear search:
static int
linear (const int *arr, int n, int key)
{
    int i = 0;
    while (i < n) {
        if (arr [i] >= key)
            break;
        ++i;
    }
    …

asked by Mark Probst (7,107)
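One common answer is to compare four elements per iteration and use a movemask to locate the first hit (a sketch with an invented name `linear_sse2`; `__builtin_ctz` assumes GCC/Clang):

```cpp
#include <emmintrin.h>  // SSE2

// Return the first index with arr[i] >= key, or n if none.
static int linear_sse2(const int* arr, int n, int key)
{
    __m128i vkey = _mm_set1_epi32(key);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(arr + i));
        // 4 bits of mask per 32-bit lane; all-ones where arr[j] < key.
        int m = _mm_movemask_epi8(_mm_cmplt_epi32(v, vkey));
        if (m != 0xFFFF)                       // some lane is >= key
            return i + __builtin_ctz(~m) / 4;  // first such lane
    }
    for (; i < n; ++i)                         // scalar tail
        if (arr[i] >= key)
            return i;
    return n;
}
```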
27 votes · 3 answers
How to efficiently perform double/int64 conversions with SSE/AVX?
SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers.
_mm_cvtps_epi32()
_mm_cvtepi32_ps()
But there are no equivalents for double-precision and 64-bit integers. In other words, they are…

asked by plasmacel (8,183)
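A well-known workaround for the double-to-int64 direction exploits the binary64 format (a sketch; valid only for values with magnitude below 2^51, and it rounds to nearest rather than truncating):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Adding 1.5 * 2^52 forces the integer value into the low mantissa
// bits; subtracting the constant's bit pattern as int64 recovers it.
__m128i double_to_int64(__m128d x)
{
    const __m128d magic = _mm_set1_pd(6755399441055744.0);  // 1.5 * 2^52
    x = _mm_add_pd(x, magic);
    return _mm_sub_epi64(_mm_castpd_si128(x), _mm_castpd_si128(magic));
}
```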
27 votes · 2 answers
SIMD and difference between packed and scalar double precision
I am reading Intel's intrinsics guide while implementing SIMD support. I have a few points of confusion, and my questions are below.
__m128 _mm_cmpeq_ps (__m128 a, __m128 b) documentation says it is used to compare packed single precision floating points.…

asked by user1461001 (693)
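The packed/scalar distinction is easiest to see side by side (a sketch using the `pd`/`sd` add pair; `packed_vs_scalar` is an invented name): packed operates on every lane, scalar only on the low lane, with the upper lane copied from the first operand.

```cpp
#include <emmintrin.h>  // SSE2

void packed_vs_scalar(double out_p[2], double out_s[2])
{
    __m128d a = _mm_setr_pd(1.0, 2.0);
    __m128d b = _mm_setr_pd(10.0, 20.0);
    _mm_storeu_pd(out_p, _mm_add_pd(a, b));  // packed: {11, 22}
    _mm_storeu_pd(out_s, _mm_add_sd(a, b));  // scalar: {11, 2} (high from a)
}
```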
26 votes · 2 answers
Haskell math performance on multiply-add operation
I'm writing a game in Haskell, and my current pass at the UI involves a lot of procedural generation of geometry. I am currently focused on identifying performance of one particular operation (C-ish pseudocode):
Vec4f multiplier, addend;
Vec4f…

asked by Steven Robertson (473)
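The operation in the excerpt maps directly onto one packed multiply plus one packed add (a C++ intrinsics sketch of the same C-ish pseudocode; on FMA-capable targets `_mm_fmadd_ps` fuses the two):

```cpp
#include <xmmintrin.h>  // SSE

// out = v * multiplier + addend, four floats at a time.
__m128 mul_add(__m128 v, __m128 multiplier, __m128 addend)
{
    return _mm_add_ps(_mm_mul_ps(v, multiplier), addend);
}
```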
26 votes · 2 answers
How are the gather instructions in AVX2 implemented?
Suppose I'm using AVX2's VGATHERDPS - this should load 8 single-precision floats using 8 DWORD indices.
What happens when the data to be loaded exists in different cache-lines? Is the instruction implemented as a hardware loop which fetches…

asked by Anuj Kalia (803)
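How the hardware sequences the loads is microarchitectural, but for reference the usage looks like this (a sketch; `gather8` is an invented wrapper, and running it requires an AVX2-capable CPU):

```cpp
#include <immintrin.h>

// out[i] = base[idx[i]] for 8 floats via VGATHERDPS.
__attribute__((target("avx2")))
void gather8(const float* base, const int idx[8], float out[8])
{
    __m256i vidx = _mm256_loadu_si256(
        reinterpret_cast<const __m256i*>(idx));
    __m256 v = _mm256_i32gather_ps(base, vidx, 4);  // scale = 4 bytes/elem
    _mm256_storeu_ps(out, v);
}
```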
26 votes · 5 answers
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?
The intrinsic:
int mask = _mm256_movemask_epi8(__m256i s1)
creates a mask, with its 32 bits corresponding to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2 for example) I would like to perform…

asked by Satya Arjunan (575)
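The 16-byte SSE2 version of the inverse illustrates the usual technique (a sketch with an invented name; the 32-byte AVX2 analogue broadcasts and tests bits the same way): replicate each mask byte across its lanes, AND with per-lane bit selectors, and compare.

```cpp
#include <emmintrin.h>  // SSE2

// Inverse of the 16-bit _mm_movemask_epi8:
// bit i of mask becomes byte i = 0xFF (set) or 0x00 (clear).
__m128i inverse_movemask_epi8(int mask)
{
    __m128i lo = _mm_set1_epi8(static_cast<char>(mask & 0xFF));
    __m128i hi = _mm_set1_epi8(static_cast<char>((mask >> 8) & 0xFF));
    __m128i v  = _mm_unpacklo_epi64(lo, hi);  // byte i holds its mask byte
    const __m128i bits = _mm_setr_epi8(1, 2, 4, 8, 16, 32, 64, -128,
                                       1, 2, 4, 8, 16, 32, 64, -128);
    return _mm_cmpeq_epi8(_mm_and_si128(v, bits), bits);
}
```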
25 votes · 3 answers
SSE (SIMD): multiply vector by scalar
A common operation I do in my program is scaling vectors by a scalar (V*s, e.g. [1,2,3,4]*2 == [2,4,6,8]). Is there an SSE (or AVX) instruction to do this, other than first loading the scalar in every position in a vector (e.g. _mm_set_ps(2,2,2,2))…

asked by Hallgeir (1,213)
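The idiomatic answer is `_mm_set1_ps` to broadcast the scalar once, then a single packed multiply (a sketch; `scale_ps` is an invented name, and AVX adds a dedicated `vbroadcastss` the compiler will use where available):

```cpp
#include <xmmintrin.h>  // SSE

// v * s: broadcast the scalar into all four lanes, then multiply.
__m128 scale_ps(__m128 v, float s)
{
    return _mm_mul_ps(v, _mm_set1_ps(s));
}
```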
25 votes · 1 answer
GCC fails to optimize aligned std::array like C array
Here's some code which GCC 6 and 7 fail to optimize when using std::array:
#include <array>
static constexpr size_t my_elements = 8;
class Foo
{
public:
#ifdef C_ARRAY
    typedef double Vec[my_elements] alignas(32);
#else
    typedef…

asked by John Zwinck (239,568)
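The std::array variant being compared can be written with the alignment attached to the member rather than the typedef (a sketch; whether GCC 6/7 vectorizes loops over it as well as the C-array form is exactly what the question is about):

```cpp
#include <array>
#include <cstddef>

static constexpr std::size_t my_elements = 8;

struct Foo {
    // alignas on the member guarantees 32-byte alignment of the data.
    alignas(32) std::array<double, my_elements> v;
};

double sum(const Foo& f)
{
    double s = 0.0;
    for (double d : f.v)
        s += d;
    return s;
}
```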
25 votes · 2 answers
Expensive to wrap System.Numerics.VectorX - why?
TL;DR: Why is wrapping the System.Numerics.Vectors type expensive, and is there anything I can do about it?
Consider the following piece of code:
[MethodImpl(MethodImplOptions.NoInlining)]
private static long GetIt(long a, long b)
{
    var x =…

asked by Krumelur (31,081)
24 votes · 4 answers
How to move 128-bit immediates to XMM registers
There already is a question on this, but it was closed as "ambiguous" so I'm opening a new one - I've found the answer, maybe it will help others too.
The question is: how do you write a sequence of assembly code to initialize an XMM register with a…

asked by Virgil (3,022)
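There is no true 128-bit immediate form; at the intrinsics level the usual route is to let the compiler place the constant in .rodata and load it (a sketch; `make_const` and the constant values are invented for this example):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// The compiler emits the 128-bit constant into .rodata and
// initializes the XMM register with a single movdqa/movaps load.
__m128i make_const()
{
    return _mm_set_epi64x(INT64_C(0x0123456789abcdef),   // high 64 bits
                          INT64_C(0x0f0e0d0c0b0a0908));  // low 64 bits
}
```

In hand-written assembly the equivalent is an aligned 16-byte constant in a data section plus a `movaps xmm0, [rel constant]`.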