Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk, or vector, of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data needs to be in structure-of-arrays form and should occur in longer streams. Naively "SIMD optimized" code often turns out to run slower than the original.
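For illustration, a minimal C++ sketch (names hypothetical) of the array-of-structures versus structure-of-arrays layouts mentioned above; the SoA form keeps each field in a contiguous stream that SIMD lanes can load directly:

#include <cstddef>

// Array of structures: x, y, z of one element are interleaved, so a
// vector load pulls in fields the loop may not need.
struct ParticleAoS { float x, y, z; };

// Structure of arrays: each field is its own contiguous stream, which
// is the layout SIMD loads and stores want.
struct ParticlesSoA {
    float *x;
    float *y;
    float *z;
};

// Scalar loop over the SoA layout; with contiguous data a compiler can
// map consecutive iterations onto SIMD lanes.
void scale_x(ParticlesSoA &p, float s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        p.x[i] *= s;
}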
Questions tagged [simd]
2540 questions
17
votes
1 answer
How to compare two vectors using SIMD and get a single boolean result?
I have two vectors of 4 integers each and I'd like to use a SIMD command to compare them (say generate a result vector where each entry is 0 or 1 according to the result of the comparison).
Then, I'd like to compare the result vector to a vector of…

N.M
- 685
- 1
- 9
- 22
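One common approach (a sketch, assuming SSE2 and four 32-bit integer elements as in the question) is to compare lane-wise and then collapse the per-lane mask into a single integer with a movemask:

#include <emmintrin.h>  // SSE2

// Returns true only if all four 32-bit lanes of a and b compare equal.
bool all_equal(__m128i a, __m128i b) {
    __m128i eq = _mm_cmpeq_epi32(a, b);      // per lane: 0xFFFFFFFF if equal, else 0
    return _mm_movemask_epi8(eq) == 0xFFFF;  // gather the 16 byte sign bits into one int
}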
17
votes
4 answers
Any Lisp extensions for CUDA?
I just noted that one of the first languages for the Connection-Machine of W.D. Hillis was *Lisp, an extension of Common Lisp with parallel constructs. The Connection-Machine was a massively parallel computer with SIMD architecture, much the same as…

Halberdier
- 1,164
- 11
- 15
17
votes
3 answers
Why does SSE set (_mm_set_ps) reverse the order of arguments
I recently noticed that
__m128 m = _mm_set_ps(0,1,2,3);
puts the 4 floats into reverse order when cast to a float array:
float* p = (float*)(&m);
// p[0] == 3
// p[1] == 2
// p[2] == 1
// p[3] == 0
The same happens with a union { __m128 m;…

Inverse
- 4,408
- 2
- 26
- 35
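In short, and as a hedged sketch: _mm_set_ps takes its arguments from the highest lane down to the lowest, while _mm_setr_ps takes them in memory order, so the two initializers below produce the same register contents:

#include <xmmintrin.h>  // SSE

// _mm_set_ps(e3, e2, e1, e0): the first argument lands in the highest lane.
__m128 a = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);

// _mm_setr_ps(e0, e1, e2, e3): arguments in ascending memory order.
__m128 b = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);

// Viewed through a float*, both give p[0]==0, p[1]==1, p[2]==2, p[3]==3.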
17
votes
1 answer
Do I get a performance penalty when mixing SSE integer/float SIMD instructions
I've used x86 SIMD instructions (SSE1234) in the form of intrinsics quite a lot lately. What I found frustrating is that the SSE ISA has several simple instructions that are available only for floats or only for integers, but in theory should…
user283145
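To illustrate the kind of mixing the question describes (a sketch, not a benchmark): the bitwise AND below can be done with either a float-domain or an integer-domain instruction, and on many microarchitectures forwarding a result between those execution domains adds a small bypass latency.

#include <emmintrin.h>  // SSE2

// Stays in the floating-point domain.
__m128 and_fp_domain(__m128 v, __m128 mask) {
    return _mm_and_ps(v, mask);
}

// Same bits computed in the integer domain; the casts emit no instructions.
__m128 and_int_domain(__m128 v, __m128 mask) {
    __m128i r = _mm_and_si128(_mm_castps_si128(v), _mm_castps_si128(mask));
    return _mm_castsi128_ps(r);
}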
17
votes
1 answer
How do the Conflict Detection instructions make it easier to vectorize loops?
The AVX512CD instruction families are: VPCONFLICT, VPLZCNT and VPBROADCASTM.
The Wikipedia section about these instructions says:
The instructions in AVX-512 conflict detection (AVX-512CD) are
designed to help efficiently calculate conflict-free…

zr.
- 7,528
- 11
- 50
- 84
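A rough sketch of the pattern AVX-512CD targets (assuming AVX-512F/CD): a vectorized histogram or scatter update is only safe when no two lanes write the same index, and _mm512_conflict_epi32 (VPCONFLICTD) reports exactly those intra-register duplicates.

#include <immintrin.h>  // AVX-512F + AVX-512CD

// For a vector of 16 bucket indices, return a mask of lanes that collide
// with an earlier lane. If the mask is 0, all 16 updates can proceed in
// parallel; otherwise the colliding lanes need extra handling.
__mmask16 conflicting_lanes(__m512i indices) {
    __m512i conflicts = _mm512_conflict_epi32(indices);  // per lane: bitmask of earlier equal lanes
    return _mm512_test_epi32_mask(conflicts, conflicts); // nonzero lane -> conflict
}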
17
votes
2 answers
Does R leverage SIMD when doing vectorized calculations?
Given a dataframe like this in R:
+---+---+
| X | Y |
+---+---+
| 1 | 2 |
| 2 | 4 |
| 4 | 5 |
+---+---+
If a vectorized operation is performed on this dataframe, like so:
data$Z <- data$X * data$Y
Will this leverage the processor's…

Jochen van Wylick
- 5,303
- 4
- 42
- 64
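For context, a hedged sketch (not R's actual source): the element-wise multiply underneath data$X * data$Y ultimately runs a plain loop like the one below inside the interpreter, so whether SIMD is used depends on how that loop was compiled (auto-vectorization, target flags), not on anything written in R.

#include <cstddef>

// Hypothetical inner loop of an element-wise multiply: compiled with
// optimization and a suitable target, a compiler may auto-vectorize it;
// nothing in the code asks for SIMD explicitly.
void mul(const double *x, const double *y, double *z, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        z[i] = x[i] * y[i];
}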
17
votes
2 answers
Why does GCC or Clang not optimise reciprocal to 1 instruction when using fast-math
Does anyone know why GCC/Clang will not optimise function test1 in the code sample below to simply use the RCPPS instruction when using the fast-math option? Is there another compiler flag that would generate this code?
typedef float float4…

Chris_F
- 4,991
- 5
- 33
- 63
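The excerpt's code is truncated; a minimal reconstruction of the kind of function being discussed (GCC/Clang vector extensions, names assumed) is below. Even under -ffast-math, compilers typically emit DIVPS, or RCPPS plus a Newton-Raphson refinement step, rather than a bare RCPPS, because the hardware reciprocal approximation only provides roughly 12 bits of precision.

// Compile with e.g. -O2 -msse -ffast-math (flags as discussed in the question)
typedef float float4 __attribute__((vector_size(16)));

// Reciprocal of four packed floats; the question asks why this does not
// become a single RCPPS under fast-math.
float4 test1(float4 v) {
    float4 one = {1.0f, 1.0f, 1.0f, 1.0f};
    return one / v;
}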
17
votes
3 answers
practical BigNum AVX/SSE possible?
SSE/AVX registers could be viewed as integer or floating point BigNums. That is, one could neglect that there exist lanes at all. Does there exist an easy way to exploit this point of view and use these registers as BigNums either singly or…

user1095108
- 14,119
- 9
- 58
- 116
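A sketch of why this is harder than it looks: packed adds do not propagate carries between lanes, so treating an XMM register as one big integer means reconstructing the carries yourself (helper names hypothetical).

#include <emmintrin.h>  // SSE2
#include <cstdint>

// _mm_add_epi64 adds the two 64-bit lanes independently; a carry out of
// the low lane is simply lost, so this is NOT a 128-bit add.
__m128i lanewise_add(__m128i a, __m128i b) {
    return _mm_add_epi64(a, b);
}

// A correct 128-bit add has to detect and propagate the carry itself,
// which is what makes a "BigNum view" of SSE/AVX registers awkward.
void add128(const uint64_t a[2], const uint64_t b[2], uint64_t out[2]) {
    uint64_t lo = a[0] + b[0];
    uint64_t carry = lo < a[0] ? 1u : 0u;  // unsigned overflow -> carry
    out[0] = lo;
    out[1] = a[1] + b[1] + carry;
}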
17
votes
1 answer
Why can't I specify the calling convention for a constructor (C++)?
In Visual Studio 2013 a new calling convention, __vectorcall, exists. It is intended for use with SSE data types that can be passed in SSE registers.
You can specify the calling convention of a member function like so.
struct Vector{//a 16 byte…

Froglegs
- 1,095
- 1
- 11
- 21
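A cut-down sketch of the situation (MSVC-specific; struct members assumed): __vectorcall is accepted on ordinary member functions, where it lets __m128 arguments travel in registers, but, as the question reports, it cannot be specified for the constructor.

#include <xmmintrin.h>

struct Vector {        // a 16-byte wrapper, as in the question
    __m128 data;

    // Plain member function: __vectorcall is accepted here.
    Vector __vectorcall add(Vector other) const {
        Vector r;
        r.data = _mm_add_ps(data, other.data);
        return r;
    }

    // __vectorcall Vector(__m128 v);  // what the question asks about:
    //                                 // MSVC rejects a calling convention here
};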
17
votes
5 answers
Fast Vector Math in .NET - What are the options?
My 3D graphics software, written in C# using SlimDX, does a lot of vector operations on the CPU. (In this specific situation, it is not possible to offload the work to the GPU).
How can I make my vector math faster? So far, I have found these…

LTR
- 1,226
- 2
- 17
- 39
16
votes
3 answers
How to dump all the XMM registers in gdb?
I can dump all the integer registers in gdb with just:
info registers
for the XMM registers (Intel) I need a file like:
print $xmm0
print $xmm1
...
print $xmm15
and then source that file. Is there an easier way?

Peeter Joot
- 7,848
- 7
- 48
- 82
16
votes
2 answers
_mm_load_ps vs. _mm_load_pd vs. etc on Intel x86 ISA
What's the difference between the following two lines?
__m128 x = _mm_load_ps((float *) ptr);
__m128d y = _mm_load_pd((double *)ptr);
In other words, why are there so many different _mm_load_xyz instructions, instead of a generic __m128…

user541686
- 205,094
- 128
- 528
- 886
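A small sketch of the distinction: both intrinsics load 16 aligned bytes, but they return different C types and map to instructions (MOVAPS vs MOVAPD) that tag the data as packed single versus packed double floats, which is why the intrinsic API does not collapse into one generic load.

#include <emmintrin.h>  // SSE2 (for the double version)

void load_examples(const float *fp, const double *dp) {
    __m128  four_floats = _mm_load_ps(fp);  // 4 x float,  16-byte aligned
    __m128d two_doubles = _mm_load_pd(dp);  // 2 x double, 16-byte aligned
    (void)four_floats;
    (void)two_doubles;
}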
16
votes
2 answers
Explaining the different types in Metal and SIMD
When working with Metal, I find there's a bewildering number of types and it's not always clear to me which type I should be using in which context.
In Apple's Metal Shading Language Specification, there's a pretty clear table of which types are…

kennyc
- 5,490
- 5
- 34
- 57
16
votes
1 answer
Does compiler use SSE instructions for a regular C code?
I see people using -msse -msse2 -mfpmath=sse flags by default hoping that this will improve performance. I know that SSE gets engaged when special vector types are used in the C code. But do these flags make any difference for regular C code? Does…

Jennifer M.
- 1,398
- 1
- 9
- 11
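For illustration (a sketch; flags taken from the question): with -mfpmath=sse the compiler routes ordinary scalar float math through SSE instructions such as ADDSS and MULSS instead of the x87 FPU, and with optimization enabled it may also auto-vectorize plain loops like this one.

// Compile with e.g.: gcc -O2 -msse2 -mfpmath=sse (flags from the question)

// No vector types or intrinsics anywhere: scalar math uses SSE registers
// because of -mfpmath=sse, and the loop itself may be auto-vectorized.
void saxpy(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}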
16
votes
1 answer
What's the difference between _mm256_lddqu_si256 and _mm256_loadu_si256
I had been using _mm256_lddqu_si256 based on an example I found online. Later I discovered _mm256_loadu_si256. The Intel Intrinsics guide only states that the lddqu version may perform better when crossing a cache line boundary. What might be the…

Jimbo
- 2,886
- 2
- 29
- 45
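For reference, the two intrinsics share a signature and both perform an unaligned 256-bit load; a minimal side-by-side (a sketch) looks like this, the only difference being which instruction (VMOVDQU vs VLDDQU) the compiler emits, and on most recent CPUs the two behave identically.

#include <immintrin.h>  // AVX

void loads(const void *p) {
    __m256i a = _mm256_loadu_si256((const __m256i *)p);  // VMOVDQU
    __m256i b = _mm256_lddqu_si256((const __m256i *)p);  // VLDDQU
    (void)a;
    (void)b;
}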