Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include: x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To efficiently use SIMD instructions, data needs to be in structure-of-arrays form and should occur in longer streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.
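As a rough illustration of the structure-of-arrays point above, here is a minimal sketch in plain C. The type and function names (`PointAoS`, `PointsSoA`, `scale_x_aos`, `scale_x_soa`) and the size `N` are hypothetical, chosen only for this example. It contrasts the two layouts and makes no performance claim; whether a compiler vectorizes either loop depends on the compiler, flags, and target.

```c
#include <stddef.h>

#define N 8  /* hypothetical element count for this sketch */

/* Array-of-structures: the x, y, z of one point sit next to each other. */
struct PointAoS { float x, y, z; };

/* Structure-of-arrays: all x values are contiguous, likewise y and z. */
struct PointsSoA { float x[N], y[N], z[N]; };

/* AoS: consecutive x values are 3 floats apart (strided access). */
static void scale_x_aos(struct PointAoS *p, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        p[i].x *= s;
}

/* SoA: consecutive x values are adjacent (unit stride), so one vector
   load can pick up several of them at once. */
static void scale_x_soa(struct PointsSoA *p, float s)
{
    for (size_t i = 0; i < N; i++)
        p->x[i] *= s;
}
```

Both functions compute the same result; only the memory layout differs. With the SoA layout a single 128-bit load can fetch four consecutive `x` values, whereas the AoS layout interleaves them with `y` and `z`.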
Questions tagged [simd]
2540 questions
2
votes
1 answer
Implementation and performance of using bitsets with SSE
I am trying to speed up my method using SSE (in Visual Studio). I am a novice in the area. The main data types I work with in my method are bitsets of size 32, and the logical operation I mainly use is the AND operation (with _BitScanForward scarcely…

SMir
2
votes
1 answer
How to count the number of bytes which lie in some range using SSE?
I want to write a C program which counts the number of bytes in a range a...c with the code below:
char a[16], b[16], c[16];
int counter = 0;
for (int i = 0; i < 16; i++)
{
    if ((a[i] < b[i]) && (b[i] < c[i]))
        counter++;
}
return counter; …

quartz
2
votes
1 answer
How to do aligned additions without aligned arrays
So I was trying to do an array operation that looked something like
for (int i = 0; i < 32; i++)
{
    output[offset + i] += input[i];
}
where output and input are float arrays (which are 16-byte aligned thanks to malloc). However, I can't guarantee that…

John Palmer
2
votes
1 answer
Sum of the four 32-bit elements of a __m128 vector
I'm using intrinsics to optimize a program of mine. But now I would like to sum the four elements that are in a __m128 vector in order to compare the result to a floating point value. For instance, let's say I have this 128-bit vector: {a, b, c,…

Merkil
1
vote
3 answers
Can raymarching be accelerated under an SIMD architecture?
The answer would seem to be no, because raymarching is highly conditional i.e. each ray follows a unique execution path, since on each step we check for opacity, termination etc. that will vary based on the direction of the individual ray.
So it…

Engineer
1
vote
1 answer
simd store delay
I have the following type of code
short v[8] __attribute__ (( aligned(16)));
...
// in an inlined function :
_mm_store_si128((__m128i *)v, some_m128i_value);
... // some more operation (4 additions )
outp[0] = v[1] / 2; // <- first access of v since the…

shodanex
1
vote
1 answer
Inline-Assembler-Code in C, copy values from Array to xmm
I have two arrays and I want to get the dot product.
How do I get the values of vek and vec into xmm0 and xmm1?
And how do I get the value held in xmm1 (??) so that I can use it for "printf"?
#include
main(){
float vek[4] = {4.0, 3.0,…

degude
1
vote
1 answer
How many float multiplies can be performed with a single core of the current Intel architectures?
Trying to assess the performance gain from an embedded architecture, I tried to search for the number of floating point multiplies that can be performed in a cycle on a single core of the Core 2 and Core i7 architectures, but could not find a quick…

ysap
1
vote
1 answer
How to overlay images with alpha blending using AVX512 instructions?
I have two images A and B that are stored as byte arrays of ARGB data:
Image A: [a0, r0, g0, b0, a1, r1, g1, b1, ...]
Image B: [a0, r0, g0, b0, a1, r1, g1, b1, ...]
I would like to overlay image B on top of A using the alpha blending formula.
How…

Chris
1
vote
1 answer
Why are vectorized computations on integer arrays faster if a smaller-width integer type is used?
I used NumPy to test the differences in execution times of vectorized arithmetic operations on integer arrays of different integer widths. I create 8-bit, 16-bit, 32-bit and 64-bit integer arrays with 100 million random elements each, and then…

Avantgarde
1
vote
0 answers
OpenJDK Vector API type conversion issue (Double to Float)
I'm using JDK21 EA to test the Vector API performance.
My original (non-vector) code looks like this:
double[] src;
double divisor;
float[] dst;
for (int i=0; i

Jatinder Sangha
1
vote
2 answers
vectorized & in numpy
My use case is to use numpy for bitmaps (that is, set operations using bit encoding). I use numpy arrays with uint64. If I have a query with 3 entries, I can then do bitmap | query !=0 to check if any element in the query is in the set. Amazing!
Now…

Guillaume
1
vote
1 answer
Matrix multiplication using simd produces incorrect results when filled with floating point values
I wanted to create a matrix multiplication with SIMD. Everything is fine when the matrix is filled with integers. But there are some issues when my matrices are filled with floating point values. The results are not quite correct.
Here is my…

Arheus
1
vote
1 answer
_mm512_i32scatter_ps when the indices are repeated
What happens when you call _mm512_i32scatter_ps and the indices repeat? Does it store the sum? Does it just store one? Is it UB? I can't seem to find any documentation on this edge case and I don't want to rely on it if it is UB.
I tried searching on…

Grogfrognumber47
1
vote
0 answers
Use AVX-AVX2 instructions in an AVX512 function
For example, we have a CPU with AVX512bw support.
Now I want to run 3 types of string-length SIMD functions on this CPU.
The first function takes 16 bytes (AVX) of a string and searches its characters for the null-terminator, and this continues until…

HelloGUI