Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include: x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To efficiently use SIMD instructions, data needs to be in structure-of-arrays form and should occur in longer streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.
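As a rough illustration of the structure-of-arrays layout mentioned above, here is a minimal C sketch. The type `PointsSoA` and the function `scale_x` are hypothetical names chosen for this example, not part of any library. The point is only that each field sits in its own contiguous array, so a loop over one field reads a single dense stream that a compiler may be able to vectorize (this is not guaranteed).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure-of-arrays layout: one contiguous array per field,
   instead of an array of {x, y} structs. */
typedef struct {
    float *x;   /* all x coordinates, stored back to back */
    float *y;   /* all y coordinates, stored back to back */
    size_t n;   /* number of points */
} PointsSoA;

/* Scale every x coordinate by s. The loop walks one dense array of floats,
   which is the access pattern SIMD loads and stores are designed for. */
void scale_x(PointsSoA *p, float s)
{
    assert(p && p->x);
    for (size_t i = 0; i < p->n; i++) {
        p->x[i] *= s;
    }
}
```

With an array-of-structs layout the same loop would skip over every `y` value in memory, which usually forces shuffles or gathers before vector arithmetic can be applied.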

2540 questions

1 answer

Implementation and performance of using bitsets with SSE

I am trying to speed up my method using SSE (in Visual Studio). I am a novice in the area. The main data types I work with in my method are bitsets of size 32, and the logical operation I mainly use is the AND operation (with _BitScanForward scarcely…

asked May 29 '12 at 15:28

SMir

1 answer

How to count the number of bytes that lie in some range using SSE?

I want to write a C program which counts the number of bytes in a range a...c with the code below: char a[16], b[16], c[16]; int counter = 0; for(i = 0; i < 16; i++) { if((a[i] < b[i]) && (b[i] < c[i])) counter++; } return counter; …

x86 sse simd

asked May 15 '12 at 21:38

quartz
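One way the scalar loop in the question above could be mapped to SSE2 is sketched below. This is only a sketch under stated assumptions: the bytes are compared as signed 8-bit values (plain `char` is signed on typical x86 compilers, but that is implementation-defined), the arrays are exactly 16 bytes long, and the function name `count_between16` is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Count positions i in [0, 16) where a[i] < b[i] && b[i] < c[i].
   Assumption: bytes are compared as SIGNED 8-bit integers. */
int count_between16(const signed char *a, const signed char *b,
                    const signed char *c)
{
    assert(a && b && c);

    /* Unaligned loads, so the arrays need no particular alignment. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_loadu_si128((const __m128i *)c);

    /* Each compare yields 0xFF in byte lanes where the test holds. */
    __m128i a_lt_b = _mm_cmplt_epi8(va, vb);
    __m128i b_lt_c = _mm_cmplt_epi8(vb, vc);

    /* Collect the top bit of every byte lane into a 16-bit mask. */
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_and_si128(a_lt_b, b_lt_c));

    /* Portable bit count; a popcount instruction or builtin could be
       substituted where available. */
    int count = 0;
    while (mask) {
        mask &= mask - 1;  /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

If the data should be treated as unsigned bytes instead, the signed compare would give wrong results for values above 127 and the inputs would need adjusting first (for example by flipping the top bit of each byte).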

1 answer

How to do aligned additions without aligned arrays

So I was trying to do an array operation that looked something like for (int i=0;i<32;i++) { output[offset+i] += input[i]; } where output and input are float arrays (which are 16-byte aligned thanks to malloc). However, I can't guarantee that…

c sse simd

asked Apr 24 '12 at 02:55

John Palmer

1 answer

Sum of the four 32-bit elements of an __m128 vector

I'm using intrinsics to optimize a program of mine. But now I would like to sum the four elements that are in a __m128 vector in order to compare the result to a floating point value. For instance, let's say I have this 128-bit vector: {a, b, c,…

sum simd sse2 sse3

asked Apr 15 '12 at 16:05

Merkil
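A common way to do the horizontal sum asked about above, using only SSE1 intrinsics, can be sketched as follows. The names `hsum_ps` and `sum4` are hypothetical helpers written for this example. Note that the elements are added pairwise, so the rounding can differ slightly from a left-to-right scalar sum, which matters when the result is compared against a floating point value.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sum the four floats of v, assumed to hold {a, b, c, d} from low to high. */
float hsum_ps(__m128 v)
{
    /* Swap neighbours within each pair: {b, a, d, c}. */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* Pairwise sums: {a+b, a+b, c+d, c+d}. */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Move the high pair of sums into the low half: lane 0 becomes c+d. */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Lane 0 now holds (a+b) + (c+d). */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

/* Convenience wrapper: sum four floats from memory (no alignment required). */
float sum4(const float *p)
{
    assert(p);
    return hsum_ps(_mm_loadu_ps(p));
}
```

Because a horizontal sum works against the grain of SIMD, it is usually best kept out of inner loops: accumulate vertically in a vector and reduce once at the end.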

3 answers

Can raymarching be accelerated under an SIMD architecture?

The answer would seem to be no, because raymarching is highly conditional i.e. each ray follows a unique execution path, since on each step we check for opacity, termination etc. that will vary based on the direction of the individual ray. So it…

parallel-processing real-time gpu simd raytracing

asked Feb 05 '12 at 10:14

Engineer

1 answer

simd store delay

I have the following type of code: short v[8] __attribute__ (( aligned(16))); ... // in an inlined function : _mm_store_si128(v, some_m128i_value); ... // some more operation (4 additions ) outp[0] = v[1] / 2; // <- first access of v since the…

c gcc sse simd

asked Dec 08 '11 at 17:17

shodanex

1 answer

Inline-Assembler-Code in C, copy values from Array to xmm

I have two arrays and I want to get the dot product. How do I get the values of vek and vec into xmm0 and xmm1? And how do I get the value in xmm1 (??) so that I can use it for "printf"? #include main(){ float vek[4] = {4.0, 3.0,…

x86 sse simd sse4

asked Nov 18 '11 at 13:37

degude

1 answer

How many float multiplies can be performed with a single core of the current Intel architectures?

Trying to assess the performance gain from an embedded architecture, I tried to search for the number of floating point multiplies that can be performed in a cycle on a single core of the Core 2 and Core i7 architectures, but could not find a quick…

floating-point parallel-processing core simd cpu-architecture

asked Nov 11 '11 at 01:16

ysap

1 answer

How to overlay images with alpha blending using AVX512 instructions?

I have two images A and B that are stored as byte arrays of ARGB data: Image A: [a0, r0, g0, b0, a1, r1, g1, b1, ...] Image B: [a0, r0, g0, b0, a1, r1, g1, b1, ...] I would like to overlay image B on top of A using the alpha blending formula. How…

image-processing rust simd alphablending avx512

asked Aug 31 '23 at 16:40

Chris

1 answer

Why are vectorized computations on integer arrays faster if a smaller-width integer type is used?

I used NumPy to test the differences in execution times of vectorized arithmetic operations on integer arrays of different integer widths. I create 8-bit, 16-bit, 32-bit and 64-bit integer arrays with 100 million random elements each, and then…

python numpy performance vectorization simd

asked Aug 20 '23 at 10:22

Avantgarde

0 answers

OpenJDK Vector API type conversion issue (Double to Float)

I'm using JDK21 EA to test the Vector API performance. My original (non-vector) code looks like this: double[] src; double divisor; float[] dst; for (int i=0; i

java vector simd

asked Aug 08 '23 at 14:02

Jatinder Sangha

2 answers

vectorized & in numpy

My use case is to use numpy for bitmaps (that is, set operations using bit encoding). I use numpy arrays with uint64. If I have a query with 3 entries, I can then do bitmap | query !=0 to check if any element in the query is in the set. Amazing! Now…

python numpy bitmap simd

asked Aug 07 '23 at 15:05

Guillaume

1 answer

Matrix multiplication using simd produces incorrect results when filled with floating point values

I wanted to implement matrix multiplication with SIMD. Everything is fine when the matrix is filled with integers, but there are issues when my matrices are filled with floating point values: the results are not quite correct. Here is my…

c++ simd intrinsics sse2

asked Aug 03 '23 at 14:07

Arheus

1 answer

_mm512_i32scatter_ps when the indices are repeated

What happens when you call _mm512_i32scatter_ps and the indices repeat? Does it store the sum? Does it just store one? Is it UB? I can't seem to find any documentation on this edge case and I don't want to rely on it if it is UB. I tried searching on…

simd intrinsics avx512

asked Aug 02 '23 at 04:05

Grogfrognumber47

0 answers

Use AVX-AVX2 instructions in an AVX512 function

For example, we have a CPU with AVX512bw support. Now I want to run 3 types of string-length SIMD functions on this CPU. The first function takes 16 bytes (AVX) of a string and searches its characters for the null-terminator, and this continues until…

assembly simd avx avx512

asked Jul 31 '23 at 19:35

HelloGUI