Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk, or vector, of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data should be laid out in structure-of-arrays form and processed in sufficiently long streams. Naively "SIMD-optimized" code frequently surprises by running slower than the original.
Questions tagged [simd]
2540 questions
48
votes
6 answers
AVX2 what is the most efficient way to pack left based on a mask?
If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
I've seen in SSE where it was done like…

Froglegs
- 1,095
- 1
- 11
- 21
46
votes
2 answers
Why is np.dot so much faster than np.sum?
Why is np.dot so much faster than np.sum? Following this answer we know that np.sum is slow and has faster alternatives.
For example:
In [20]: A = np.random.rand(1000)
In [21]: B = np.random.rand(1000)
In [22]: %timeit np.sum(A)
3.21 µs ± 270 ns…

Simd
- 19,447
- 42
- 136
- 271
43
votes
1 answer
Difference between MOVDQA and MOVAPS x86 instructions?
I'm looking at the Intel datasheet: Intel® 64 and IA-32 Architectures
Software Developer’s Manual and I can't find the difference between
MOVDQA: Move Aligned Double Quadword
MOVAPS: Move Aligned Packed Single-Precision
In Intel datasheet I can find…

GJ.
- 10,810
- 2
- 45
- 62
41
votes
2 answers
CPU SIMD vs GPU SIMD?
GPU uses the SIMD paradigm, that is, the same portion of code will be executed in parallel, and applied to various elements of a data set.
However, the CPU also uses SIMD and provides instruction-level parallelism. For example, as far as I know,…

Carmellose
- 4,815
- 10
- 38
- 56
41
votes
8 answers
Why is strcmp not SIMD optimized?
I've tried to compile this program on an x64 computer:
#include
int main(int argc, char* argv[])
{
return ::std::strcmp(argv[0],
"really really really really really really really really really"
"really really really really…

user1095108
- 14,119
- 9
- 58
- 116
40
votes
4 answers
Why does vectorizing the loop over 64-bit elements not improve performance over large buffers?
I am investigating the effect of vectorization on the performance of the program. In this regard, I have written following code:
#include
#include
#include
#define LEN 10000000
int main(){
struct timeval…

Pouya
- 1,871
- 3
- 20
- 25
36
votes
5 answers
How to check if compiled code uses SSE and AVX instructions?
I wrote some code to do a bunch of math, and it needs to go fast, so I need it to use SSE and AVX instructions. I'm compiling it using g++ with the flags -O3 and -march=native, so I think it's using SSE and AVX instructions, but I'm not sure. Most…

BadProgrammer99
- 759
- 1
- 5
- 13
34
votes
4 answers
What's missing/sub-optimal in this memcpy implementation?
I've become interested in writing a memcpy() as an educational exercise. I won't write a whole treatise on what I did and didn't think about, but here's
some guy's implementation:
__forceinline // Since Size is usually known,
//…

einpoklum
- 118,144
- 57
- 340
- 684
32
votes
2 answers
Implementation of __builtin_clz
What is the implementation of GCC's (4.6+) __builtin_clz? Does it correspond to some CPU instruction on Intel x86_64 (AVX)?

Cartesius00
- 23,584
- 43
- 124
- 195
32
votes
1 answer
What are the best instruction sequences to generate vector constants on the fly?
"Best" means fewest instructions (or fewest uops, if any instructions decode to more than one uop). Machine-code size in bytes is a tie-breaker for equal insn count.
Constant-generation is by its very nature the start of a fresh dependency chain,…

Peter Cordes
- 328,167
- 45
- 605
- 847
32
votes
3 answers
Intel AVX: 256-bits version of dot product for double precision floating point variables
The Intel Advanced Vector Extensions (AVX) offer no dot product in the 256-bit version (YMM register) for double precision floating point variables. The "Why?" question has been very briefly treated in another forum (here) and on Stack Overflow…

gleeen.gould
- 599
- 1
- 5
- 22
31
votes
5 answers
Why is ARM NEON not faster than plain C++?
Here is a C++ code:
#define ARR_SIZE_TEST ( 8 * 1024 * 1024 )
void cpp_tst_add( unsigned* x, unsigned* y )
{
for ( register int i = 0; i < ARR_SIZE_TEST; ++i )
{
x[ i ] = x[ i ] + y[ i ];
}
}
Here is a neon version:
void…

Smalti
- 507
- 1
- 5
- 13
31
votes
1 answer
Crash with icc: can the compiler invent writes where none existed in the abstract machine?
Consider the following simple program:
#include
#include
#include
void replace(char *str, size_t len) {
for (size_t i = 0; i < len; i++) {
if (str[i] == '/') {
str[i] = '_';
}
…

BeeOnRope
- 60,350
- 16
- 207
- 386
31
votes
4 answers
print a __m128i variable
I'm trying to learn to code using intrinsics, and below is code which does addition
compiler used: icc
#include
#include
int main()
{
__m128i a = _mm_set_epi32(1,2,3,4);
__m128i b = _mm_set_epi32(1,2,3,4);
…

arunmoezhi
- 3,082
- 6
- 35
- 54
30
votes
3 answers
How to write portable simd code for complex multiplicative reduction
I want to write fast simd code to compute the multiplicative reduction of a complex array. In standard C this is:
#include
complex float f(complex float x[], int n ) {
complex float p = 1.0;
for (int i = 0; i < n; i++)
p *=…

Simd
- 19,447
- 42
- 136
- 271