Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk, or vector, of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data needs to be laid out in structure-of-arrays form and processed in long streams. Naively "SIMD-optimized" code frequently surprises by running slower than the original.
Questions tagged [simd]
2540 questions
14
votes
3 answers
How can I disable vectorization while using GCC?
I am compiling my code with the following command:
gcc -O3 -ftree-vectorizer-verbose=6 -msse4.1 -ffast-math
With this, all optimizations are enabled, but I want to disable vectorization while keeping the other optimizations.

PhantomM
- 825
- 6
- 17
- 34
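A sketch of the usual answer, assuming the file being compiled is a placeholder `foo.c`: GCC's `-fno-tree-vectorize` turns off the loop vectorizer and `-fno-tree-slp-vectorize` the straight-line (SLP) vectorizer, while the rest of `-O3` stays in effect.

```shell
# Keep -O3 but switch off both vectorizers (foo.c is a placeholder):
gcc -O3 -fno-tree-vectorize -fno-tree-slp-vectorize -msse4.1 -ffast-math -c foo.c
```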
14
votes
5 answers
Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2
(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probably better.
Also, an AVX2 version of a similar…

pktCoder
- 1,105
- 2
- 15
- 32
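As a reference point for what the question is asking, here is a minimal scalar sketch (hypothetical helper name `count_bit_positions`): for each bit position k, count how many of the 64-bit masks have bit k set. SIMD answers effectively transpose these loops, but must match this contract.

```c
#include <stdint.h>
#include <stddef.h>

/* For each bit position k in 0..63, count how many of the n masks
 * have bit k set. Scalar reference; SIMD versions must agree. */
static void count_bit_positions(const uint64_t *masks, size_t n,
                                uint32_t counts[64])
{
    for (int k = 0; k < 64; k++)
        counts[k] = 0;
    for (size_t i = 0; i < n; i++)
        for (int k = 0; k < 64; k++)
            counts[k] += (uint32_t)((masks[i] >> k) & 1);
}
```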
14
votes
2 answers
Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?
I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, hence I would use the intrinsic _mm256_loadu_pd; the code I've written is:
__m256d d1 = _mm256_loadu_pd(vInOut + i*4);
I've…

Emanuele
- 1,408
- 1
- 15
- 39
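The unaligned-load idiom itself can be sketched with SSE2's `_mm_loadu_pd`, which compiles without extra flags on x86-64 (`_mm256_loadu_pd` is the four-double AVX analogue); `sum2` is a hypothetical helper. Note that compilers may legally fold an unaligned load into a memory operand of a later instruction instead of emitting a standalone `vmovupd`, which is one reason the question's expectation can fail.

```c
#include <emmintrin.h>  /* SSE2 */

/* Load two doubles from memory with no alignment requirement and
 * return their sum. */
static double sum2(const double *p)
{
    __m128d v  = _mm_loadu_pd(p);          /* unaligned load        */
    __m128d hi = _mm_unpackhi_pd(v, v);    /* broadcast high lane   */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));
}
```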
14
votes
2 answers
Constexpr and SSE intrinsics
Most C++ compilers support SIMD (SSE/AVX) instructions with intrinsics like
_mm_cmpeq_epi32
My problem with this is that this function is not marked as constexpr, although "semantically" there is no reason for this function to not be constexpr since…

NoSenseEtAl
- 28,205
- 28
- 128
- 277
14
votes
5 answers
Fastest Implementation of Exponential Function Using AVX
I'm looking for an efficient (Fast) approximation of the exponential function operating on AVX elements (Single Precision Floating Point). Namely - __m256 _mm256_exp_ps( __m256 x ) without SVML.
Relative Accuracy should be something like ~1e-6, or…

Royi
- 4,640
- 6
- 46
- 64
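A scalar sketch of the standard vectorizable recipe (hypothetical helper `exp_approx`; no overflow/underflow handling): range-reduce x = n·ln 2 + r with |r| ≤ ln 2 / 2, evaluate a short polynomial for exp(r), then scale by 2^n by constructing the float exponent bits directly, which is exactly how AVX versions build 2^n without SVML.

```c
#include <stdint.h>
#include <string.h>

/* exp(x) ~ 2^n * P(r), where x = n*ln2 + r and P is a 6th-order
 * Taylor polynomial. Accurate to roughly 1e-6 relative error for
 * moderate x; no special-case handling for overflow/underflow. */
static float exp_approx(float x)
{
    const float LOG2E = 1.44269504f;   /* 1/ln(2) */
    const float LN2   = 0.69314718f;
    int   n = (int)(x * LOG2E + (x >= 0.0f ? 0.5f : -0.5f)); /* round */
    float r = x - (float)n * LN2;      /* |r| <= ln(2)/2 */
    /* Taylor polynomial for exp(r), Horner form */
    float p = 1.0f + r * (1.0f + r * (0.5f + r * (1.0f/6 + r * (1.0f/24
              + r * (1.0f/120 + r * (1.0f/720))))));
    uint32_t bits = (uint32_t)(n + 127) << 23;  /* 2^n as float bits */
    float scale;
    memcpy(&scale, &bits, sizeof scale);
    return p * scale;
}
```

The same exponent-bit construction maps directly to `_mm256_cvtps_epi32`, a shift, and a reinterpret cast in the AVX version.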
14
votes
2 answers
Why does the FMA _mm256_fmadd_pd() intrinsic have 3 asm mnemonics, "vfmadd132pd", "231" and "213"?
Could someone explain to me why there are 3 variants of the fused multiply-accumulate instruction: vfmadd132pd, vfmadd231pd and vfmadd213pd, while there is only one C intrinsics _mm256_fmadd_pd?
To make things simple, what is the difference between…

Zheyuan Li
- 71,365
- 17
- 180
- 248
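The short version of the usual answer: all three forms compute a*b + c with a single rounding; the digits only say which source operand the destination register overwrites (132: dst = dst*src3 + src2, 213: dst = src2*dst + src3, 231: dst = src2*src3 + dst), so the compiler can keep whichever value must survive, typically the accumulator, in its register. A sketch of the accumulator pattern where 231 gets picked:

```c
/* In a dot-product loop the accumulator must survive each
 * iteration, so the compiler emits vfmadd231 with acc as the
 * destination. With FP contraction enabled, the expression below
 * compiles to a single fused multiply-add per element. */
static double dot(const double *x, const double *y, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc = x[i] * y[i] + acc;
    return acc;
}
```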
14
votes
4 answers
Fast 24-bit array -> 32-bit array conversion?
Quick Summary:
I have an array of 24-bit values. Any suggestion on how to quickly expand the individual 24-bit array elements into 32-bit elements?
Details:
I'm processing incoming video frames in realtime using Pixel Shaders in DirectX 10. A…

Clippy
- 354
- 3
- 10
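A scalar reference for the task (hypothetical helper `expand24to32`; little-endian byte order assumed): each packed 24-bit element becomes a zero-extended 32-bit one. SIMD versions typically achieve the same with byte shuffles over 16-byte chunks.

```c
#include <stdint.h>
#include <stddef.h>

/* Expand `count` packed little-endian 24-bit values at src into
 * zero-extended 32-bit values at dst. */
static void expand24to32(const uint8_t *src, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = (uint32_t)src[3*i]
               | ((uint32_t)src[3*i + 1] << 8)
               | ((uint32_t)src[3*i + 2] << 16);
    }
}
```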
14
votes
2 answers
SIMD latency throughput
On the Intel Intrinsics Guide, most instructions also list a value for both latency and throughput. Example:
__m128i _mm_min_epi32
Performance
Architecture Latency Throughput
Haswell 1 0.5
Ivy Bridge 1 0.5
Sandy Bridge 1 …

Alexandros
- 2,160
- 4
- 27
- 52
14
votes
2 answers
Are GPU/CUDA cores SIMD ones?
Let's take the nVidia Fermi Compute Architecture. It says:
The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The…

Marc Andreson
- 3,405
- 5
- 35
- 51
14
votes
3 answers
Find index of maximum element in x86 SIMD vector
I'm thinking of implementing 8-ary heapsort for uint32_t's. To do this I need a function that selects the index of the maximum element in an 8-element vector so that I can compare it with the parent element and conditionally perform a swap and further siftDown…

Wibowit
- 371
- 2
- 11
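The scalar contract such a function must satisfy, as a sketch (hypothetical `argmax8_u32`; ties resolve to the lowest index, which a compare/movemask/count-trailing-zeros SIMD version reproduces naturally):

```c
#include <stdint.h>

/* Index of the maximum of 8 unsigned 32-bit values; on ties the
 * lowest index wins. */
static int argmax8_u32(const uint32_t v[8])
{
    int best = 0;
    for (int i = 1; i < 8; i++)
        if (v[i] > v[best])
            best = i;
    return best;
}
```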
14
votes
6 answers
SIMD or not SIMD - cross platform
I need ideas on how to write a C++ cross-platform implementation of a few parallelizable problems so I can take advantage of SIMD (SSE, SPU, etc.) if available. I also want to be able to switch at run time between SIMD and not…

Aleks
- 1,177
- 10
- 21
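One common shape for the run-time switch is a function pointer resolved on first call. This sketch uses a placeholder probe `cpu_has_simd` and a scalar stand-in for the vector body; a real build would substitute an actual detection call such as GCC/Clang's `__builtin_cpu_supports("sse4.1")` on x86, and a genuinely vectorized implementation.

```c
#include <stddef.h>

typedef int (*sum_fn)(const int *, size_t);

static int sum_scalar(const int *a, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Stand-in: a real build provides a vectorized body here. */
static int sum_simd(const int *a, size_t n) { return sum_scalar(a, n); }

/* Placeholder probe: replace with a real CPU-feature check, e.g.
 * __builtin_cpu_supports("sse4.1") with GCC/Clang on x86. */
static int cpu_has_simd(void) { return 0; }

static int sum_dispatch(const int *a, size_t n)
{
    static sum_fn fn;                 /* resolved once, reused after */
    if (!fn)
        fn = cpu_has_simd() ? sum_simd : sum_scalar;
    return fn(a, n);
}
```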
14
votes
1 answer
Using __m256d registers
How do you use __m256d?
Say I want to use the Intel AVX instruction _mm256_add_pd on a simple Vector3 class with three 64-bit double-precision components (x, y, and z). What is the correct way to use this?
Since x, y and z are members of the Vector3…

bobobobo
- 64,917
- 62
- 258
- 363
14
votes
1 answer
SIMD the following code
How do I SIMDize the following code in C (using SIMD intrinsics, of course)? I am having trouble understanding SIMD intrinsics and this would help a lot:
int sum_naive( int n, int *a )
{
int sum = 0;
for( int i = 0; i < n; i++ )
sum…

user1585869
- 301
- 3
- 11
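A sketch of how sum_naive vectorizes with SSE2 (baseline on x86-64; `sum_sse2` is a hypothetical name): accumulate four ints per iteration in a vector register, reduce the vector at the end, and handle leftover elements with a scalar tail loop.

```c
#include <emmintrin.h>  /* SSE2 */

static int sum_sse2(int n, const int *a)
{
    __m128i acc = _mm_setzero_si128();
    int i = 0;
    /* Four lanes of partial sums. */
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_epi32(acc,
                            _mm_loadu_si128((const __m128i *)(a + i)));
    /* Horizontal reduction of the accumulator. */
    int tmp[4];
    _mm_storeu_si128((__m128i *)tmp, acc);
    int sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    /* Scalar tail for n not divisible by 4. */
    for (; i < n; i++)
        sum += a[i];
    return sum;
}
```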
14
votes
1 answer
Look-Up Table using SIMD
I have a big pixel processing function which I am currently trying to optimize using intrinsic functions.
Being an SSE novice, I am not sure how to tackle the part of the code which involves lookup tables.
Basically, I am trying to vectorize the…

Rotem
- 21,452
- 6
- 62
- 109
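For tables of up to 16 bytes, the usual tool is SSSE3's `_mm_shuffle_epi8`, which acts as 16 parallel byte lookups: each output byte is table[index & 0x0F], or 0 if the index byte's high bit is set. A sketch (hypothetical `lut16`; the GCC/Clang `target` attribute lets it compile without `-mssse3`, and callers are assumed to run on an SSSE3-capable CPU):

```c
#include <tmmintrin.h>  /* SSSE3 */
#include <stdint.h>

/* 16 table lookups at once: out[j] = table[idx[j]] for idx in 0..15. */
__attribute__((target("ssse3")))
static void lut16(const uint8_t table[16], const uint8_t idx[16],
                  uint8_t out[16])
{
    __m128i t = _mm_loadu_si128((const __m128i *)table);
    __m128i i = _mm_loadu_si128((const __m128i *)idx);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(t, i));
}
```

Larger tables need several shuffles with blending, or AVX2 gathers.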
13
votes
3 answers
SSE multiplication 16 x uint8_t
I want to multiply with SSE4 a __m128i object holding 16 unsigned 8-bit integers, but I could only find an intrinsic for multiplying 16-bit integers. Is there nothing such as _mm_mult_epi8?

Roby
- 2,011
- 4
- 28
- 55
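There is indeed no 8-bit multiply in SSE. The standard workaround, sketched here as a hypothetical `mul_u8`, widens each half of the vector to 16 bits, multiplies with SSE2's `_mm_mullo_epi16`, and packs the low bytes back, i.e. multiplication modulo 256:

```c
#include <emmintrin.h>  /* SSE2 */

/* Lane-wise product of 16 unsigned bytes, keeping the low 8 bits
 * of each result (product mod 256). */
static __m128i mul_u8(__m128i a, __m128i b)
{
    __m128i zero = _mm_setzero_si128();
    /* Zero-extend each half to 16-bit lanes. */
    __m128i a_lo = _mm_unpacklo_epi8(a, zero);
    __m128i b_lo = _mm_unpacklo_epi8(b, zero);
    __m128i a_hi = _mm_unpackhi_epi8(a, zero);
    __m128i b_hi = _mm_unpackhi_epi8(b, zero);
    __m128i lo = _mm_mullo_epi16(a_lo, b_lo);
    __m128i hi = _mm_mullo_epi16(a_hi, b_hi);
    /* Keep only the low byte of every product, then repack. */
    __m128i mask = _mm_set1_epi16(0x00FF);
    return _mm_packus_epi16(_mm_and_si128(lo, mask),
                            _mm_and_si128(hi, mask));
}
```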