Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include: x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To efficiently use SIMD instructions, data needs to be in structure-of-arrays form and should occur in longer streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.

2540 questions
14 votes · 3 answers

How can I disable vectorization while using GCC?

I am compiling my code using the following command: gcc -O3 -ftree-vectorizer-verbose=6 -msse4.1 -ffast-math. With this, all the optimizations are enabled, but I want to disable vectorization while keeping the other optimizations.
PhantomM · 825
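Editor's note: GCC has dedicated flags for exactly this — the loop vectorizer and the SLP vectorizer (both enabled at -O3) can be switched off individually while leaving -O3 in place. A sketch of the adjusted command (foo.c is a placeholder file name):

```shell
# -O3 and -ffast-math still apply; only the two tree vectorizers are disabled.
gcc -O3 -msse4.1 -ffast-math -fno-tree-vectorize -fno-tree-slp-vectorize foo.c
```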
14 votes · 5 answers

Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probably better. Also, an AVX2 version of a similar…
pktCoder · 1,105
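Editor's note: for readers new to the problem, this is "positional popcount": one counter per bit position, incremented across many masks. A minimal scalar reference in C (a SIMD version processes several masks per iteration in the same spirit, typically with byte-sliced counters):

```c
#include <stdint.h>
#include <stddef.h>

/* For each of the 64 bit positions, count how many of the n masks
 * have that bit set. Scalar baseline for the vectorized answers. */
void count_bit_positions(const uint64_t *masks, size_t n, uint32_t counts[64])
{
    for (int b = 0; b < 64; b++)
        counts[b] = 0;
    for (size_t i = 0; i < n; i++)
        for (int b = 0; b < 64; b++)
            counts[b] += (uint32_t)((masks[i] >> b) & 1u);
}
```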
14 votes · 2 answers

Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?

I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, hence I would use the intrinsic _mm256_loadu_pd; the code I've written is: __m256d d1 = _mm256_loadu_pd(vInOut + i*4); I've…
Emanuele · 1,408
14 votes · 2 answers

Constexpr and SSE intrinsics

Most C++ compilers support SIMD (SSE/AVX) instructions with intrinsics like _mm_cmpeq_epi32. My problem with this is that this function is not marked as constexpr, although "semantically" there is no reason for this function to not be constexpr, since…
NoSenseEtAl · 28,205
14 votes · 5 answers

Fastest Implementation of Exponential Function Using AVX

I'm looking for an efficient (Fast) approximation of the exponential function operating on AVX elements (Single Precision Floating Point). Namely - __m256 _mm256_exp_ps( __m256 x ) without SVML. Relative Accuracy should be something like ~1e-6, or…
Royi · 4,640
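Editor's note: the usual recipe is range reduction (x = k·ln2 + r with |r| ≤ ln2/2), a short polynomial for exp(r), then scaling by 2^k via the float exponent field. A scalar C model of that recipe follows; each step maps onto AVX intrinsics (round, FMA polynomial, integer shift into the exponent). This is a sketch with no overflow handling, accurate to roughly 1e-6 relative for moderate x:

```c
#include <stdint.h>

/* Scalar model of the standard vectorizable exp(x) kernel. */
float exp_sketch(float x)
{
    const float ln2 = 0.69314718f;
    /* round x/ln2 to nearest integer k, so x = k*ln2 + r, |r| <= ln2/2 */
    int k = (int)(x / ln2 + (x >= 0 ? 0.5f : -0.5f));
    float r = x - (float)k * ln2;
    /* degree-5 polynomial approximation of exp(r) */
    float p = 1.0f + r * (1.0f + r * (0.5f + r * (1.0f/6.0f
              + r * (1.0f/24.0f + r * (1.0f/120.0f)))));
    /* build 2^k directly in the IEEE-754 exponent field */
    union { uint32_t u; float f; } scale;
    scale.u = (uint32_t)(k + 127) << 23;
    return p * scale.f;
}
```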
14 votes · 2 answers

Why does the FMA _mm256_fmadd_pd() intrinsic have 3 asm mnemonics, "vfmadd132pd", "231" and "213"?

Could someone explain to me why there are 3 variants of the fused multiply-accumulate instruction: vfmadd132pd, vfmadd231pd and vfmadd213pd, while there is only one C intrinsic, _mm256_fmadd_pd? To make things simple, what is the difference between…
Zheyuan Li · 71,365
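Editor's note: the three mnemonics all compute a·b + c; the digits only encode which source register the result overwrites, so the compiler can clobber whichever value is dead. A scalar reference of the shared math (shown unfused here; fma() from <math.h> adds the single-rounding guarantee the hardware gives):

```c
/* All three forms compute a*b + c; they differ only in the destination:
 *   vfmadd132pd  dst = dst  * src3 + src2
 *   vfmadd213pd  dst = src2 * dst  + src3
 *   vfmadd231pd  dst = src2 * src3 + dst
 * One intrinsic suffices because the compiler picks the variant whose
 * overwritten operand is no longer needed. */
double fmadd_ref(double a, double b, double c)
{
    return a * b + c;
}
```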
14 votes · 4 answers

Fast 24-bit array -> 32-bit array conversion?

Quick Summary: I have an array of 24-bit values. Any suggestion on how to quickly expand the individual 24-bit array elements into 32-bit elements? Details: I'm processing incoming video frames in realtime using Pixel Shaders in DirectX 10. A…
Clippy · 354
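Editor's note: the operation being asked for is a byte re-gather — every 3 source bytes become one zero-extended 32-bit value. Scalar reference in C (assuming little-endian packing; SSSE3's pshufb can perform the same byte shuffle 16 bytes at a time):

```c
#include <stdint.h>
#include <stddef.h>

/* Expand packed little-endian 24-bit values into 32-bit values,
 * zero-filling the top byte. */
void expand_24_to_32(const uint8_t *src, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = (uint32_t)src[3*i]
               | ((uint32_t)src[3*i + 1] << 8)
               | ((uint32_t)src[3*i + 2] << 16);
    }
}
```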
14 votes · 2 answers

SIMD latency throughput

On the Intel Intrinsics Guide for most instructions, it also has a value for both latency and throughput. Example: __m128i _mm_min_epi32 Performance Architecture Latency Throughput Haswell 1 0.5 Ivy Bridge 1 0.5 Sandy Bridge 1 …
Alexandros · 2,160
14 votes · 2 answers

Are GPU/CUDA cores SIMD ones?

Let's take the NVIDIA Fermi Compute Architecture. It says: The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The…
Marc Andreson · 3,405
14 votes · 3 answers

Find index of maximum element in x86 SIMD vector

I'm thinking of implementing 8-ary heapsort for uint32_t's. To do this I need a function that selects the index of the maximum element in an 8-element vector so that I can compare it with the parent element and conditionally perform a swap and further siftDown…
Wibowit · 371
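Editor's note: scalar reference for the operation being vectorized. The common SIMD trick is to carry an index vector alongside the data and reduce both together with max/compare/blend steps; this plain-C version defines the expected result (first index wins on ties):

```c
#include <stdint.h>

/* Index of the maximum of 8 values; ties resolve to the lowest index. */
int argmax8(const uint32_t v[8])
{
    int best = 0;
    for (int i = 1; i < 8; i++)
        if (v[i] > v[best])
            best = i;
    return best;
}
```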
14 votes · 6 answers

SIMD or not SIMD - cross platform

I need some ideas on how to write a cross-platform C++ implementation of a few parallelizable problems so that I can take advantage of SIMD (SSE, SPU, etc.) if available. I also want to be able to switch at run time between SIMD and not…
Aleks · 1,177
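Editor's note: the standard pattern for the run-time part is a function pointer selected once at startup. A C sketch follows; have_sse() and scale_simd_stub() are placeholders — a real build would compile the SIMD backend in its own translation unit with the right -m flags and probe the CPU with something like GCC's __builtin_cpu_supports("sse4.1") (cpuid on x86, getauxval on Linux/ARM):

```c
#include <stddef.h>

typedef void (*scale_fn)(float *data, size_t n, float s);

/* Portable fallback backend. */
static void scale_scalar(float *data, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= s;
}

/* Stand-in for an intrinsics-based backend built separately. */
static void scale_simd_stub(float *data, size_t n, float s)
{
    scale_scalar(data, n, s);
}

/* Placeholder CPU probe; always reports "no SIMD" in this sketch. */
static int have_sse(void) { return 0; }

/* Called once; the rest of the program uses the returned pointer. */
scale_fn select_scale(void)
{
    return have_sse() ? scale_simd_stub : scale_scalar;
}
```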
14 votes · 1 answer

Using __m256d registers

How do you use __m256d? Say I want to use the Intel AVX instruction _mm256_add_pd on a simple Vector3 class with three 64-bit double-precision components (x, y, and z). What is the correct way to use this? Since x, y and z are members of the Vector3…
bobobobo · 64,917
14 votes · 1 answer

SIMD the following code

How do I SIMDize the following code in C (using SIMD intrinsics, of course)? I am having trouble understanding SIMD intrinsics and this would help a lot: int sum_naive( int n, int *a ) { int sum = 0; for( int i = 0; i < n; i++ ) sum…
user1585869 · 301
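Editor's note: the shape of the SSE2 answer, written in plain C so the mapping is visible — keep 4 running sums ("lanes") and add 4 elements per step (what _mm_add_epi32 does with a __m128i accumulator), then fold the lanes together at the end (the horizontal sum, e.g. via _mm_shuffle_epi32 + _mm_add_epi32), with a scalar tail for the leftovers:

```c
int sum_lanes(int n, const int *a)
{
    int lane[4] = {0, 0, 0, 0};
    int i = 0;
    for (; i + 4 <= n; i += 4)          /* main loop: one "vector add" per pass */
        for (int l = 0; l < 4; l++)
            lane[l] += a[i + l];
    int sum = lane[0] + lane[1] + lane[2] + lane[3];  /* horizontal sum */
    for (; i < n; i++)                  /* scalar tail */
        sum += a[i];
    return sum;
}
```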
14 votes · 1 answer

Look-Up Table using SIMD

I have a big pixel processing function which I am currently trying to optimize using intrinsic functions. Being an SSE novice, I am not sure how to tackle the part of the code which involves lookup tables. Basically, I am trying to vectorize the…
Rotem · 21,452
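Editor's note: the scalar operation in question, for reference. Whether it vectorizes well depends on table size: a table of 16 byte entries fits SSSE3's _mm_shuffle_epi8 (pshufb), which performs 16 parallel lookups into a 16-byte table; larger tables usually stay scalar or use AVX2 gathers:

```c
#include <stdint.h>
#include <stddef.h>

/* Apply a 256-entry lookup table to each pixel in place. */
void apply_lut(uint8_t *pixels, size_t n, const uint8_t lut[256])
{
    for (size_t i = 0; i < n; i++)
        pixels[i] = lut[pixels[i]];
}
```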
13 votes · 3 answers

SSE multiplication 16 x uint8_t

I want to multiply, with SSE4, a __m128i object holding 16 unsigned 8-bit integers, but I could only find an intrinsic for multiplying 16-bit integers. Is there nothing such as _mm_mult_epi8?
Roby · 2,011
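Editor's note: there is indeed no 8-bit multiply in SSE; the standard workaround widens each half of the vector to 16 bits (punpcklbw/punpckhbw), multiplies with _mm_mullo_epi16, and repacks the low bytes. Element-wise, the computation is:

```c
#include <stdint.h>
#include <stddef.h>

/* Per-element model of the SSE widen/multiply/repack trick:
 * widen to 16 bits, multiply, keep the low byte (i.e. modulo 256). */
void mul_u8(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t wide = (uint16_t)a[i] * (uint16_t)b[i];
        out[i] = (uint8_t)wide;
    }
}
```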