Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data needs to be in structure-of-arrays form and should be processed in long, contiguous streams. Naively "SIMD-optimized" code frequently surprises by running slower than the original.
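As a minimal illustration of the idea, here is a sketch using x86 SSE2 intrinsics (`add4` is an invented name for this example): one instruction adds four 32-bit integers at once.

```cpp
#include <emmintrin.h>  // SSE2

// One _mm_add_epi32 adds four packed 32-bit ints in a single instruction.
void add4(const int* a, const int* b, int* out)
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
}
```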
Questions tagged [simd]
2540 questions
30 votes · 2 answers
How to implement atoi using SIMD?
I'd like to try writing an atoi implementation using SIMD instructions, to be included in RapidJSON (a C++ JSON reader/writer library). It currently has some SSE2 and SSE4.2 optimizations in other places.
If it's a speed gain, multiple atoi results…

asked by the_drow (18,571)
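For the fixed-width case, the standard trick looks roughly like this (a sketch, not RapidJSON's actual code; `atoi8_sse2` is an invented name, and it assumes exactly 8 valid ASCII digits with no sign, validation, or overflow handling):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

uint32_t atoi8_sse2(const char* p)
{
    __m128i zero = _mm_setzero_si128();
    // Load 8 bytes and zero-extend each digit byte to a 16-bit lane.
    __m128i chunk  = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p));
    __m128i digits = _mm_sub_epi16(_mm_unpacklo_epi8(chunk, zero),
                                   _mm_set1_epi16('0'));
    // Combine adjacent digits: d0*10+d1, d2*10+d3, d4*10+d5, d6*10+d7.
    __m128i pairs = _mm_madd_epi16(digits,
                                   _mm_setr_epi16(10, 1, 10, 1, 10, 1, 10, 1));
    // Narrow the four pair values to 16 bits and combine again:
    // (d0d1)*100 + (d2d3)  and  (d4d5)*100 + (d6d7).
    __m128i quads = _mm_madd_epi16(_mm_packs_epi32(pairs, zero),
                                   _mm_setr_epi16(100, 1, 100, 1, 0, 0, 0, 0));
    uint32_t hi = static_cast<uint32_t>(_mm_cvtsi128_si32(quads));
    uint32_t lo = static_cast<uint32_t>(
        _mm_cvtsi128_si32(_mm_srli_si128(quads, 4)));
    return hi * 10000 + lo;
}
```

The `_mm_madd_epi16` multiply-accumulate does the digit combining two levels at a time, which is what makes the SIMD version competitive with the scalar loop.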
29 votes · 5 answers
Good portable SIMD library
Can anyone recommend a portable SIMD library that provides a C/C++ API, works with Intel and AMD vector extensions, and is compatible with Visual Studio and GCC? I'm looking to speed up things like scaling a 512x512 array of doubles. Vector dot products, matrix…

asked by Budric (3,599)
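The scaling operation mentioned in the question is simple enough to write directly with SSE2 intrinsics, which every x86-64 compiler supports (a sketch; `scale` is an invented name, and a portable library would dispatch something similar per target):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>

// a[i] *= s for all i: two doubles per iteration, scalar tail for odd n.
void scale(double* a, std::size_t n, double s)
{
    __m128d vs = _mm_set1_pd(s);
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2)
        _mm_storeu_pd(a + i, _mm_mul_pd(_mm_loadu_pd(a + i), vs));
    for (; i < n; ++i)
        a[i] *= s;
}
```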
29 votes · 1 answer
Is my understanding of AoS vs SoA advantages/disadvantages correct?
I've recently been reading about AoS vs SoA structure design and data-oriented design. It's oddly difficult to find information about either, and what I have found seems to assume greater understanding of processor functionality than I possess. That…

asked by P... (655)
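The two layouts being contrasted can be sketched like this (type and function names are invented for illustration): in SoA, each field is a dense, unit-stride stream that a SIMD loop can load directly, while AoS interleaves fields the loop may never touch.

```cpp
#include <cstddef>

// AoS: fields interleaved, one point per struct.
struct PointAoS { float x, y, z; };

// SoA: one contiguous array per field.
struct PointsSoA {
    float* x;
    float* y;
    float* z;
};

// Summing x over SoA reads one dense stream; the AoS equivalent
// strides past y and z on every element.
float sum_x_soa(const PointsSoA& p, std::size_t n)
{
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += p.x[i];
    return s;
}
```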
29 votes · 5 answers
Get member of __m128 by index?
I've got some code, originally given to me by someone working with MSVC, and I'm trying to get it to work on Clang. Here's the function that I'm having trouble with:
float vectorGetByIndex( __m128 V, unsigned int i )
{
    assert( i <= 3 );
    …

asked by benwad (6,414)
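One portable way to complete a function like this is to spill the register to memory and index it (a sketch under that approach; compilers typically turn the store-and-reload into a shuffle or extract anyway):

```cpp
#include <xmmintrin.h>  // SSE
#include <cassert>

float vectorGetByIndex(__m128 V, unsigned int i)
{
    assert(i <= 3);
    alignas(16) float tmp[4];
    _mm_store_ps(tmp, V);  // spill all four lanes to the stack
    return tmp[i];
}
```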
28 votes · 3 answers
How can I exchange the low 128 bits and high 128 bits in a 256 bit AVX (YMM) register
I am porting SSE SIMD code to use the 256 bit AVX extensions and cannot seem to find any instruction that will blend/shuffle/move the high 128 bits and the low 128 bits.
The backing story:
What I really want is VHADDPS/_mm256_hadd_ps to act like…

asked by Mark Borgerding (8,117)
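`_mm256_permute2f128_ps` with immediate `0x01` does exactly this lane swap (a sketch; `swap_halves` is an invented wrapper, and running it requires an AVX-capable CPU):

```cpp
#include <immintrin.h>

// Swap the low and high 128-bit halves of a YMM value.
// imm 0x01: result low lane = source high half, high lane = source low half.
__attribute__((target("avx")))
void swap_halves(const float in[8], float out[8])
{
    __m256 v = _mm256_loadu_ps(in);
    _mm256_storeu_ps(out, _mm256_permute2f128_ps(v, v, 0x01));
}
```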
28 votes · 19 answers
How fast can you make linear search?
I'm looking to optimize this linear search:
static int
linear (const int *arr, int n, int key)
{
    int i = 0;
    while (i < n) {
        if (arr [i] >= key)
            break;
        ++i;
    }
    …

asked by Mark Probst (7,107)
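One common answer is to compare four elements per iteration and use a movemask to locate the first hit (a sketch with an invented name `linear_sse2`; `__builtin_ctz` assumes GCC/Clang):

```cpp
#include <emmintrin.h>  // SSE2

// Return the first index with arr[i] >= key, or n if none.
static int linear_sse2(const int* arr, int n, int key)
{
    __m128i vkey = _mm_set1_epi32(key);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(arr + i));
        // 4 bits of mask per 32-bit lane; all-ones where arr[j] < key.
        int m = _mm_movemask_epi8(_mm_cmplt_epi32(v, vkey));
        if (m != 0xFFFF)                       // some lane is >= key
            return i + __builtin_ctz(~m) / 4;  // first such lane
    }
    for (; i < n; ++i)                         // scalar tail
        if (arr[i] >= key)
            return i;
    return n;
}
```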
27 votes · 3 answers
How to efficiently perform double/int64 conversions with SSE/AVX?
SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers.
_mm_cvtps_epi32()
_mm_cvtepi32_ps()
But there are no equivalents for double-precision and 64-bit integers. In other words, they are…

asked by plasmacel (8,183)
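A well-known workaround for the double-to-int64 direction exploits the binary64 format (a sketch; valid only for values with magnitude below 2^51, and it rounds to nearest rather than truncating):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Adding 1.5 * 2^52 forces the integer value into the low mantissa
// bits; subtracting the constant's bit pattern as int64 recovers it.
__m128i double_to_int64(__m128d x)
{
    const __m128d magic = _mm_set1_pd(6755399441055744.0);  // 1.5 * 2^52
    x = _mm_add_pd(x, magic);
    return _mm_sub_epi64(_mm_castpd_si128(x), _mm_castpd_si128(magic));
}
```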
27 votes · 2 answers
SIMD and difference between packed and scalar double precision
I am reading Intel's intrinsics guide while implementing SIMD support. I have a few points of confusion, and my questions are below.
__m128 _mm_cmpeq_ps (__m128 a, __m128 b) documentation says it is used to compare packed single precision floating points.…

asked by user1461001 (693)
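The packed/scalar distinction is easiest to see side by side (a sketch using the `pd`/`sd` add pair; `packed_vs_scalar` is an invented name): packed operates on every lane, scalar only on the low lane, with the upper lane copied from the first operand.

```cpp
#include <emmintrin.h>  // SSE2

void packed_vs_scalar(double out_p[2], double out_s[2])
{
    __m128d a = _mm_setr_pd(1.0, 2.0);
    __m128d b = _mm_setr_pd(10.0, 20.0);
    _mm_storeu_pd(out_p, _mm_add_pd(a, b));  // packed: {11, 22}
    _mm_storeu_pd(out_s, _mm_add_sd(a, b));  // scalar: {11, 2} (high from a)
}
```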
26 votes · 2 answers
Haskell math performance on multiply-add operation
I'm writing a game in Haskell, and my current pass at the UI involves a lot of procedural generation of geometry. I am currently focused on identifying performance of one particular operation (C-ish pseudocode):
Vec4f multiplier, addend;
Vec4f…

asked by Steven Robertson (473)
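The operation in the excerpt maps directly onto one packed multiply plus one packed add (a C++ intrinsics sketch of the same C-ish pseudocode; on FMA-capable targets `_mm_fmadd_ps` fuses the two):

```cpp
#include <xmmintrin.h>  // SSE

// out = v * multiplier + addend, four floats at a time.
__m128 mul_add(__m128 v, __m128 multiplier, __m128 addend)
{
    return _mm_add_ps(_mm_mul_ps(v, multiplier), addend);
}
```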
26 votes · 2 answers
How are the gather instructions in AVX2 implemented?
Suppose I'm using AVX2's VGATHERDPS - this should load 8 single-precision floats using 8 DWORD indices.
What happens when the data to be loaded exists in different cache-lines? Is the instruction implemented as a hardware loop which fetches…

asked by Anuj Kalia (803)
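How the hardware sequences the loads is microarchitectural, but for reference the usage looks like this (a sketch; `gather8` is an invented wrapper, and running it requires an AVX2-capable CPU):

```cpp
#include <immintrin.h>

// out[i] = base[idx[i]] for 8 floats via VGATHERDPS.
__attribute__((target("avx2")))
void gather8(const float* base, const int idx[8], float out[8])
{
    __m256i vidx = _mm256_loadu_si256(
        reinterpret_cast<const __m256i*>(idx));
    __m256 v = _mm256_i32gather_ps(base, vidx, 4);  // scale = 4 bytes/elem
    _mm256_storeu_ps(out, v);
}
```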
26 votes · 5 answers
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?
The intrinsic:
int mask = _mm256_movemask_epi8(__m256i s1)
creates a mask, with its 32 bits corresponding to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2 for example) I would like to perform…

asked by Satya Arjunan (575)
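The 16-byte SSE2 version of the inverse illustrates the usual technique (a sketch with an invented name; the 32-byte AVX2 analogue broadcasts and tests bits the same way): replicate each mask byte across its lanes, AND with per-lane bit selectors, and compare.

```cpp
#include <emmintrin.h>  // SSE2

// Inverse of the 16-bit _mm_movemask_epi8:
// bit i of mask becomes byte i = 0xFF (set) or 0x00 (clear).
__m128i inverse_movemask_epi8(int mask)
{
    __m128i lo = _mm_set1_epi8(static_cast<char>(mask & 0xFF));
    __m128i hi = _mm_set1_epi8(static_cast<char>((mask >> 8) & 0xFF));
    __m128i v  = _mm_unpacklo_epi64(lo, hi);  // byte i holds its mask byte
    const __m128i bits = _mm_setr_epi8(1, 2, 4, 8, 16, 32, 64, -128,
                                       1, 2, 4, 8, 16, 32, 64, -128);
    return _mm_cmpeq_epi8(_mm_and_si128(v, bits), bits);
}
```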
25 votes · 3 answers
SSE (SIMD): multiply vector by scalar
A common operation I do in my program is scaling vectors by a scalar (V*s, e.g. [1,2,3,4]*2 == [2,4,6,8]). Is there an SSE (or AVX) instruction to do this, other than first loading the scalar in every position in a vector (e.g. _mm_set_ps(2,2,2,2))…

asked by Hallgeir (1,213)
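The idiomatic answer is `_mm_set1_ps` to broadcast the scalar once, then a single packed multiply (a sketch; `scale_ps` is an invented name, and AVX adds a dedicated `vbroadcastss` the compiler will use where available):

```cpp
#include <xmmintrin.h>  // SSE

// v * s: broadcast the scalar into all four lanes, then multiply.
__m128 scale_ps(__m128 v, float s)
{
    return _mm_mul_ps(v, _mm_set1_ps(s));
}
```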
25 votes · 1 answer
GCC fails to optimize aligned std::array like C array
Here's some code which GCC 6 and 7 fail to optimize when using std::array:
#include <array>
static constexpr size_t my_elements = 8;
class Foo
{
public:
#ifdef C_ARRAY
    typedef double Vec[my_elements] alignas(32);
#else
    typedef…

asked by John Zwinck (239,568)
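The std::array variant being compared can be written with the alignment attached to the member rather than the typedef (a sketch; whether GCC 6/7 vectorizes loops over it as well as the C-array form is exactly what the question is about):

```cpp
#include <array>
#include <cstddef>

static constexpr std::size_t my_elements = 8;

struct Foo {
    // alignas on the member guarantees 32-byte alignment of the data.
    alignas(32) std::array<double, my_elements> v;
};

double sum(const Foo& f)
{
    double s = 0.0;
    for (double d : f.v)
        s += d;
    return s;
}
```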
25 votes · 2 answers
Expensive to wrap System.Numerics.VectorX - why?
TL;DR: Why is wrapping the System.Numerics.Vectors type expensive, and is there anything I can do about it?
Consider the following piece of code:
[MethodImpl(MethodImplOptions.NoInlining)]
private static long GetIt(long a, long b)
{
    var x =…

asked by Krumelur (31,081)
24 votes · 4 answers
How to move 128-bit immediates to XMM registers
There already is a question on this, but it was closed as "ambiguous" so I'm opening a new one - I've found the answer, maybe it will help others too.
The question is: how do you write a sequence of assembly code to initialize an XMM register with a…

asked by Virgil (3,022)
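There is no true 128-bit immediate form; at the intrinsics level the usual route is to let the compiler place the constant in .rodata and load it (a sketch; `make_const` and the constant values are invented for this example):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// The compiler emits the 128-bit constant into .rodata and
// initializes the XMM register with a single movdqa/movaps load.
__m128i make_const()
{
    return _mm_set_epi64x(INT64_C(0x0123456789abcdef),   // high 64 bits
                          INT64_C(0x0f0e0d0c0b0a0908));  // low 64 bits
}
```

In hand-written assembly the equivalent is an aligned 16-byte constant in a data section plus a `movaps xmm0, [rel constant]`.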