Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include: x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To efficiently use SIMD instructions, data needs to be in structure-of-arrays form and should occur in longer streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.
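As a minimal sketch of the structure-of-arrays point above: the layout below keeps each field in its own contiguous stream, so a loop over the points reads adjacent floats that a compiler may be able to vectorize. The names `PointAoS`, `PointsSoA`, `to_soa`, and `squared_lengths` are hypothetical and chosen only for illustration. No intrinsics are used, and whether the loop is actually vectorized depends on the compiler and flags.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures (hypothetical type): the x, y, z of one point sit next
// to each other, so consecutive x values are 3 floats apart in memory.
struct PointAoS {
    float x, y, z;
};

// Structure-of-arrays (hypothetical type): each field is one contiguous stream.
struct PointsSoA {
    std::vector<float> x, y, z;
    explicit PointsSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// Convert an AoS buffer into the SoA layout.
PointsSoA to_soa(const std::vector<PointAoS>& in) {
    PointsSoA out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out.x[i] = in[i].x;
        out.y[i] = in[i].y;
        out.z[i] = in[i].z;
    }
    return out;
}

// Squared length of every point. Each iteration reads the same index from
// three contiguous arrays, which is the access pattern SIMD loads prefer.
void squared_lengths(const PointsSoA& p, std::vector<float>& out) {
    const std::size_t n = p.x.size();
    out.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = p.x[i] * p.x[i] + p.y[i] * p.y[i] + p.z[i] * p.z[i];
    }
}
```

The conversion itself costs a pass over the data, so it tends to pay off only when the SoA buffers are reused across several passes or long streams.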

2540 questions
2 votes, 1 answer

How to achieve 8bit madd using SSE2

Reading from the official Intel C++ Intrinsic Reference, SSE2 has the following intrinsic: __m128i _mm_madd_epi16(__m128i a, __m128i b). Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b. Adds the signed 32-bit…
adkalkan
2 votes, 1 answer

Build 2^n in double/simd

I am trying to build 2^n using the double representation. The trick is (well) known // tips to calculate 2^n using the exponent of the double IEEE representation union ieee754{ double d; uint32_t i[2]; }; // Converts an unsigned long long…
Timocafé
2 votes, 1 answer

SIMD integer store

I am writing a program using SSE instructions to multiply and add integer values. I did the same program with floats, but I am missing an instruction for my integer version. With floats, after I have finished all my operations, I return the values…
Thudor
2 votes, 2 answers

When does data move around between SSE registers and the stack?

I'm not exactly sure what happens when I call _mm_load_ps. I mean, I know I load an array of 4 floats into a __m128, which I can use to do SIMD-accelerated arithmetic and then store them back, but isn't this __m128 data type still on the stack? I…
ulak blade
2 votes, 1 answer

Reverse a string using SSE

How do we reverse a string using SSE? This concept is new to me, so please give me some information about it. The reason is that someone says using SSE will speed up the code at run time. I have searched for SSE which is _mm128 but don't…
2 votes, 1 answer

Where to initialize SSE constants

My question is about the most efficient place to define __m128/__m128i compile time constants in intrinsics based code. Considering two options: Option A __m128i Foo::DoMasking(const __m128i value) const { //defined in method const __m128i…
Rotem
2 votes, 2 answers

Extract 4 SSE integers to 4 chars

Suppose I have a __m128i containing 4 32-bit integer values. Is there some way I can store it inside a char[4], where the lower char from each int value is stored in a char value? Desired result: r1 r2 r3 …
Rotem
2 votes, 1 answer

Can I store only 96 bit of 128 with SSE instructions?

_mm_store_ps stores (for example) 128 bits into 4 float elements of an array. Can I store only 96 bits, or rather, only the first 3 bytes in 3 elements of the array (with SSE instructions)? I explained myself badly: I do not want to mask the bits. I would like…
2 votes, 1 answer

How to use ARM NEON vbit intrinsics?

I don't understand how to differentiate between vbit, vbsl and vbif with NEON intrinsics. I need to do the vbit operation, but if I use the vbslq instruction from the intrinsics I don't get what I want. For example, I have a source vector like…
user1926328
2 votes, 1 answer

Unable to activate the SSE instruction set by "-march=native" in gcc or any other flags in Core2 chip

My machine uses the Core2 microarchitecture, and I tried to compile some arithmetic code targeting the SSE instruction set. I searched the web and the official manual, and I believe that all I need to do is to add the flag -march=native, because my chip…
2 votes, 2 answers

Which registers do x86/x64 processors use for floating point math?

Does x86/x64 use SIMD registers for high-precision floating point operations, or dedicated FP registers? I mean the high-precision version, not regular double precision.
user2341104
2 votes, 3 answers

Speeding up Newton's Method for finding nth root

Let me preface this question with a statement: this code works as intended, but it is very, very slow for what it is. Is there a way to make the Newton method converge faster, or a way to set a __m256 var equal to a single float without…
2 votes, 1 answer

_mm256_testz_pd not working?

I'm working on Core i7 on Linux and using g++ 4.63. I tried the following code: #include #include int main() { __m256d a = _mm256_set_pd(1,2,3,4); __m256d z = _mm256_setzero_pd(); std::cout << _mm256_testz_pd(a,a) <<…
Ming
2 votes, 1 answer

Dynamically allocate SIMD Vector as array of doubles

I'm new to vectors and I've been having a read of the gcc documentation trying to get my head around it. Is it possible to dynamically allocate the size of a vector at run time? It appears as though you have to do this in the typedef like: typedef…
samturner
2 votes, 1 answer

Loading non-contiguous floats using SSE

Is there an Intel SSE instruction which can load floats from (non contiguous) evenly spaced memory addresses? For example given an array A = {0, 1, 2, 3 .... n}, I would like to load into a 128 bit register at once {A[0], A[4], A[8], A[12]},…
jaynp