Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk, or vector, of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data needs to be in structure-of-arrays form and should occur in longer streams. Code that is naively "SIMD optimized" frequently surprises by running slower than the original.
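A minimal sketch (not part of the tag description) of the array-of-structures vs. structure-of-arrays distinction mentioned above; the names are illustrative only:

// AoS: the x values of successive points are strided in memory.
struct PointAoS { float x, y, z; };
// SoA: all x values are contiguous, so one SSE load fills a register
// with four useful elements.
struct PointsSoA {
    float x[1024];
    float y[1024];
    float z[1024];
};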
Questions tagged [simd]
2540 questions
2 votes, 1 answer
How to achieve an 8-bit madd using SSE2
Reading from the official Intel C++ Intrinsics Reference, SSE2 has the following intrinsic:
__m128i _mm_madd_epi16(__m128i a, __m128i b)
Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b.
Adds the signed 32-bit…

adkalkan
- 69
- 1
- 7
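One possible SSE2-only approach, sketched here rather than taken from the answer: sign-extend the bytes to 16-bit lanes and reuse _mm_madd_epi16. The function name is made up for illustration.

#include <emmintrin.h>  // SSE2

__m128i madd_epi8_sse2(__m128i a, __m128i b)
{
    // SSE2 has no pmovsxbw, so sign-extend by duplicating each byte into a
    // 16-bit lane and arithmetic-shifting it back down.
    __m128i a_lo = _mm_srai_epi16(_mm_unpacklo_epi8(a, a), 8);
    __m128i b_lo = _mm_srai_epi16(_mm_unpacklo_epi8(b, b), 8);
    __m128i a_hi = _mm_srai_epi16(_mm_unpackhi_epi8(a, a), 8);
    __m128i b_hi = _mm_srai_epi16(_mm_unpackhi_epi8(b, b), 8);
    // Each 32-bit lane of the result accumulates four signed 8-bit products
    // (two from the low half plus two from the high half), which is fine
    // when the final goal is a horizontal sum.
    return _mm_add_epi32(_mm_madd_epi16(a_lo, b_lo),
                         _mm_madd_epi16(a_hi, b_hi));
}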
2 votes, 1 answer
Build 2^n in double/simd
I am trying to build 2^n using the double representation. The trick is (well) known
// tips to calculate 2^n using the exponent of the double IEEE representation
union ieee754{
double d;
uint32_t i[2];
};
// Converts an unsigned long long…

Timocafé
- 765
- 6
- 18
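A sketch of the usual trick, assuming n stays in the normal exponent range (-1022..1023). memcpy is used instead of the union to avoid type-punning problems in C++, and the SSE2 variant is an assumption, not the asker's code.

#include <cstdint>
#include <cstring>
#include <emmintrin.h>

double pow2_scalar(int64_t n)
{
    // Write the biased exponent (n + 1023) into bits 52..62 of a double.
    uint64_t bits = static_cast<uint64_t>(n + 1023) << 52;
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// The same idea for two exponents at once.
__m128d pow2_sse2(__m128i n)  // two signed 64-bit exponents
{
    __m128i biased = _mm_add_epi64(n, _mm_set1_epi64x(1023));
    return _mm_castsi128_pd(_mm_slli_epi64(biased, 52));
}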
2 votes, 1 answer
SIMD integer store
I am writing a program using SSE instructions to multiply and add integer values. I did the same program with floats, but I am missing an instruction for my integer version.
With floats, after I have finished all my operations, I return the values…

Thudor
- 349
- 2
- 7
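For reference, a sketch of the integer counterparts of _mm_store_ps: _mm_store_si128 (aligned) and _mm_storeu_si128 (unaligned). The surrounding arithmetic is only illustrative.

#include <emmintrin.h>
#include <cstdint>

void add_and_store(int32_t* dst)  // dst assumed 16-byte aligned
{
    __m128i a = _mm_set_epi32(4, 3, 2, 1);
    __m128i b = _mm_set1_epi32(10);
    __m128i r = _mm_add_epi32(a, b);
    _mm_store_si128(reinterpret_cast<__m128i*>(dst), r);  // aligned store
    // _mm_storeu_si128 is the unaligned form, analogous to _mm_storeu_ps.
}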
2 votes, 2 answers
When does data move around between SSE registers and the stack?
I'm not exactly sure what happens when I call _mm_load_ps. I know it loads an array of 4 floats into a __m128, which I can use for SIMD-accelerated arithmetic and then store back, but isn't this __m128 data type still on the stack? I…

ulak blade
- 2,515
- 5
- 37
- 81
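A small illustrative sketch (assuming an optimizing build and a 16-byte-aligned pointer): a __m128 local normally lives in an XMM register and only spills to the stack under register pressure or in unoptimized builds.

#include <xmmintrin.h>

float sum4(const float* p)
{
    __m128 v = _mm_load_ps(p);    // one movaps load from memory
    __m128 s = _mm_add_ps(v, v);  // stays in an XMM register, no stack traffic
    float out[4];
    _mm_storeu_ps(out, s);        // an explicit store back to memory
    return out[0] + out[1] + out[2] + out[3];
}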
2 votes, 1 answer
Reverse a string using SSE
How do we reverse a string using SSE? This concept is new to me, so please give me some information about it. The reason is that someone said using SSE would speed up the code at run time.
I have searched for SSE, which is __m128, but don't…

Squall Leonahart
- 31
- 2
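One way to reverse 16 bytes at a time is a byte shuffle; note that _mm_shuffle_epi8 needs SSSE3, and a full string reverse would still have to swap blocks from the two ends of the buffer. A sketch with an illustrative name:

#include <tmmintrin.h>  // SSSE3

__m128i reverse16(__m128i v)
{
    // Shuffle control: result byte i takes source byte 15 - i.
    const __m128i idx = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                     8, 9, 10, 11, 12, 13, 14, 15);
    return _mm_shuffle_epi8(v, idx);
}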
2 votes, 1 answer
Where to initialize SSE constants
My question is about the most efficient place to define __m128/__m128i compile-time constants in intrinsics-based code.
Considering two options:
Option A
__m128i Foo::DoMasking(const __m128i value) const
{
//defined in method
const __m128i…

Rotem
- 21,452
- 6
- 62
- 109
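A sketch of the two common options, assuming the mask value itself is irrelevant here. With optimization on, a method-local _mm_set1_epi32 constant is typically materialized as a single load from a read-only 16-byte literal, so Option A usually costs little; a function-local static is initialized once but adds a guard check in C++.

#include <emmintrin.h>

// Option A: defined in the method; the compiler normally hoists it.
__m128i do_masking_local(__m128i value)
{
    const __m128i mask = _mm_set1_epi32(0x00FF00FF);
    return _mm_and_si128(value, mask);
}

// Option B: a function-local static, initialized on first call.
__m128i do_masking_static(__m128i value)
{
    static const __m128i mask = _mm_set1_epi32(0x00FF00FF);
    return _mm_and_si128(value, mask);
}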
2 votes, 2 answers
Extract 4 SSE integers to 4 chars
Suppose I have a __m128i containing 4 32-bit integer values.
Is there some way I can store it inside a char[4], where the low byte of each int value is stored in one char?
Desired result:
r1 r2 r3 …

Rotem
- 21,452
- 6
- 62
- 109
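One possible approach (SSSE3, not plain SSE2), sketched with illustrative names: a byte shuffle gathers the low byte of each 32-bit lane into the low dword, which is then stored as 4 chars. With only SSE2 one would use the pack instructions, which saturate.

#include <tmmintrin.h>
#include <cstring>

void low_bytes(__m128i v, char out[4])
{
    // Pick bytes 0, 4, 8, 12 (the low byte of each int); -1 zeroes a lane.
    const __m128i idx = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1,
                                     -1, -1, -1, -1, 12, 8, 4, 0);
    __m128i packed = _mm_shuffle_epi8(v, idx);
    int low = _mm_cvtsi128_si32(packed);  // the 4 wanted bytes
    std::memcpy(out, &low, 4);
}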
2 votes, 1 answer
Can I store only 96 bit of 128 with SSE instructions?
_mm_store_ps stores (for example) 128 bits into 4 float elements of an array.
Can I store only 96 bits? Or rather, only the first 3 values into 3 elements of the array (with SSE instructions)?
I explained myself badly: I do not want to mask the bits. I would like…

user2120196
- 61
- 4
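A sketch of one common way to store just the low three floats (96 bits) without touching the fourth element's memory; the name is illustrative.

#include <xmmintrin.h>

void store3(float* dst, __m128 v)
{
    _mm_storel_pi(reinterpret_cast<__m64*>(dst), v);  // store the low 2 floats
    __m128 third = _mm_movehl_ps(v, v);               // lane 0 now holds v[2]
    _mm_store_ss(dst + 2, third);                     // store the 3rd float
}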
2 votes, 1 answer
How to use ARM NEON vbit intrinsics?
I don't understand how to differentiate between vbit, vbsl and vbif with NEON intrinsics. I need to do the vbit operation, but if I use the vbslq intrinsic I don't get what I want.
For example, I have a source vector like…

user1926328
- 147
- 2
- 10
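For orientation, a sketch assuming NEON intrinsics in C: vbslq covers VBSL, VBIT and VBIF, which differ only in which register acts as the mask and which one is overwritten; with intrinsics the compiler chooses the encoding.

#include <arm_neon.h>

uint8x16_t select_bits(uint8x16_t mask, uint8x16_t a, uint8x16_t b)
{
    // Bitwise select: (a & mask) | (b & ~mask).
    return vbslq_u8(mask, a, b);
}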
2 votes, 1 answer
Unable to activate the SSE instruction set with "-march=native" or any other gcc flags on a Core2 chip
My machine has a Core2 microarchitecture, and I tried to compile some arithmetic code targeting the SSE instruction set. I searched the web and the official manual, and I believe that all I need to do is add the flag -march=native, because my chip…

user2719257
- 31
- 3
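A minimal test case (an assumption, not the asker's code) for GCC auto-vectorization. Core 2 supports SSE up to SSSE3, so either -march=native on that host or an explicit -march=core2 should allow SSE code generation; inspecting the assembly produced with -S for instructions like addps/mulps confirms it.

// Build with, e.g.:  g++ -O3 -march=native -S axpy.cpp
//               or:  g++ -O3 -march=core2  -S axpy.cpp
void axpy(float* __restrict y, const float* __restrict x, float a, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}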
2 votes, 2 answers
Which registers do x86/x64 processors use for floating point math?
Does x86/x64 use SIMD registers for high-precision floating point operations, or dedicated FP registers?
I mean the high-precision version, not regular double precision.
user2341104
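A tiny illustrative sketch (assuming the x86-64 System V ABI): regular float/double arithmetic uses the SSE XMM registers, whereas 80-bit long double, the "high precision" case, still goes through the x87 register stack st(0)..st(7).

double add_d(double a, double b) { return a + b; }                  // addsd, XMM registers
long double add_ld(long double a, long double b) { return a + b; }  // x87 fadd, st(i) registers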
2 votes, 3 answers
Speeding up Newton's Method for finding nth root
Let me preface this question with a statement: this code works as intended, but it is very, very slow for what it is. Is there a way to make Newton's method converge faster, or a way to set a __m256 var equal to a single float without…

Mercutio Calviary
- 184
- 10
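A sketch of both parts of the question, with illustrative names: _mm256_set1_ps broadcasts a single float into every lane of a __m256, and one vectorized Newton step for the cube root (n = 3) uses x_next = (2*x + a/x^2) / 3.

#include <immintrin.h>

__m256 cbrt_newton_step(__m256 x, __m256 a)
{
    const __m256 two   = _mm256_set1_ps(2.0f);        // broadcast a scalar
    const __m256 third = _mm256_set1_ps(1.0f / 3.0f);
    __m256 x2 = _mm256_mul_ps(x, x);
    return _mm256_mul_ps(third,
                         _mm256_add_ps(_mm256_mul_ps(two, x),
                                       _mm256_div_ps(a, x2)));
}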
2 votes, 1 answer
_mm256_testz_pd not working?
I'm working on a Core i7 on Linux and using g++ 4.6.3.
I tried the following code:
#include <immintrin.h>
#include <iostream>
int main() {
    __m256d a = _mm256_set_pd(1,2,3,4);
    __m256d z = _mm256_setzero_pd();
    std::cout << _mm256_testz_pd(a,a) <<…

Ming
- 365
- 2
- 12
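For context, a hedged note and sketch: _mm256_testz_pd only examines the sign bits (it returns 1 when a AND b has no sign bit set in any lane), so it is not an "is this vector all zero?" test. One way to test for all-zero doubles:

#include <immintrin.h>

int all_zero(__m256d v)
{
    __m256d eq = _mm256_cmp_pd(v, _mm256_setzero_pd(), _CMP_EQ_OQ);
    return _mm256_movemask_pd(eq) == 0xF;  // all four lanes compared equal to 0
}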
2 votes, 1 answer
Dynamically allocate SIMD Vector as array of doubles
I'm new to vectors and I've been having a read of the gcc documentation trying to get my head around it.
Is it possible to dynamically allocate the size of a vector at run time? It appears as though you have to do this in the typedef like:
typedef…

samturner
- 2,213
- 5
- 25
- 31
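A sketch assuming GCC's vector_size extension is what is meant: the vector type's width is fixed at compile time, so the usual pattern is to allocate a runtime-sized, aligned array of doubles and walk it in fixed-width chunks (C++17 std::aligned_alloc here; names are illustrative).

#include <cstdlib>
#include <cstddef>

typedef double v4d __attribute__((vector_size(32)));  // 4 doubles, size fixed at compile time

void scale(std::size_t n, double s)
{
    // Round the byte count up to a multiple of the alignment, as
    // std::aligned_alloc requires.
    std::size_t bytes = ((n * sizeof(double) + 31) / 32) * 32;
    double* data = static_cast<double*>(std::aligned_alloc(32, bytes));
    // ... fill data ...
    v4d vs = { s, s, s, s };
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v4d v = *reinterpret_cast<v4d*>(data + i);   // one aligned 32-byte chunk
        *reinterpret_cast<v4d*>(data + i) = v * vs;  // element-wise multiply
    }
    for (; i < n; ++i)
        data[i] *= s;                                // scalar tail
    std::free(data);
}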
2 votes, 1 answer
Loading non-contiguous floats using SSE
Is there an Intel SSE instruction which can load floats from (non-contiguous) evenly spaced memory addresses?
For example given an array A = {0, 1, 2, 3 .... n}, I would like to load into a 128 bit register at once {A[0], A[4], A[8], A[12]},…

jaynp
- 3,275
- 4
- 30
- 43
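A sketch of the usual answers, with illustrative names: plain SSE has no gather instruction, so strided elements are typically collected with scalar loads that the compiler turns into a load-and-shuffle sequence; AVX2 adds a true gather, _mm_i32gather_ps.

#include <immintrin.h>

__m128 load_stride4(const float* a)
{
    // _mm_set_ps lists lanes from high to low, so lane 0 receives a[0].
    return _mm_set_ps(a[12], a[8], a[4], a[0]);
}

#ifdef __AVX2__
__m128 load_stride4_gather(const float* a)
{
    const __m128i idx = _mm_set_epi32(12, 8, 4, 0);  // 32-bit element indices
    return _mm_i32gather_ps(a, idx, 4);              // scale = sizeof(float)
}
#endif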