Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include: x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To efficiently use SIMD instructions, data needs to be in structure-of-arrays form and should occur in longer streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.
Questions tagged [simd]
2540 questions
2
votes
1 answer
Trying to add an __m128 using an and mask in SSE programming
I am trying to use the result of a compare operation to add to an SSE variable. I have just realised that when using the _mm_cmplt_ps operation if the result is true it returns a NAN because 0xffffffff can't be represented which is of no use to…

user1850254
2
votes
1 answer
Using XMVECTOR from DirectXMath as a class member causes a crash only in Release Mode?
I've been trying to use XMVECTOR as a class member for a bounding box, since I do a lot of calculations, but I use the XMFLOAT3 only once per frame, so the bounding box has a method that gives me its center in an XMFLOAT3, otherwise it stays in a…

ulak blade
2
votes
4 answers
SIMD intrinsics - are they usable on gpus?
I'm wondering if I can use SIMD intrinsics in GPU code, such as a CUDA kernel or an OpenCL one. Is that possible?

Johnny Pauling
2
votes
1 answer
Neon VLD consuming more cycles than what is expected?
I have some simple asm code that loads 12 NEON quad registers, with pairwise add instructions interleaved with the load instructions (to exploit the dual-issue capability). I have verified the code…

nguns
2
votes
1 answer
Forcing automatic vectorization with GCC
Here is my very simple question. With ICC I know it is possible to use #pragma simd to force vectorization of loops that the compiler chooses not to vectorize. Is there something analogous in GCC? Or, is there any plan to add this feature in a future…

user2047635
2
votes
1 answer
xmm instructions - segmentation fault with memory source operand
I'm trying to add 4 numbers to 4 other numbers in assembly language with SSE2 instructions, using XMM registers. I did succeed, but I came across something I didn't understand.
If I do the addition this way:
movdqu xmm0, oword [var1]
movdqu xmm1,…

Catalin Vasile
2
votes
2 answers
SSE operation on 4 arrays of integer size
Sorry for the previous non-descriptive question. Please allow me to rephrase the question again:
The setup:
I need to do ADD and some bitwise operations on four 32-bit values from 4 arrays at the same time using SSE. All the elements in these 4 arrays…

fiftyplus
2
votes
3 answers
assembly intrinsic to do a masked load
int main()
{
const int STRIDE=2,SIZE=8192;
int i=0;
double u[SIZE][STRIDE];
#pragma vector aligned
for(i=0;i

arunmoezhi
2
votes
2 answers
Avoiding invalid memory load with SIMD instructions
I am loading elements from memory using SIMD load instructions, let's say using AltiVec, assuming aligned addresses:
float X[SIZE];
vector float V0;
unsigned FLOAT_VEC_SIZE = sizeof(vector float);
for (int load_index =0; load_index < SIZE;…

fsheikh
2
votes
2 answers
Fast Saturate and shift two Halfwords in ARM asm
I have two signed 16-bit values in a 32-bit word, and I need to shift them right (divide) by a constant value (it can be from 1 to 6) and saturate to a byte (0..0xFF).
For example,
0x FFE1 00AA with shift=5 must become 0x 0000 0005;
0x 2345 1234 must…

zxcat
2
votes
0 answers
SSE floating point dot product for dummies
I have read many SO questions about SSE/SIMD (e.g., Getting started with SSE), but I'm still confused by all of it. All I want is a dot product between two double precision floating-point vectors, in C (C99 FWIW). I'm using GCC.
Can someone post a…

purple51
2
votes
1 answer
Are arrays initialized like `float[10][10]` already memory aligned for SIMD/SSE?
I need to optimize my matrix multiplication by using SIMD/Intel SSE. The example code given looks like:
*x = (float*)memalign(16, size * sizeof(float));
However, I am using C++ and found that instead of malloc (before doing SIMD), I should…

Jiew Meng
2
votes
1 answer
ROS (Robot Operating System) with SSSE3 flag
I started working with ROS lately and got stuck on one problem. I need to use some classes which require the SSE2, SSE3 and SSSE3 CPU extensions.
I tried to edit the manifest.xml file of my ROS Package like

SolvedForHome
2
votes
1 answer
Is it possible to execute MIMD with OpenCL framework?
Soon enough we will have the nVidia GTX 300, which would be able to execute multiple instructions on multiple data (MIMD). I wonder if OpenCL can execute MIMD?

Roman Kagan
2
votes
1 answer
How to align 16-bit ints for use with SSE intrinsics
I am working with two-dimensional arrays of 16-bit integers defined as
int16_t e[MAX_SIZE*MAX_NODE][MAX_SIZE];
int16_t C[MAX_SIZE][MAX_SIZE];
where MAX_SIZE and MAX_NODE are constant values. I'm not a professional programmer, but somehow with the…

SMir