Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data generally needs to be in structure-of-arrays form and to arrive in long, contiguous streams. Naively "SIMD optimized" code frequently surprises by running slower than the original.
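As a rough illustration of the structure-of-arrays point, the sketch below contrasts the two layouts in plain C. The type and function names (`PointAoS`, `PointsSoA`, `scale_x_aos`, `scale_x_soa`) are hypothetical and exist only for this example. It uses no intrinsics, and whether a compiler actually vectorizes either loop depends on the compiler and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structures: the x, y, z of one point sit next to each other,
   so consecutive x values are 3 floats apart in memory. */
typedef struct {
    float x, y, z;
} PointAoS;

/* Structure-of-arrays: all x values are contiguous, as are all y and all z.
   (Hypothetical type, for illustration only.) */
typedef struct {
    float *x;
    float *y;
    float *z;
    size_t count;
} PointsSoA;

/* Scale every x in AoS data: a strided access pattern. */
static void scale_x_aos(PointAoS *pts, size_t count, float factor) {
    assert(pts != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        pts[i].x *= factor;
    }
}

/* Scale every x in SoA data: a unit-stride stream, the shape that
   vector loads and stores handle most directly. */
static void scale_x_soa(PointsSoA *pts, float factor) {
    assert(pts != NULL);
    for (size_t i = 0; i < pts->count; ++i) {
        pts->x[i] *= factor;
    }
}
```

Both functions compute the same result; only the memory layout differs. With the SoA layout, four consecutive `x` values can be fetched with a single 128-bit load, whereas the AoS layout needs a gather or shuffle step first.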

2540 questions
2
votes
1 answer

Trying to add an __m128 using an and mask in SSE programming

I am trying to use the result of a compare operation to add to an SSE variable. I have just realised that when using the _mm_cmplt_ps operation, if the result is true it returns a NaN because 0xffffffff can't be represented, which is of no use to…
user1850254
  • 2,091
  • 3
  • 16
  • 17
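For readers unfamiliar with compare masks, here is a minimal scalar sketch of the idea behind this question, assuming the usual SSE convention that a true comparison yields an all-ones lane (0xFFFFFFFF) and a false one yields all zeros. The helper names (`lane_cmplt`, `lane_and`, `masked_add`) are hypothetical and model a single lane in portable C. They are not SSE intrinsics, and this is not taken from the question's answer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of one lane of a less-than compare:
   all ones when a < b, all zeros otherwise. */
static uint32_t lane_cmplt(float a, float b) {
    return (a < b) ? 0xFFFFFFFFu : 0u;
}

/* Scalar model of one lane of a bitwise AND between a mask and a float.
   The float is treated as raw bits, never as a number. */
static float lane_and(uint32_t mask, float value) {
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    bits &= mask;               /* all ones keeps value, all zeros gives +0.0f */
    float out;
    memcpy(&out, &bits, sizeof out);
    return out;
}

/* Computes acc + (a < b ? addend : 0.0f) without branching on the data. */
static float masked_add(float acc, float a, float b, float addend) {
    return acc + lane_and(lane_cmplt(a, b), addend);
}
```

The all-ones pattern reads as a NaN only when interpreted as a float. Used as a bit mask, it selects the addend unchanged, and the all-zeros pattern selects +0.0f. With real intrinsics the same shape would combine `_mm_cmplt_ps`, `_mm_and_ps`, and `_mm_add_ps` across four lanes at once.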
2
votes
1 answer

Using XMVECTOR from DirectXMath as a class member causes a crash only in Release Mode?

I've been trying to use XMVECTOR as a class member for a bounding box, since I do a lot of calculations, but I use the XMFLOAT3 only once per frame, so the bounding box has a method that gives me its center in an XMFLOAT3, otherwise it stays in a…
ulak blade
  • 2,515
  • 5
  • 37
  • 81
2
votes
4 answers

SIMD intrinsics - are they usable on gpus?

I'm wondering if I can use SIMD intrinsics in GPU code such as a CUDA kernel or an OpenCL one. Is that possible?
Johnny Pauling
  • 12,701
  • 18
  • 65
  • 108
2
votes
1 answer

Neon VLD consuming more cycles than what is expected?

I have simple asm code that loads 12 NEON quad registers, with pairwise add instructions interleaved with the load instructions (to exploit the dual-issue capability). I have verified the code…
nguns
  • 440
  • 6
  • 21
2
votes
1 answer

Forcing automatic vectorization with GCC

Here is my very simple question. With ICC I know it is possible to use #pragma SIMD to force vectorization of loops that the compiler chooses not to vectorize. Is there something analogous in GCC? Or, is there any plan to add this feature in a future…
2
votes
1 answer

xmm instructions - segmentation fault with memory source operand

I'm trying to add 4 numbers to 4 other numbers in assembly language with SSE2 instructions, using XMM registers. I did succeed, but I came across something I didn't understand. If I do the addition this way: movdqu xmm0, oword [var1] movdqu xmm1,…
Catalin Vasile
  • 367
  • 5
  • 17
2
votes
2 answers

SSE operation on 4 arrays of integer size

Sorry for the previous non-descriptive question. Please allow me to rephrase the question: The setup: I need to do ADD and some bitwise operations on 4 32-bit values from 4 arrays at the same time using SSE. All the element in these 4 arrays…
fiftyplus
  • 561
  • 10
  • 18
2
votes
3 answers

assembly intrinsic to do a masked load

int main() { const int STRIDE=2,SIZE=8192; int i=0; double u[SIZE][STRIDE]; #pragma vector aligned for(i=0;i
arunmoezhi
  • 3,082
  • 6
  • 35
  • 54
2
votes
2 answers

Avoiding invalid memory load with SIMD instructions

I am loading elements from memory using SIMD load instructions, let's say using AltiVec, assuming aligned addresses: float X[SIZE]; vector float V0; unsigned FLOAT_VEC_SIZE = sizeof(vector float); for (int load_index =0; load_index < SIZE;…
fsheikh
  • 416
  • 3
  • 12
2
votes
2 answers

Fast Saturate and shift two Halfwords in ARM asm

I have two signed 16-bit values in a 32-bit word, and I need to shift them right (divide) by a constant value (it can be from 1 to 6) and saturate to byte (0..0xFF). For example, 0x FFE1 00AA with shift=5 must become 0x 0000 0005; 0x 2345 1234 must…
zxcat
  • 2,054
  • 3
  • 26
  • 40
2
votes
0 answers

SSE floating point dot product for dummies

I have read many SO questions about SSE/SIMD (e.g., Getting started with SSE), but I'm still confused by all of it. All I want is a dot product between two double precision floating-point vectors, in C (C99 FWIW). I'm using GCC. Can someone post a…
purple51
  • 319
  • 1
  • 8
2
votes
1 answer

Are arrays initialized like `float[10][10]` already memory aligned for SIMD/SSE?

I need to optimize my matrix multiplication by using SIMD/Intel SSE. The example code given looks like: *x = (float*)memalign(16, size * sizeof(float)); However, I am using C++ and found that instead of malloc (before doing SIMD), I should…
Jiew Meng
  • 84,767
  • 185
  • 495
  • 805
2
votes
1 answer

ROS (Robot Operating System) with SSSE3 flag

I started working with ROS lately and got stuck on one problem. I need to use some classes which require SSE2, SSE3 and SSSE3 CPU extensions. I tried to edit the manifest.xml file of my ROS Package like
SolvedForHome
  • 152
  • 1
  • 15
2
votes
1 answer

Is it possible to execute MIMD with OpenCL framework?

Soon enough we will have nVidia GTX 300 that would be able to execute multiple instructions on multiple data (MIMD). I wonder if OpenCL can execute MIMD?
Roman Kagan
  • 10,440
  • 26
  • 86
  • 126
2
votes
1 answer

How to align 16-bit ints for use with SSE intrinsics

I am working with two-dimensional arrays of 16-bit integers defined as int16_t e[MAX_SIZE*MAX_NODE][MAX_SIZE]; int16_t C[MAX_SIZE][MAX_SIZE]; where MAX_SIZE and MAX_NODE are constant values. I'm not a professional programmer, but somehow with the…
SMir
  • 650
  • 1
  • 7
  • 19