#include <stdio.h>

int main()
{
    const int STRIDE=2,SIZE=8192;
    int i=0;
    double u[SIZE][STRIDE]; 
    #pragma vector aligned
    for(i=0;i<SIZE;i++)
    {
        u[i][STRIDE-1]= i;
    }
    printf("%lf\n",u[7][STRIDE-1]);
    return 0;
}

The compiler uses XMM registers here. The access has stride 2, and I want the compiler to ignore that: do a regular load from memory and then mask out alternate elements, so that I would be using 50% of each SIMD register. I need intrinsics that can load a register and then mask it bitwise before storing it back to memory.

P.S. I have never done assembly coding before.

arunmoezhi
  • Note that there is a mistake in your code. You access `u[i][STRIDE]`, which is the same as `u[i][2]`. The `2` is wrong: you can only access `u[i][0]` or `u[i][1]`. The access to `u[i][2]` probably goes to `u[i+1][0]`, except when `i==SIZE-1`, where it accesses beyond the end of the array. – William Morris Nov 03 '12 at 02:17
  • Thanks for pointing that out. I was in the Fortran world for quite some time, so I got rusty with C. – arunmoezhi Nov 03 '12 at 06:12

3 Answers


A masked store with a mask value of 0xAA (binary 10101010), so that only alternate elements are written.
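To make the effect of the 0xAA mask concrete, here is a minimal scalar sketch of what an 8-lane masked store would do. The answer does not name a specific instruction, so this assumes the convention that bit k of the mask enables lane k (as in 8-lane mask registers); `masked_store8` and `fill_stride2` are hypothetical names, and the array is passed as a flat pointer to `u[0][0]`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: scalar emulation of an 8-lane masked store.
   src[k] is written to dst[k] only if bit k of mask is set (assumed
   convention, not taken from the original answer). */
void masked_store8(double *dst, const double *src, unsigned mask)
{
    for (size_t k = 0; k < 8; k++) {
        if (mask & (1u << k)) {
            dst[k] = src[k];
        }
    }
}

/* Set u[i][1] = i for a u[n][2] array viewed as 2*n contiguous doubles.
   Each step covers 4 rows (8 doubles), so n must be a multiple of 4. */
void fill_stride2(double *flat, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        double lanes[8];
        for (size_t k = 0; k < 8; k++) {
            lanes[k] = (double)(i + k / 2);  /* row index, in both columns */
        }
        /* 0xAA = 10101010b: only odd lanes, i.e. the u[..][1] slots. */
        masked_store8(flat + 2 * i, lanes, 0xAAu);
    }
}
```

With the question's array, the call would be `fill_stride2(&u[0][0], SIZE)`; the even lanes (`u[i][0]`) are never written.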

arunmoezhi

With SSE you can't do a masked load (only a masked store); masked loads only arrived with AVX (VMASKMOVPD/VMASKMOVPS). The easiest alternative would be to do a full load and then mask it yourself (e.g. using intrinsics).
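As a rough illustration of "load and then mask it yourself", the sketch below uses SSE2 intrinsics from `<emmintrin.h>` to load a whole row, merge the new value into the upper lane with bitwise AND/ANDNOT/OR, and store the row back. The function name `fill_column1_blend` is hypothetical, the array is assumed to be passed as a flat pointer to `u[0][0]`, and unaligned loads and stores are used so that no alignment assumption is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Hypothetical helper: set u[i][1] = i for a u[n][2] array passed as a
   flat pointer, leaving u[i][0] unchanged. */
void fill_column1_blend(double *flat, size_t n)
{
    assert(flat != NULL);

    /* All-ones in the upper 64-bit lane, zeros in the lower lane. */
    const __m128d high_lane = _mm_castsi128_pd(_mm_set_epi64x(-1LL, 0LL));

    for (size_t i = 0; i < n; i++) {
        double *row = flat + 2 * i;
        __m128d old_row = _mm_loadu_pd(row);        /* {u[i][0], u[i][1]} */
        __m128d new_val = _mm_set1_pd((double)i);   /* {i, i} */
        /* Upper lane from new_val, lower lane from old_row. */
        __m128d merged = _mm_or_pd(_mm_and_pd(high_lane, new_val),
                                   _mm_andnot_pd(high_lane, old_row));
        _mm_storeu_pd(row, merged);
    }
}
```

If the array is known to be 16-byte aligned, `_mm_load_pd` and `_mm_store_pd` could replace the unaligned variants. Note that this approach reads and rewrites `u[i][0]` as well, which a plain 64-bit store would avoid.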

A potentially better alternative would be to change your array to "double u[STRIDE][SIZE];" so that you don't need to mask anything and don't end up with half an XMM register wasted/masked.
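A minimal sketch of the transposed layout, assuming SSE2 and an even element count: with `u[STRIDE][SIZE]`, the elements that were two doubles apart become contiguous, so every store fills a full XMM register. The name `fill_contiguous` is hypothetical; `col` is meant to point at `u[STRIDE-1]`.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Hypothetical helper: write col[i] = i for i in [0, n), two doubles
   per store. Assumes n is even. */
void fill_contiguous(double *col, size_t n)
{
    assert(n % 2 == 0);
    __m128d idx = _mm_set_pd(1.0, 0.0);       /* low lane 0.0, high lane 1.0 */
    const __m128d step = _mm_set1_pd(2.0);
    for (size_t i = 0; i < n; i += 2) {
        _mm_storeu_pd(col + i, idx);          /* writes col[i] and col[i+1] */
        idx = _mm_add_pd(idx, step);
    }
}
```

For the question's sizes, this would be called as `fill_contiguous(u[STRIDE-1], SIZE)` after declaring `double u[STRIDE][SIZE];`.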

Brendan
  • With stride 2 access, it doesn't matter whether you use float or double: you will end up using only 50% of the registers if this method is used. Also, the array structure is `u[SIZE][STRIDE]`. – arunmoezhi Nov 03 '12 at 16:58

Without AVX, half a SIMD register is only one double anyway, so there seems little wrong with regular 64-bit stores.
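One way to read "regular 64-bit stores" is the SSE2 pair `_mm_storel_pd` / `_mm_storeh_pd`, which write the lower and upper double of an XMM register separately. The sketch below computes two values per register and scatters them to `u[i][1]` and `u[i+1][1]`; the function name `fill_column1_halves` and the flat-pointer view of the array are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Hypothetical helper: set u[i][1] = i for a u[n][2] array passed as a
   flat pointer, using one 64-bit store per element. Assumes n is even. */
void fill_column1_halves(double *flat, size_t n)
{
    assert(n % 2 == 0);
    __m128d idx = _mm_set_pd(1.0, 0.0);       /* low lane 0.0, high lane 1.0 */
    const __m128d step = _mm_set1_pd(2.0);
    for (size_t i = 0; i < n; i += 2) {
        _mm_storel_pd(flat + 2 * i + 1, idx);        /* u[i][1]   = low lane  */
        _mm_storeh_pd(flat + 2 * (i + 1) + 1, idx);  /* u[i+1][1] = high lane */
        idx = _mm_add_pd(idx, step);
    }
}
```

Unlike a load, mask, and store sequence, this never touches `u[i][0]`.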

If you want to use masked stores (MASKMOVDQU/MASKMOVQ), note that they carry a non-temporal hint: they bypass the cache and write directly to DRAM, as non-temporal stores such as MOVNTPS do. This may or may not be what you want. If the data fits in cache and you plan to read it soon, it is likely better not to use them.
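For reference, MASKMOVDQU is exposed in C as the SSE2 intrinsic `_mm_maskmoveu_si128`, which writes byte k of the source only when the top bit of mask byte k is set. The following is a sketch under the same assumptions as the question (a `u[n][2]` array passed as a flat pointer); `store_high_lane_nt` and `fill_column1_maskmov` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Hypothetical helper: write only the upper double of `value` to row[1],
   leaving row[0] untouched, using a byte-masked non-temporal store. */
void store_high_lane_nt(double *row, __m128d value)
{
    assert(row != NULL);
    /* Top bit set in bytes 8..15 (upper lane), clear in bytes 0..7. */
    const __m128i byte_mask = _mm_set_epi64x(-1LL, 0LL);
    _mm_maskmoveu_si128(_mm_castpd_si128(value), byte_mask, (char *)row);
}

/* Set u[i][1] = i for every row of a u[n][2] array. */
void fill_column1_maskmov(double *flat, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        store_high_lane_nt(flat + 2 * i, _mm_set1_pd((double)i));
    }
    _mm_sfence();   /* order the non-temporal stores before later stores */
}
```

Because of the cache-bypassing behaviour described above, this variant is mainly of interest when the array is large and will not be read again soon.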

Certain AMD processors (those supporting SSE4a) can do a 64-bit non-temporal store from an XMM register using MOVNTSD; this may simplify things slightly compared to MASKMOVDQU.

jilles