assembly intrinsic to do a masked load

Question

#include <stdio.h>

int main()
{
    const int STRIDE=2,SIZE=8192;
    int i=0;
    double u[SIZE][STRIDE]; 
    #pragma vector aligned
    for(i=0;i<SIZE;i++)
    {
        u[i][STRIDE-1]= i;
    }
    printf("%lf\n",u[7][STRIDE-1]);
    return 0;
}

The compiler uses xmm registers here. The access has stride 2, and I want the compiler to ignore that: do a regular (contiguous) load from memory and then mask alternate elements, so that I would be using only 50% of each SIMD register. I need intrinsics that can load a register and then mask it bitwise before storing it back to memory.

P.S.: I have never done assembly coding before.

Note that there is a mistake in your code. You access `u[i][STRIDE]`, which is the same as `u[i][2]`. The `2` is wrong: you can only access `u[i][0]` or `u[i][1]`. The access to `u[i][2]` probably goes to `u[i+1][0]`, except when `i==SIZE-1`, where it accesses beyond the end of the array. — William Morris, Nov 03 '12 at 02:17
Thanks for pointing that out. I was in the Fortran world for quite some time, so I got rusty with C. — arunmoezhi, Nov 03 '12 at 06:12

score 2 · Accepted Answer · answered Nov 07 '12 at 22:38

What I needed is a masked store with a mask value of 0xAA (binary 10101010), which, with one mask bit per element, writes only every other element.
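To make the mask concrete, here is a minimal scalar sketch of what an 8-lane masked store with mask 0xAA would do. It assumes one mask bit per element, with bit 0 controlling element 0. `masked_store8` is a hypothetical helper written for illustration, not a real intrinsic, and the answer does not say which instruction set was actually used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of an 8-lane masked store (not a real intrinsic):
   element k of src is written to dst only if bit k of mask is set. */
void masked_store8(double *dst, const double *src, unsigned char mask)
{
    assert(dst != NULL && src != NULL);
    for (int k = 0; k < 8; k++) {
        if (mask & (1u << k)) {
            dst[k] = src[k];   /* lane selected by the mask */
        }                      /* otherwise dst[k] is left untouched */
    }
}
```

With mask 0xAA, lanes 1, 3, 5 and 7 are written and lanes 0, 2, 4 and 6 keep their old contents. Viewing `u` as a flat array of doubles, those are exactly the `u[i][1]` elements.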

arunmoezhi


score 0 · Answer 2 · answered Nov 03 '12 at 10:00

With SSE you can't do a masked load (only a masked store); AVX adds VMASKMOVPD/VMASKMOVPS, which can do both. The easiest alternative would be to do a regular load and then mask it yourself (e.g. using intrinsics).
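As a rough sketch of "load, then mask it yourself", the following uses SSE2 intrinsics to load each pair of doubles, replace only the high element, and store the pair back. It assumes an x86 target with SSE2. The helper name `set_second_of_each_pair` and the choice of writing `(double)i` are illustrative assumptions taken from the question's loop, not part of this answer.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: for each pair (u[2*i], u[2*i+1]) write (double)i into
   the second element while leaving the first element unchanged. */
void set_second_of_each_pair(double *u, size_t npairs)
{
    assert(u != NULL);
    /* All-ones in the high 64-bit lane, zero in the low lane. */
    const __m128d hi_mask = _mm_castsi128_pd(_mm_set_epi64x(-1, 0));

    for (size_t i = 0; i < npairs; i++) {
        __m128d old  = _mm_loadu_pd(u + 2 * i);      /* regular 128-bit load */
        __m128d val  = _mm_set1_pd((double)i);       /* new value in both lanes */
        __m128d keep = _mm_andnot_pd(hi_mask, old);  /* old low lane, high lane cleared */
        __m128d put  = _mm_and_pd(hi_mask, val);     /* new high lane, low lane cleared */
        _mm_storeu_pd(u + 2 * i, _mm_or_pd(keep, put));
    }
}
```

For the question's array this would be called as `set_second_of_each_pair(&u[0][0], SIZE)`. Note that it is a read-modify-write of both elements of every pair, so it moves twice as much data as the scalar loop.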

A potentially better alternative would be to change your array to `double u[STRIDE][SIZE];` so that you don't need to mask anything and don't end up with half an XMM register wasted/masked.
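A minimal sketch of the transposed layout, reusing the question's `STRIDE` and `SIZE`, might look like this. The function name `fill_last_row` is a hypothetical choice; whether the compiler actually vectorizes the loop depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

enum { STRIDE = 2, SIZE = 8192 };

/* Transposed layout: u[STRIDE-1][0..SIZE-1] is contiguous in memory,
   so the loop below stores with unit stride. */
void fill_last_row(double u[STRIDE][SIZE])
{
    assert(u != NULL);
    for (int i = 0; i < SIZE; i++) {
        u[STRIDE - 1][i] = i;   /* unit-stride store, no masking needed */
    }
}
```

The element that was `u[7][STRIDE-1]` in the original layout is `u[STRIDE-1][7]` here.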

Brendan


When there is stride-2 access, it doesn't matter whether you use float or double: you will end up using only 50% of the registers if this method is used. Also, the array structure is `u[SIZE][STRIDE]`. – arunmoezhi Nov 03 '12 at 16:58

score 0 · Answer 3 · answered Nov 12 '12 at 17:55

Without AVX, half a SIMD register is only one double anyway, so there seems little wrong with regular 64-bit stores.
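One way to read "regular 64-bit stores" is to compute two values in an XMM register and store each half separately to its strided location. The sketch below assumes an x86 target with SSE2. `fill_strided` is a hypothetical helper that mirrors the question's loop on a flat view of the array.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: u holds n pairs of doubles; write (double)i to the
   second element of pair i using plain 64-bit stores from an XMM register. */
void fill_strided(double *u, size_t n)
{
    assert(u != NULL);
    __m128d idx = _mm_set_pd(1.0, 0.0);      /* low lane = 0, high lane = 1 */
    const __m128d two = _mm_set1_pd(2.0);
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        _mm_storel_pd(&u[2 * i + 1], idx);   /* low lane  -> pair i     */
        _mm_storeh_pd(&u[2 * i + 3], idx);   /* high lane -> pair i + 1 */
        idx = _mm_add_pd(idx, two);          /* advance both indices by 2 */
    }
    if (i < n) {
        u[2 * i + 1] = (double)i;            /* scalar tail when n is odd */
    }
}
```

Each store touches only the 8 bytes it needs, so the first element of every pair is never read or written.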

If you want to use masked stores (MASKMOVDQU/MASKMOVQ), note that they bypass the cache and write directly to DRAM, just like non-temporal stores such as MOVNTPS. This may or may not be what you want. If the data fits in cache and you plan to read it soon, it is likely better not to use them.
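For reference, MASKMOVDQU is exposed as the SSE2 intrinsic `_mm_maskmoveu_si128`. The following sketch stores only the high double of a register and assumes an x86 target with SSE2. The wrapper name `store_high_double_nt` is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_maskmoveu_si128 (MASKMOVDQU) */

/* Hypothetical helper: store only the high double of v to pair[1],
   leaving pair[0] untouched, using a byte-masked non-temporal store. */
void store_high_double_nt(double *pair, __m128d v)
{
    assert(pair != NULL);
    /* MASKMOVDQU writes a byte when the top bit of its mask byte is set:
       bytes 8..15 (the high double) are selected, bytes 0..7 are skipped. */
    const __m128i mask = _mm_set_epi64x(-1, 0);
    _mm_maskmoveu_si128(_mm_castpd_si128(v), mask, (char *)pair);
    _mm_sfence();  /* order the non-temporal store before later stores */
}
```

Because the store is non-temporal, this is mainly of interest when the data will not be read again soon, as noted above.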

Certain AMD processors can do a 64-bit non-temporal store from an XMM register using MOVNTSD; this may simplify things slightly compared to MASKMOVDQU.
