Why does PTX shows 32 bit load operation for a 128 bit struct assignment?

Question

I defined custom structs of 128 bits like this-

typedef struct dtype{
int val;
int temp2;
int temp3;
int temp4;
}dtype;

Then I performed an assignment :-

dtype temp= h_a[i]; //where h_a is dtype *

I was expecting a 128 bit load but instead PTX showed what appears like a 32 bit load operation-

mul.wide.s32    %rd4, %r18, 16;
add.s64         %rd5, %rd1, %rd4;
ld.global.u32   %r17, [%rd5];

Shouldn't it appear like ld.global.v4.u32 %r17, [%rd5];

Where am I going wrong?

typedef struct __align__(16) dtype{ int val,temp2,temp3,temp4; }dtype; Maybe I am getting the wrong thing. Is this what you are suggesting? — Abhishek Tyagi, Sep 17 '20 at 10:33

talonmies · Accepted Answer · 2020-09-17T15:35:48.817

The compiler will only emit vectorized load or store instructions if the memory is guaranteed to be aligned to the size of the type, and all the elements of the type are used (otherwise the vector instruction will be optimized away to a scalar instruction to save bandwidth).

If you do this:

struct dtype{
int val;
int temp2;
int temp3;
int temp4;
};

struct __align__ (16) adtype{
int val;
int temp2;
int temp3;
int temp4;
};

__global__
void kernel(adtype* x, dtype* y)
{
    adtype lx = x[threadIdx.x];
    dtype ly;
    ly.val = lx.temp4;
    ly.temp2 = lx.temp3;
    ly.temp3 = lx.val;
    ly.temp4 = lx.temp2;

    y[threadIdx.x] = ly;
}

you should get something like this:

visible .entry _Z6kernelP6adtypeP5dtype(
        .param .u64 _Z6kernelP6adtypeP5dtype_param_0,
        .param .u64 _Z6kernelP6adtypeP5dtype_param_1
)
{

        ld.param.u64    %rd1, [_Z6kernelP6adtypeP5dtype_param_0];
        ld.param.u64    %rd2, [_Z6kernelP6adtypeP5dtype_param_1];
        cvta.to.global.u64      %rd3, %rd2;
        cvta.to.global.u64      %rd4, %rd1;
        mov.u32         %r1, %tid.x;
        mul.wide.u32    %rd5, %r1, 16;
        add.s64         %rd6, %rd4, %rd5;
        ld.global.v4.u32        {%r2, %r3, %r4, %r5}, [%rd6];
        add.s64         %rd7, %rd3, %rd5;
        st.global.u32   [%rd7], %r5;
        st.global.u32   [%rd7+4], %r4;
        st.global.u32   [%rd7+8], %r2;
        st.global.u32   [%rd7+12], %r3;
        ret;
}

Here you can clearly see the vectorized load for the aligned type, and the non-vectorized store for the non-aligned type. If the kernel is changed so that the store is to the aligned version:

__global__
void kernel(adtype* x, dtype* y)
{
    dtype ly = y[threadIdx.x];
    adtype lx;
    lx.val = ly.temp4;
    lx.temp2 = ly.temp3;
    lx.temp3 = ly.val;
    lx.temp4 = ly.temp2;

    x[threadIdx.x] = lx;
}

you will get this:

.visible .entry _Z6kernelP6adtypeP5dtype(
        .param .u64 _Z6kernelP6adtypeP5dtype_param_0,
        .param .u64 _Z6kernelP6adtypeP5dtype_param_1
)
{

        ld.param.u64    %rd1, [_Z6kernelP6adtypeP5dtype_param_0];
        ld.param.u64    %rd2, [_Z6kernelP6adtypeP5dtype_param_1];
        cvta.to.global.u64      %rd3, %rd1;
        cvta.to.global.u64      %rd4, %rd2;
        mov.u32         %r1, %tid.x;
        mul.wide.u32    %rd5, %r1, 16;
        add.s64         %rd6, %rd4, %rd5;
        add.s64         %rd7, %rd3, %rd5;
        ld.global.u32   %r2, [%rd6+12];
        ld.global.u32   %r3, [%rd6+8];
        ld.global.u32   %r4, [%rd6+4];
        ld.global.u32   %r5, [%rd6];
        st.global.v4.u32        [%rd7], {%r2, %r3, %r5, %r4};
        ret;
}

Now the aligned type is stored with a vectorized instruction.

[ All code compiled for sm_53 using the default Godbolt toolchain (10.2) ]

score 0 · Answer 2 · answered Sep 17 '20 at 11:04

I am adding an additional point in case anyone happens to be facing the same issue.

{
        dtype temp = h_a[i];                  //Loading data  exactly needed

        sum.val += temp.val;
}

I followed the steps given in the above^^ answer however I was not getting a 128 bit load although the above approach is absolutely correct.

The thing is that the compiler saw that out of the 4 fields of the struct, I was using only 1 field in some addition operation. So it very smartly only loaded the chunk which I needed. So no matter what I tried, I was always getting a 32 bit load.

{
        dtype temp = h_a[i];                  //Loading data  exactly needed

        sum.val += temp.val;
        sum.temp2 += temp.temp2;
        sum.temp3 += temp.temp3;
        sum.temp4 += temp.temp4;
}

A little change. Now I am using all the fields. So the compiler loaded all the fields! Yes, now using the approach given in the above ^^ answer, using __align __(16) I got the correct 128 bit load. Although this maybe very obvious for many people, but I am not a veteran coder. I only use coding in certain places to work out my projects. This was seriously insightful for me and I hope someone gets benefited by this also!

You will be happy to hear that the CUDA compiler has included this optimization for a a dozen years or so. Your note also highlights why it is important to provide a minimal *complete* example in questions. The effect you observed here couldn't reasonably be anticipated by site participants based on the code provided. — njuffa, Sep 17 '20 at 11:16

Why does PTX shows 32 bit load operation for a 128 bit struct assignment?

2 Answers2