Data Size to Instructions per Warp relationship in CUDA

Question

I tried to see the number of instructions executed in a kernel when the size of the data type changed

In order to get a custom sized data structure I created a struct as following,

#define DATABYTES 40

__host__ __device__
struct floatArray
{
    float a[DATABYTES/4];
};

And then created a kernel just to copy the above datatype array from one array to another

__global__
void copy_large_data(floatArray * d_in, floatArray * d_out)
{
    d_out[threadIdx.x] = d_in[threadIdx.x];
}

Then invoked the above kernel for only 32 threads with a single block

floatArray * d_in;
floatArray * d_out;

cudaMalloc(&d_in, 32 * sizeof(floatArray));
cudaMalloc(&d_out, 32 * sizeof(floatArray));

copy_large_data<<<1, 32>>>(d_in, d_out);

When I profile the program with using the nvprof and checked for the instructions per warp I could see that the parameter value changes with the change of the value of DATABYTES.

My questions is, whether the reason for this instruction count increase is due to the array inside the floatArray struct. Because when we call the copy in the kernel, it actually expands and copy each element of the array a inside the floatArray struct, creating a more instructions.

Is there a way to copy the a custom struct variable in a kernel using a single instruction?

havogt · Accepted Answer · 2016-06-17T07:28:01.703

You are right in your assumption that the number of copy instruction increases when you change the size of the array. You can inspect this in PTX code and in assembly as I will show below.

The maximum size of a load/store instruction is 128-bit, see e.g. here. That means for your case you can still improve by a factor of 4 by using float4 instead of a float.

Alternatively you can specify the alignment of your data structure explicitly as explained in the programming guide:

#define DATABYTES 32
struct __align__(16) floatArray
{
    float a[DATABYTES/4];
};

To see the PTX code generate an object file nvcc -c ... and use cubobjdump --dump-ptx objfile.o. For your example the relevant part looks like this:

ld.global.f32 %f1, [%rd7];
ld.global.f32 %f2, [%rd7+4];
ld.global.f32 %f3, [%rd7+8];
ld.global.f32 %f4, [%rd7+12];
ld.global.f32 %f5, [%rd7+16];
ld.global.f32 %f6, [%rd7+20];
ld.global.f32 %f7, [%rd7+24];
ld.global.f32 %f8, [%rd7+28];
ld.global.f32 %f9, [%rd7+32];
ld.global.f32 %f10, [%rd7+36];
st.global.f32 [%rd6+36], %f10;
st.global.f32 [%rd6+32], %f9;
st.global.f32 [%rd6+28], %f8;
st.global.f32 [%rd6+24], %f7;
st.global.f32 [%rd6+20], %f6;
st.global.f32 [%rd6+16], %f5;
st.global.f32 [%rd6+12], %f4;
st.global.f32 [%rd6+8], %f3;
st.global.f32 [%rd6+4], %f2;
st.global.f32 [%rd6], %f1;

If you increase the array further you will find a point where the compiler will choose to loop instead of emitting an instruction for each load/store.

Accordingly you can inspect the assembly with cubobjdump --dump-sass objfile.o

So what you say is that the maximum datasize of the data that we can transfer using a single instruction is 128 bits? — BAdhi, Jun 16 '16 at 15:11
Yes, but keep in mind that transactions within a warp can be coalesced to 128 byte transactions. — havogt, Jun 16 '16 at 15:17
You mean when using the L1 cache right? because I think if we use the L2 cache only, the transaction size is reduced to 32 bytes — BAdhi, Jun 16 '16 at 15:19
Thanks for the info about the `cuobjdump`. I checked the instructions when using `float4` and was able to observe what you mentioned, where only single instruction was given to load all the x,y,z,w variables `ld.global.v4.f32 {%f1, %f2, %f3, %f4}, [%rd7];` — BAdhi, Jun 17 '16 at 04:42
Still if I create a custom struct with 4 floats variables wrapped inside, the ptx code shows that 4 separate instructions are issued to load the complete structure. — BAdhi, Jun 17 '16 at 04:50

Data Size to Instructions per Warp relationship in CUDA

1 Answers1