I am studying CUDA C, and the source I am using relies on the CUDA sample programs, specifically the matrix multiplication example.
I am following the code line by line and trying to predict the next step, to be sure I understand the code.
While doing this I found the struct declaration of Matrix, which has a data member called stride.
The whole code has not a single line initializing this stride data member on the host.
I used Nsight to debug the device code and the normal VS debugger to debug the host code, and there was a surprise:
the host code really never initializes this data member before the program ends successfully,
but Nsight shows that stride is initialized even before the first kernel line.
When I looked at the Autos window of the VS debugger at the call to the kernel, I noticed that the kernel's function-name line shows a __cuda_0 matrix with the same structure as the program's Matrix struct, but with stride initialized.
So I do not know when and by whom this stride variable is initialized in the device code.
Thanks a lot.
This is the struct for the matrix:
typedef struct
{
    int width;
    int height;
    float* elements;
    int stride;
} Matrix;
This is the main code, which initializes the matrices without stride:
int main(int argc, char* argv[])
{
    Matrix A, B, C;
    int a1, a2, b1, b2;
    a1 = atoi(argv[1]); /* Height of A */
    a2 = atoi(argv[2]); /* Width of A */
    b1 = a2;            /* Height of B */
    b2 = atoi(argv[3]); /* Width of B */
    A.height = a1;
    A.width = a2;
    A.elements = (float*)malloc(A.width * A.height * sizeof(float));
    B.height = b1;
    B.width = b2;
    B.elements = (float*)malloc(B.width * B.height * sizeof(float));
    C.height = A.height;
    C.width = B.width;
    C.elements = (float*)malloc(C.width * C.height * sizeof(float));
    for (int i = 0; i < A.height; i++)
        for (int j = 0; j < A.width; j++)
            A.elements[i*A.width + j] = (rand() % 3); // arc4random
    for (int i = 0; i < B.height; i++)
        for (int j = 0; j < B.width; j++)
            B.elements[i*B.width + j] = (rand() % 2); // arc4random
    MatMul(A, B, C);
The whole code is present in the CUDA C Programming Guide, chapter 3.2.3.
OK, I got -4 so far, so maybe the purpose of the question is not clear:
In the MatMul host function there are lines that declare and initialize the device copies of the matrices, and it uses A.width to initialize d_A.stride:
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width;
    d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
But when you get to:
    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
it invokes MatMulKernel, and in this device code (which depends only on device memory) you find these lines:
// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
which takes Matrix A as an argument... and here I see the reason for my confusion!
MatMulKernel uses the name A to refer to the d_A matrix passed to it,
so later on, at these lines:
// Get sub-matrix Asub of A
Matrix Asub = GetSubMatrix(A, blockRow, m);
it calls another device function, GetSubMatrix, passing it A (which is really d_A); then, in the GetSubMatrix code, it uses A.stride, which is really d_A.stride:
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width  = BLOCK_SIZE;
    Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;   // <-- here the already-initialized stride is propagated
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                + BLOCK_SIZE * col];
    return Asub;
}
So the host code's struct really does not initialize A.stride,
and there is no hidden mechanism in CUDA that deduces A.stride from a matrix-like structure;
the use of the name A in both the host code and the device code for two different matrices led to my confusion.
Problem solved.