I am studying CUDA C, and the source I am using relies on the CUDA sample programs, specifically the matrix multiplication example.
I am following the code line by line and trying to predict the next step, to be sure I understand the code.
While doing this I found the struct declaration of Matrix, which has a data member called stride.
The whole code has not a single line initializing this stride data member on the host.
I used Nsight to debug the device code and the normal VS debugger to debug the host code, and there was a surprise:
the host code really never initializes this data member before the program ends successfully,
but Nsight shows that stride is initialized even before the first kernel line.
When I looked at the Autos window of the VS debugger at the call to the kernel, I noticed that the kernel's function-name line shows a __cuda_0 matrix with the same structure as the program's Matrix struct, but with stride initialized.
So I do not know when and by whom this stride variable is initialized in the device code.
Thanks a lot.
This is the struct for the matrix:
typedef struct
{
    int width;
    int height;
    float* elements;
    int stride;
} Matrix;
This is the main code, which initializes the matrices without stride:
int main(int argc, char* argv[])
{
    Matrix A, B, C;
    int a1, a2, b1, b2;
    a1 = atoi(argv[1]); /* Height of A */
    a2 = atoi(argv[2]); /* Width of A */
    b1 = a2;            /* Height of B */
    b2 = atoi(argv[3]); /* Width of B */
    A.height = a1;
    A.width = a2;
    A.elements = (float*)malloc(A.width * A.height * sizeof(float));
    B.height = b1;
    B.width = b2;
    B.elements = (float*)malloc(B.width * B.height * sizeof(float));
    C.height = A.height;
    C.width = B.width;
    C.elements = (float*)malloc(C.width * C.height * sizeof(float));
    for (int i = 0; i < A.height; i++)
        for (int j = 0; j < A.width; j++)
            A.elements[i*A.width + j] = (rand() % 3); // arc4random
    for (int i = 0; i < B.height; i++)
        for (int j = 0; j < B.width; j++)
            B.elements[i*B.width + j] = (rand() % 2); // arc4random
    MatMul(A, B, C);
The whole code is present in the CUDA C Programming Guide, chapter 3.2.3.
OK, I got -4 so far, so maybe the purpose of the question is not clear:
In the MatMul host function there are lines that declare and initialize the device copies of the matrices, and it uses A.width to initialize d_A.stride:
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width;
    d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
But when you get to:
    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
it invokes MatMulKernel, and in this device code (which depends only on device memory) you find these lines:
// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
which takes Matrix A as an argument... and here I see the reason for my confusion!
MatMulKernel uses the name A to refer to the d_A matrix passed to it,
so later on, at these lines:
// Get sub-matrix Asub of A
Matrix Asub = GetSubMatrix(A, blockRow, m);
it calls another device function, GetSubMatrix, passing it A (which is really d_A); then, in the GetSubMatrix code, it uses A.stride, which is really d_A.stride:
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width  = BLOCK_SIZE;
    Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;   // <-- here the already-initialized stride is propagated
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                + BLOCK_SIZE * col];
    return Asub;
}
So the host code's struct really does not initialize A.stride,
and there is no hidden mechanism in CUDA that deduces A.stride from a matrix-like structure;
the use of the name A in both the host code and the device code for two different matrices led to my confusion.
Problem solved.