
I am stuck with matrix multiplication on CUDA. The resulting product matrix is always zero. I have read some sample codes like matrix multiplication in cuda to resolve my problem, but all in vain.

Apart from the erratic result of 0, the maximum size of "Width" (code below) is not even 512. I was not able to debug where the problem lies. Maybe we can discuss it here on Stack Overflow.

I am referring to "Programming Massively Parallel Processors"

#include<cuda.h>
#include<stdio.h>

int main(void) {
    void MatrixMultiplication(float *, float *, float *, int);
    const int Width = 5;
    float M[Width*Width], N[Width*Width], P[Width*Width];
    for(int i = 0; i < (Width*Width) ; i++) {
        M[i] = 5;
        N[i] = 5;
        P[i] = 0;
    }
    MatrixMultiplication(M, N, P, Width);
    for(int i = 0; i < (Width*Width) ; i++) {
        printf("%d \n", P[i]);
    }
    int quit;
    scanf("%d",&quit);
    return 0;
}

//Matrix multiplication kernel - thread specification
__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width) {
    //2D Thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    //Pvalue stores the Pd element that is computed by the thread
    float Pvalue = 0;

    for(int k = 0; k < Width ; ++k) {
        float Mdelement = Md[ty*Width + k];
        float Ndelement = Nd[k*Width + tx];
        Pvalue += (Mdelement*Ndelement);
    }

    Pd[ty*Width + tx] = Pvalue;
}

void MatrixMultiplication(float *M, float *N, float *P, int Width) {
    int size = Width*Width*sizeof(float);
    float *Md, *Nd, *Pd;

    //Transfer M and N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md,M,size,cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd,N,size,cudaMemcpyHostToDevice);

    //Allocate P on the device
    cudaMalloc((void**)&Pd,size);

    //Setup the execution configuration
    dim3 dimBlock(Width,Width);
    dim3 dimGrid(1,1);

    //Launch the device computation threads!
    MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);

    //Transfer P from device to host
    cudaMemcpy(P,Pd,size,cudaMemcpyDeviceToHost);

    //Free device matrices
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}
Gaurav Kalra
  • To get the proper code formatting, you need to indent all code with 4 spaces. You can do this easily by highlighting your code and pressing `Ctrl + K`. – Jeff Mercado Feb 16 '11 at 20:55
  • Thanks Jeff! Was just going to do that – Gaurav Kalra Feb 16 '11 at 21:19
  • If you needn't stick to your own code, the CUDA C Programming Guide has a wonderful matrix-mul implementation that can handle matrices with other dimensions than powers of two and is optimized using shared memory. Highly recommend it for real world use and for learning. – Dave O. Feb 17 '11 at 15:31
  • @dave drops of water make a mighty ocean. To reach a level you have to go step by step. e.g. if I directly use the code given in Appendix A, I might never come to know what __syncthreads() is - not considering the friendly syntax ;-) – Gaurav Kalra Feb 17 '11 at 18:48

3 Answers


You were doing fine until this point:

for(int i = 0; i < (Width*Width) ; i++) {
    printf("%d \n", P[i]);
}

I changed it to %f (because it's a float) and they all print nicely :)

$ ./test.exe
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
125.000000
ardiyu07

I figured out what was wrong. Let's analyze it:

Point 1: The quest to remove the ever-monotonic "zero value"

As noted, you must replace printf("%d \n", P[i]); with printf("%f \n", P[i]);

Point 2: Why does the program fail when Width is 512?

Actually, it will fail for even a small value such as 23. Why? Because 23*23 = 529, which is greater than 512 (the maximum number of threads a GPU can have per block, as of today!)

Gaurav Kalra

In your MatrixMulKernel function, your for loop is like

for(int k = 0; k < Width ; ++k) 
{
    //rest of code      
}

Instead of Width, you must use Width*Width as your array is of size Width*Width.

Algorithmist
  • The whole point of using CUDA parallelism is to eliminate the computational overhead. In this case, each thread is responsible for only 1 result of the product matrix. One result (element) of the product matrix can be found using "Width" iterations. So Width*Width is not going to work in any case. – Gaurav Kalra Feb 16 '11 at 21:22