What I am attempting to do is multiply Matrix A and Matrix B, and then from the product matrix get the index of the maximum value per column. Unfortunately, only the first 128*128 values of the matrix multiplication result are correct, while the rest are just garbage. I do not quite understand why this happens, and I would appreciate some guidance.

#include<stdio.h>
#include "cuda.h"
#include<stdlib.h>

#define blockD 32
const int wA = 128;
const int hA = 4096;    
const int wB = 4096;
const int hB = wA;

int main(void){

    void MatrixMultiplication(float *, float *, float *, float *);

    int size_A = wA * hA * sizeof(float);
    int size_B = wB * hB * sizeof(float);
    int size_C = wB * hA * sizeof(float);
    int size_max = 2 * wB * sizeof(float);
    float *M, *N, *P, *C;   

    // allocate memory on the CPU
    M = (float*)malloc(size_A);
    N = (float*)malloc(size_B);
    P = (float*)malloc(size_max);
    C = (float*)malloc(size_C);

    // initialize the matrices
    for (int y=0; y < hA; y++) {
        for (int x=0; x < wA; x++){
            M[y*wA + x] = 32; //x + y*wA; 
       }
    }

    for (int y=0; y<hB; y++) {
        for (int x=0; x<wB; x++){
            N[y*wB + x] = 21; //x + y*wB; 
       }
    }


    MatrixMultiplication(M, N, P, C);

    //Write
    FILE *f1;
    int i,j;
    f1 = fopen("C.txt","w");
    for(i = hA - 2; i < hA; i++){
        for(j = 0; j < wB; j++){
            fprintf(f1, "%d\t", (int)C[i*wB + j]);
        }
        fprintf(f1, "\n");
    }
    fclose(f1);

    // free the memory allocated on the CPU
    free( M );
    free( N );
    free( P ); 
    free( C );
    cudaDeviceReset();
    return 0;
}


__device__ void MaxFunction(float* Pd, float* max)
{
 int x = (threadIdx.x + blockIdx.x * blockDim.x);  
 int y = (threadIdx.y + blockIdx.y * blockDim.y); 

 int k = 0;

 int temp = 0; int temp_idx = 0;
 for (k = 0; k < wB; ++k) {
            if(Pd[x*wB + k] > temp){
                temp = Pd[x*wB + k];
                temp_idx = x*wB + k;
            }
  }
  max[y*2 + 0] = temp;
  max[y*2 + 1] = temp_idx;
}


__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, float* max)
{
  // declare cache in the shared memory
  __shared__ float Mds[blockD][blockD];
  __shared__ float Nds[blockD][blockD];

  float Pvalue = 0;
  // Loop over the Md and Nd block dimension required to compute the Pd element
  for (int m = (wA * blockD * blockIdx.y), n = (blockD * blockIdx.x); 
                            m < ((wA * blockD * blockIdx.y)+wA-1); 
                                        m += blockD, n += (blockD*hB)){

    // collaboratively loading of Md and Nd blocks into shared memory    
    Mds[threadIdx.y][threadIdx.x] = Md[m + wA * threadIdx.y + threadIdx.x];
    Nds[threadIdx.y][threadIdx.x] = Nd[n + wA * threadIdx.y + threadIdx.x];
    __syncthreads();

    // keep track of the running sum    
    for (int k = 0; k < blockD; k++)
      Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
    __syncthreads();
  }

  // write back to the global memory
  int p = hB * blockD * blockIdx.y + blockD * blockIdx.x;
  Pd[p + hB * threadIdx.y + threadIdx.x] = Pvalue;
  __syncthreads();

  MaxFunction(Pd, max);

}

void MatrixMultiplication(float *M, float *N, float *P, float *C) {

    int size_A = wA * hA * sizeof(float);
    int size_B = wB * hB * sizeof(float);
    int size_C = wB * hA * sizeof(float);
    int size_max = 2 * wB * sizeof(float);
    float *Md, *Nd, *Pd, *max; 

    // allocate memory on the GPU
    cudaMalloc((void**)&Md, size_A);
    cudaMalloc((void**)&Nd, size_B);
    cudaMalloc((void**)&Pd, size_C);
    cudaMalloc((void**)&max, size_max);

    // transfer M and N to device memory
    cudaMemcpy(Md, M, size_A, cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N, size_B, cudaMemcpyHostToDevice);

    // kernel invocation code
    dim3 dimBlock(blockD, blockD);
    dim3 dimGrid(wA/blockD, hB/blockD);

    //Execute Kernel
    MatrixMulKernel<<<dimGrid, dimBlock>>>( Md, Nd, Pd, max);

    // transfer P from device    
    cudaMemcpy(P, max, size_max, cudaMemcpyDeviceToHost);
    cudaMemcpy(C, Pd, size_C, cudaMemcpyDeviceToHost);

    // free the memory allocated on the GPU
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
    cudaFree(max);
}
Krsrb1
  • This is exactly the same code and the same question as in your earlier question. Please don't repost the same question again. – talonmies Aug 15 '13 at 13:49
  • I agree its the same code. But I am unable to find an answer. – Krsrb1 Aug 16 '13 at 07:27
  • That isn't an excuse to post a duplicate question. The key to getting help is editing your existing question to make it easier to answer. Right now your code appears to have two separate problems - the matrix multiplication and the reduction. Choose a problem. Improve the code - I see no CUDA API error checking at all, for example. Are you even sure the code is actually running to completion? Use the tools provided - the debugger, cuda-memcheck. Improve the question with what you find - [SO] is not a free debugging service where we do your work for you. Help us help you... – talonmies Aug 16 '13 at 07:52

1 Answer

Your code seems to have more than one problem. One of them: in place of this:

dim3 dimGrid(wA/blockD, hB/blockD);

You should have this:

dim3 dimGrid(wB/blockD, hA/blockD);

Ultimately you need one thread in your grid for each output point. Your formulation was giving you a grid of 4 blocks by 4 blocks, whereas you need a grid of 128 blocks by 128 blocks.
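
As an aside, if the output dimensions were ever not exact multiples of blockD, a ceiling-division grid along these lines would cover the whole output (the tiled kernel would then also need bounds checks on its loads and store, not shown):

dim3 dimBlock(blockD, blockD);
dim3 dimGrid((wB + blockD - 1)/blockD, (hA + blockD - 1)/blockD);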

The other problem I found with your code was in these lines in the kernel:

int p = hB * blockD * blockIdx.y + blockD * blockIdx.x;
Pd[p + hB * threadIdx.y + threadIdx.x] = Pvalue;

They are not indexing properly through the output array. Rather than try to sort it out using your scheme, I used this instead:

Pd[(threadIdx.x + (blockIdx.x * blockDim.x)) + ((threadIdx.y + (blockIdx.y * blockDim.y))*(gridDim.x*blockDim.x))] = Pvalue;
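
This is just the standard row-major mapping: each thread writes element (global column) + (global row) * (output row width), where the row width gridDim.x*blockDim.x equals wB once the grid is sized as above.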

When I made the above two changes to your code, I got what I believe are correct results throughout the array. And it took about 32 seconds on my machine to run it. (Note that I haven't tried fixing your original max-finding code -- see below for a better approach.)
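
Also, echoing the comment above: the posted code does no CUDA API error checking at all. A minimal sketch of how the launch could be checked, using the cudaCheckErrors macro defined in the full listing below:

MatrixMulKernel<<<dimGrid, dimBlock>>>( Md, Nd, Pd, max);
cudaCheckErrors("kernel launch fail");   // catches launch/configuration errors
cudaMemcpy(C, Pd, size_C, cudaMemcpyDeviceToHost);
cudaCheckErrors("memcpy fail");          // the blocking copy also surfaces kernel execution errors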

Based on your previous question, you seemed to be concerned about speed. If you want to do a fast matrix multiply, you should use cublas. The following code shows how to use cublas to multiply two ordinary C-style matrices (they don't have to be square). I've also included a column-max finding kernel that will be fast when the number of columns is large (say, over 500 or so; you have 4096 columns in your example). For small numbers of columns there may be quicker ways to perform this function, but a small number of columns also suggests that the overall problem size may be small, so the speed of this piece of code will not really be an issue.

Here's the code:

#include <stdio.h>
#include <cublas_v2.h>
#define VERBOSE 1
#define nTPB 64
#define ROW_A 4
#define COL_A 4
#define ROW_B COL_A
#define COL_B 4
#define ROW_C ROW_A
#define COL_C COL_B
#define SIZ_A (ROW_A*COL_A)
#define SIZ_B (ROW_B*COL_B)
#define SIZ_C (ROW_C*COL_C)



// error check macros
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

// for CUBLAS V2 API
#define cublasCheckErrors(fn) \
    do { \
        cublasStatus_t __err = fn; \
        if (__err != CUBLAS_STATUS_SUCCESS) { \
            fprintf(stderr, "Fatal cublas error: %d (at %s:%d)\n", \
                (int)(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__global__ void col_max(float *mat, float *max, unsigned int *midx, unsigned int rows, unsigned int cols){
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  if (idx < cols){
    float tempmax = mat[idx];
    unsigned int tempmidx = 0;
    for (int i = 1; i< rows; i++)
      if (mat[idx + (i*cols)] > tempmax){
        tempmax = mat[idx + (i*cols)];
        tempmidx = i;}
    max[idx] = tempmax;
    midx[idx] = tempmidx;
  }
}

int main(){

  float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C, *h_max, *d_max;
  unsigned int *h_idx, *d_idx;

  h_A = (float *)malloc(SIZ_A*sizeof(float));
  if (h_A==0) {printf("malloc fail\n"); return -1;}
  h_B = (float *)malloc(SIZ_B*sizeof(float));
  if (h_B==0) {printf("malloc fail\n"); return -1;}
  h_C = (float *)malloc(SIZ_C*sizeof(float));
  if (h_C==0) {printf("malloc fail\n"); return -1;}
  h_max = (float *)malloc(COL_C*sizeof(float));
  if (h_max==0) {printf("malloc fail\n"); return -1;}
  h_idx = (unsigned int*)malloc(COL_C*sizeof(unsigned int));

  if (h_idx==0) {printf("malloc fail\n"); return -1;}

  cudaMalloc((void **)&d_A, SIZ_A*sizeof(float));
  cudaMalloc((void **)&d_B, SIZ_B*sizeof(float));
  cudaMalloc((void **)&d_C, SIZ_C*sizeof(float));
  cudaMalloc((void **)&d_max, COL_C*sizeof(float));
  cudaMalloc((void **)&d_idx, COL_C*sizeof(unsigned int));
  cudaCheckErrors("cuda malloc fail");

  // initialize data
  for (int i=0; i< SIZ_A; i++) h_A[i] = (float)(i+1);
  for (int i=0; i< SIZ_B; i++) h_B[i] = (float)(i+2);

  cudaMemcpy(d_A, h_A, SIZ_A*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, SIZ_B*sizeof(float), cudaMemcpyHostToDevice);
  cudaCheckErrors("cuda memcpy 1 fail");
  const float alpha = 1.0f;
  const float beta  = 0.0f;
  cublasHandle_t handle;
  cublasCheckErrors(cublasCreate(&handle));
  // C = A*B
  // due to cublas expecting column-major storage, parameters
  // are scrambled
  cublasCheckErrors(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, COL_B, ROW_A, COL_A, &alpha, d_B, COL_B, d_A, COL_A, &beta, d_C, COL_C));
  cudaMemcpy(h_C, d_C, SIZ_C*sizeof(float), cudaMemcpyDeviceToHost);
  cudaCheckErrors("cuda memcpy 2 fail");
  col_max<<<(COL_C + nTPB - 1)/nTPB, nTPB>>>(d_C, d_max, d_idx, ROW_C, COL_C);
  cudaCheckErrors("kernel launch fail");
  cudaMemcpy(h_max, d_max, COL_C*sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(h_idx, d_idx, COL_C*sizeof(unsigned int), cudaMemcpyDeviceToHost);
  cudaCheckErrors("cuda memcpy 3 fail/kernel fail");

  if (VERBOSE){
    printf("A: \n");
    for (int i=0; i< ROW_A; i++){
      for (int j=0; j< COL_A; j++)
        printf("%7.5G", h_A[j+(i*COL_A)]);
      printf("\n");}
    printf("B: \n");
    for (int i=0; i< ROW_B; i++){
      for (int j=0; j< COL_B; j++)
        printf("%7.5G", h_B[j+(i*COL_B)]);
      printf("\n");}
    printf("C = A*B: \n");
    for (int i=0; i< ROW_C; i++){
      for (int j=0; j< COL_C; j++)
        printf("%7.5G", h_C[j+(i*COL_C)]);
      printf("\n");}
    printf("COLUMN MAX:\n");
    for (int i=0; i< COL_C; i++)
      printf("%7.5G", h_max[i]);
    printf("\nCOLUMN MAX IDX:\n");
    for (int i=0; i< COL_C; i++)
      printf("%7d", h_idx[i]);
  }
  printf("\n finished!\n");
  return 0;
}

Here's what I used to compile:

$ nvcc -arch=sm_20 -O3 -o t221 t221.cu -lcublas

And here's the sample output:

$ cuda-memcheck ./t221
========= CUDA-MEMCHECK
A:
      1      2      3      4
      5      6      7      8
      9     10     11     12
     13     14     15     16
B:
      2      3      4      5
      6      7      8      9
     10     11     12     13
     14     15     16     17
C = A*B:
    100    110    120    130
    228    254    280    306
    356    398    440    482
    484    542    600    658
COLUMN MAX:
    484    542    600    658
COLUMN MAX IDX:
      3      3      3      3
 finished!
========= ERROR SUMMARY: 0 errors
$

When I extended my code to handle the same sizes you indicated, (A = 4096x128, B=128x4096) it took about 1 second on my machine. So it's much faster than your code. However, when I take your code and comment out your call to MaxFunction in the kernel, it also only takes about 1 second to compute the matrix multiply result. So if you wanted to keep your matrix multiply code (i.e. not use cublas) you could break the code into 2 kernels, and use your multiply routine in the first kernel with my max-finding routine (col_max) in the second kernel, and also probably get a pretty fast result.
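
For example, a minimal sketch of that two-kernel structure, assuming MatrixMulKernel is modified to drop the MaxFunction call (and the max parameter), and assuming hypothetical device buffers d_max (wB floats) and d_idx (wB unsigned ints) for the results:

MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);                 // multiply only
cudaCheckErrors("matrix multiply kernel fail");
col_max<<<(wB + nTPB - 1)/nTPB, nTPB>>>(Pd, d_max, d_idx, hA, wB);  // one thread per output column
cudaCheckErrors("col_max kernel fail");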

As @talonmies indicated, if you are running on a Windows machine, be sure you are aware of the ramifications of the Windows TDR mechanism. (search that in the upper right corner search box if needed)

Robert Crovella
  • Worth pointing out that the matrix multiplication code posted in the original question actually works fine. I suspect it is being run on a slow device and hitting a display driver watchdog timer. There is actually no question here, but thanks for posting a sensible answer anyway... – talonmies Aug 17 '13 at 20:05
  • I've now edited my answer with my fixes to get the OP's code posted in this question to generate (I think) proper matrix multiply results. I'm pretty convinced the OP's code in this question does not produce correct matrix multiply results. – Robert Crovella Aug 18 '13 at 02:58
  • As long as the matrices are square (wA=wB=hB) and round multiples of the tile size (so 32), the matrix multiplication code worked at every size I tried it at from 128 to 4096. It is easy to verify, every entry should be wA*32*21. This comes up over and over again, the SDK matrix multiply code get misused and then questions/complaints get posted about why it doesn't work... – talonmies Aug 18 '13 at 08:48
  • Thank you very much to both. The CUBLAS solution takes about 0.842ms – Krsrb1 Aug 21 '13 at 05:42
  • When I compared it with Matlab: without GPU arrays, Matlab took ~900 ms, and with gpuArrays ~700 ms. For the above code, when I increased nTPB to 128 the timing was ~680 ms. Note: matrix sizes A(4000,128) and B(128,19800); GPU: nVidia GeForce 410m – Krsrb1 Aug 21 '13 at 05:56