1

I'm trying to introduce some CUDA optimizations in one of my projects. But I think I'm doing something wrong here. I want to implement a simple matrix-vector multiplication (result = matrix * vector). But when I want to copy the result back to the host, errors will occur (cudaErrorLaunchFailure). Is there an error in my kernel (matrixVectorMultiplicationKernel) or did I call cudaMemcpy incorrectly? I found no helpful documentation for this kind of error state. I think this completely destroys the state of the GPU because I cannot call any CUDA kernel without getting this error again after the first occurrence.

edit#1: Updated code, following leftaroundabout's advice.

// code
...
Eigen::MatrixXf matrix(M, N); // matrix.data() usually should return a float array
Eigen::VectorXf vector(N);    // same here for vector.data()
Eigen::VectorXf result(M);
... // fill matrix and vector
float* matrixOnDevice = copyMatrixToDevice(matrix.data(), matrix.rows(), matrix.cols());
matrixVectorMultiplication(matrixOnDevice, vector.data(), result.data(), matrix.rows(), cm.cols());
... // clean up

// helper functions
float* copyMatrixToDevice(const float* matrix, int mRows, int mCols)
{
  float* matrixOnDevice;
  const int length = mRows*mCols;
  const int size = length * sizeof(float);
  handleCUDAError(cudaMalloc((void**)&matrixOnDevice, size));
  handleCUDAError(cudaMemcpy(matrixOnDevice, matrix, size, cudaMemcpyHostToDevice));
  return matrixOnDevice;
}

void matrixVectorMultiplication(const float* matrixOnDevice, const float* vector, float* result, int mRows, int mCols)
{
  const int vectorSize = mCols*sizeof(float);
  const int resultSize = mRows*sizeof(float);
  const int matrixLength = mRows*mCols;
  float* deviceVector;
  float* deviceResult;
  handleCUDAError(cudaMalloc((void**)&deviceVector, vectorSize));
  handleCUDAError(cudaMalloc((void**)&deviceResult, resultSize));
  handleCUDAError(cudaMemset(deviceResult, 0, resultSize));
  handleCUDAError(cudaMemcpy(deviceVector, vector, vectorSize, cudaMemcpyHostToDevice));
  int threadsPerBlock = 256;
  int blocksPerGrid = (mRows + threadsPerBlock - 1) / threadsPerBlock;
  matrixVectorMultiplicationKernel<<<blocksPerGrid, threadsPerBlock>>>(matrixOnDevice, vector, result, mRows, mCols, matrixLength);
  // --- no errors yet ---
  handleCUDAError(cudaMemcpy(result, deviceResult, resultSize, cudaMemcpyDeviceToHost)); // cudaErrorLaunchFailure
  handleCUDAError(cudaFree(deviceVector)); // cudaErrorLaunchFailure
  handleCUDAError(cudaFree(deviceResult)); // cudaErrorLaunchFailure
}

__global__ void matrixVectorMultiplicationKernel(const float* matrix, const float* vector, float* result, int mRows, int mCols, int length)
{
  int row = blockDim.x * blockIdx.x + threadIdx.x;
  if(row < mRows)
  {
    for(int col = 0, mIdx = row*mCols; col < mCols; col++, mIdx++)
      result[row] += matrix[mIdx] * vector[col];
  }
}
alfa
  • 3,058
  • 3
  • 25
  • 36

1 Answers1

3

Your problem is that void copyMatrixToDevice(..., float* matrixOnDevice, ...) takes this pointer by value, i.e. it can't "output" the device matrix. You can do it with void copyMatrixToDevice(..., float** matrixOnDevice, ...), called by

copyMatrixToDevice(matrix.data(), &matrixOnDevice, matrix.rows(), matrix.cols());

There is the same problem with result in matrixVectorMultiplication.

In the long term, in C++ you should put a proper class abstraction layer around all of this.

leftaroundabout
  • 117,950
  • 5
  • 174
  • 319
  • OK, usually I should have found the first error on my own (`**matrixOnDevice`). Thanks! This is the reason why I have to pass a (void**) to cudaMalloc. The second advice is not clear to me. cudaMemcpy does not change the adress of `result`. Why is it not sufficient to pass it as float*? Anyway, the errors are still there. It did not solve the problem completely. – alfa Apr 16 '12 at 17:23
  • Right, I didn't properly look at `matrixVectorMultiplication`. That one does indeed work, but you're not being particularly consistent. – leftaroundabout Apr 16 '12 at 17:25
  • OK, I found the last error now, I should call the kernel with adresses that are located on the device... `matrixVectorMultiplicationKernel<<>>(matrixOnDevice, **deviceVector**, **deviceResult**, mRows, mCols, matrixLength);` – alfa Apr 16 '12 at 18:01