
I am attempting to use the cuBLAS functions cublasSgetrfBatched and cublasSgetriBatched to find the inverse of a square matrix. This portion of code is part of a larger program in which I am trying to minimize unnecessary memory allocations and copies. As part of that effort I have been using nvprof to profile both the total application and the individual functions. I discovered that once I included the Sgetrf/Sgetri calls, nvprof would report:

==7734== Warning: Found 20 invalid records in the result.
==7734== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
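
For reference, the profiling was done with ordinary nvprof runs, something along these lines (the exact commands are not shown in the program below and are only illustrative):

nvprof ./app
nvprof --print-gpu-trace ./app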

I have isolated the offending code and created a working application below.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cuda_runtime_api.h>
#include <cublas_v2.h>
#define CUDA(call) do {     \
    cudaError_t err = call; \
    if (err != cudaSuccess)                     \
    {                                           \
        printf("CUDA ERROR at line : %d, file : %s, %s\n", __LINE__, __FILE__, cudaGetErrorString(err)); \
        exit(-1);                          \
    }                                      \
    } while(0);


#define cublascall(call) \
    do \
    { \
        cublasStatus_t status = (call); \
        if (CUBLAS_STATUS_SUCCESS != status) \
        { \
            fprintf(stderr,"CUBLAS Error:\nFile = %s\nLine = %d\nCode = %d\n", __FILE__, __LINE__, status); \
            cudaDeviceReset(); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

void invertMatrixGPU(float* a_i, float* c_o, int n, int ldda, cublasHandle_t hdl)
{
    int *p = (int *)malloc(n*sizeof(int));
    int *info = (int *)malloc(sizeof(int));
    int batch;
    int INFOh = 0;
    batch = 1;
    float **a = NULL;
    cudaMalloc(a,sizeof(float*));
    *a = a_i;

    float **c = NULL;
    cudaMalloc(c,sizeof(float*));
    *c = c_o;
    // See
    // http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
    // http://stackoverflow.com/questions/27094612/cublas-matrix-inversion-from-device

    cublascall(cublasSgetrfBatched(hdl, n, a, ldda, p, info, batch));
    cudaMemcpy(&INFOh,info,sizeof(int),cudaMemcpyDeviceToHost);

    if(INFOh != 0)
    {
        fprintf(stderr, "Inversion Failed: Matrix is singular\n");
        cudaDeviceReset();
        exit(EXIT_FAILURE);
    }
    cublascall(cublasSgetriBatched(hdl, n, (const float **)a, ldda, p, c, ldda, info, batch));
    cudaMemcpy(&INFOh,info,sizeof(int),cudaMemcpyDeviceToHost);

    if(INFOh != 0)
    {
        fprintf(stderr, "Inversion Failed: Matrix is singular\n");
        cudaDeviceReset();
        exit(EXIT_FAILURE);
    }
}

int main() {
    // Initialize GPU for CUDA
    CUDA(cudaSetDevice(0));

    cublasHandle_t handle;
    cublasCreate(&handle);

    float *matrix = (float*)malloc(sizeof(float)*4*4);
    for (int i=0;i<16;i++)
    {
        matrix[i] = i;
    }

    float *matrix_d = NULL;
    CUDA(cudaMalloc(&matrix_d,sizeof(float)*4*4));
    CUDA(cudaMemcpy(matrix_d,matrix,sizeof(float)*4*4,cudaMemcpyHostToDevice));
    float *matrix_di = NULL;
    CUDA(cudaMalloc(&matrix_di,sizeof(float)*4*4));

    for (int i = 0;i<10;i++){
        invertMatrixGPU(matrix_d, matrix_di,4,4, handle);
    }
    free(matrix);
    cudaFree(matrix_d);
    cudaFree(matrix_di);
    cublasDestroy(handle);

}

I believe the issue is the casting of the one-dimensional memory allocation of matrix_d into the array of pointers that is passed to cublasSgetrfBatched and cublasSgetriBatched. If this is the issue, can anyone recommend a method that minimizes data allocation and copies but still satisfies the batched getrf/getri requirement for an array of pointers?

cshea
  • Your `a` and `c` arrays must be *device* arrays, not host arrays. Copy them to the device before calling the batched CUBLAS getri routine – talonmies Jul 21 '15 at 13:16
  • Thank you for the pointer. I did not realize that the original code was for compute capability 3.5, and I am working on 3.0 and 3.2 devices. My question is how I can achieve the `*a = a_i` as in the original code? `a_i` is a device pointer to a linear memory space. – cshea Jul 21 '15 at 17:55
  • 1
    A fully-worked example of how to use these cublas functions for inversion of matrices in "ordinary" host-code usage is given [here](http://stackoverflow.com/questions/22887167/cublas-incorrect-inversion-for-matrix-with-zero-pivot/23045191#23045191). It happens to be doing an inversion of a 17x17 matrix for discussion of a historical issue, but that has no impact on the setup or API calling sequence. I'm inclined to mark this question as a duplicate of that one. Are you able to sort out the method with that example? – Robert Crovella Jul 25 '15 at 13:54
  • Yes, that worked. What I was really overlooking was the device allocation for `p` and `info`. – cshea Jul 26 '15 at 16:14
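
Based on the comments: the pointer arrays handed to the batched routines must live in device memory, and the pivot and info arrays must be allocated with cudaMalloc rather than malloc. A minimal sketch of a corrected function, following the structure of invertMatrixGPU above (variable names such as a_d, c_d, p_d, and info_d are mine, and the sketch is untested as written):

void invertMatrixGPU(float* a_i, float* c_o, int n, int ldda, cublasHandle_t hdl)
{
    const int batch = 1;
    int INFOh = 0;

    // Pivot and info arrays must be device allocations for the batched routines.
    int *p_d = NULL;
    int *info_d = NULL;
    CUDA(cudaMalloc(&p_d, n * sizeof(int)));
    CUDA(cudaMalloc(&info_d, sizeof(int)));

    // The batched API expects an array of matrix pointers that itself lives in
    // device memory; only the pointer values are copied here, not the matrix data.
    float **a_d = NULL;
    float **c_d = NULL;
    CUDA(cudaMalloc(&a_d, sizeof(float*)));
    CUDA(cudaMalloc(&c_d, sizeof(float*)));
    CUDA(cudaMemcpy(a_d, &a_i, sizeof(float*), cudaMemcpyHostToDevice));
    CUDA(cudaMemcpy(c_d, &c_o, sizeof(float*), cudaMemcpyHostToDevice));

    cublascall(cublasSgetrfBatched(hdl, n, a_d, ldda, p_d, info_d, batch));
    CUDA(cudaMemcpy(&INFOh, info_d, sizeof(int), cudaMemcpyDeviceToHost));
    if (INFOh != 0)
    {
        fprintf(stderr, "Inversion Failed: Matrix is singular\n");
        cudaDeviceReset();
        exit(EXIT_FAILURE);
    }

    cublascall(cublasSgetriBatched(hdl, n, (const float **)a_d, ldda, p_d, c_d, ldda, info_d, batch));
    CUDA(cudaMemcpy(&INFOh, info_d, sizeof(int), cudaMemcpyDeviceToHost));
    if (INFOh != 0)
    {
        fprintf(stderr, "Inversion Failed: Matrix is singular\n");
        cudaDeviceReset();
        exit(EXIT_FAILURE);
    }

    CUDA(cudaFree(a_d));
    CUDA(cudaFree(c_d));
    CUDA(cudaFree(p_d));
    CUDA(cudaFree(info_d));
}

The extra traffic relative to the original attempt is a single sizeof(float*) copy per matrix, so this stays close to the goal of minimizing allocations and copies; for repeated inversions, the small device buffers could also be allocated once and reused across calls.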
