
I am trying to run a matrix inversion from the device. This logic works fine if called from the host.

Compilation line is as follows (Linux):

nvcc -ccbin g++ -arch=sm_35 -rdc=true simple-inv.cu -o simple-inv -lcublas_device -lcudadevrt

I get the following warning that I cannot seem to resolve. (My GPU is Kepler; I don't know why it is trying to link against Maxwell routines. I have CUDA 6.5.14):

nvlink warning : SM Arch ('sm_35') not found in '/usr/local/cuda/bin/../targets/x86_64-linux/lib/libcublas_device.a:maxwell_sm50_sgemm.o'

The program runs with:

handle 0 n = 3
simple-inv.cu:63 Error [an illegal memory access was encountered]

The test program is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define PERR(call) \
  if (call) {\
   fprintf(stderr, "%s:%d Error [%s] on "#call"\n", __FILE__, __LINE__,\
      cudaGetErrorString(cudaGetLastError()));\
   exit(1);\
  }
#define ERRCHECK \
  if (cudaPeekAtLastError()) { \
    fprintf(stderr, "%s:%d Error [%s]\n", __FILE__, __LINE__,\
       cudaGetErrorString(cudaGetLastError()));\
    exit(1);\
  }

__global__ void
inv_kernel(float *a_i, float *c_o, int n)
{ 
  int p[3], info[1], batch;
  cublasHandle_t hdl;
  cublasStatus_t status = cublasCreate_v2(&hdl);
  printf("handle %d n = %d\n", status, n);

  info[0] = 0;
  batch = 1;
  float *a[] = {a_i};
  const float *aconst[] = {a_i};
  float *c[] = {c_o};
  // See
  // http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
  //http://stackoverflow.com/questions/27094612/cublas-matrix-inversion-from-device

  status = cublasSgetrfBatched(hdl, n, a, n, p, info, batch);
  __syncthreads();
  printf("rf %d info %d\n", status, info[0]);
  status = cublasSgetriBatched(hdl, n, aconst, n, p,
      c, n, info, batch);
  __syncthreads();
  printf("ri %d info %d\n", status, info[0]);

  cublasDestroy_v2(hdl);
  printf("done\n");
}
static void
run_inv(float *in, float *out, int n)
{
  float *a_d, *c_d;

  PERR(cudaMalloc(&a_d, n*n*sizeof(float)));
  PERR(cudaMalloc(&c_d, n*n*sizeof(float)));
  PERR(cudaMemcpy(a_d, in, n*n*sizeof(float), cudaMemcpyHostToDevice));

  inv_kernel<<<1, 1>>>(a_d, c_d, n);

  cudaDeviceSynchronize();
  ERRCHECK;

  PERR(cudaMemcpy(out, c_d, n*n*sizeof(float), cudaMemcpyDeviceToHost));
  PERR(cudaFree(a_d));
  PERR(cudaFree(c_d));
}

int
main(int argc, char **argv)
{
  float c[9];
  float a[] = {
    1,   2,   3,
    0,   4,   5,
    1,   0,   6 };

  run_inv(a, c, 3);
  return 0;
}

I have followed the guide at http://docs.nvidia.com/cuda/cublas/index.html#device-api section 2.1.9, but I suspect I have overlooked something.

Note: Edited on 11/24 to use correct pointer inputs. This still reports illegal memory access inside the kernel.

Bob
  • Line 63 in the code you have posted is whitespace. Where *exactly* is the error occurring in the code? – talonmies Nov 23 '14 at 21:42
  • Line 64, during device sync. I must have posted an older output. I suspect it occurs during the call to cublasSgetrfBatched. – Bob Nov 23 '14 at 21:47
  • `(float**)a_i` looks very suspicious. Surely you mean to pass the address of `a_i` and not its value? – talonmies Nov 23 '14 at 22:03
  • Yes, I don't understand this, except that it works when run this way from the host. I suspect it has something to do with the Fortran compatibility they talk about in the NVIDIA reference. I recall somewhere in the reference that the matrix is a contiguous block, as I am passing. I have seen other unrelated C programs where the rows are allocated as a vector and then set to positions in the memory region, but I don't think that is what they are looking for in this case. See this example: http://stackoverflow.com/questions/22887167/cublas-incorrect-inversion-for-matrix-with-zero-pivot – Bob Nov 23 '14 at 22:11
  • I haven't worked with cublas in a while... but you are calling cublas from kernel code? Is that possible? Shouldn't you call the matrix inversion routine from host code? I'm thinking cublas will create the proper number of blocks/threads depending on your device. What would happen if you created, say, 32K threads... would all of them make a call to cublas? – VAndrei Nov 23 '14 at 22:15
  • @VAndrei: yes, it is possible, and your comment is completely irrelevant to the question. – talonmies Nov 23 '14 at 22:16
  • @Bob: The code you link to and your code are different, and the difference is that you have an illegal cast. `*a[] = {a_i}; cublasSgetrfBatched(..., a, ....)` and `cublasSgetrfBatched(..., (float**)a_i, ...)` are not equivalent, and if you think they are, then you need to revise the theory of pointers in C++. – talonmies Nov 23 '14 at 22:22
  • Good point. I have no idea what will happen when I ramp this up, and I don't understand the dynamic parallelism section in the NVIDIA reference. I was hoping to prove that it works for a single call first. The larger kernel I am using runs on the order of 1e6 blocks with 64 grids to compute a hand-written QR decomposition in parallel. This works fine. I need to add a matrix inversion following the QRD, and I was hoping that cublas would save me some time in writing it myself. – Bob Nov 23 '14 at 22:25
  • @talonmies Thanks. I see it now. This represents the number of matrices in the batch, which is one in this case. Uncommenting the lines above, I get cublas status -4130. This is closer. – Bob Nov 23 '14 at 22:34

2 Answers


NOTE: The ability to call cublas functions from device code has been removed from CUDA as of CUDA 10.0. The description in this answer only pertains to CUDA 9.x usage and prior. See here.

The warnings about sm_50 are benign. That's my way of saying "they can be safely ignored in this case".

Regarding the code you currently have posted, the problem relates to what is described in the dynamic parallelism documentation around the use of thread-local memory here.

In a nutshell, local memory of the parent thread is "out of scope" in a child kernel launch. Although it's not entirely obvious, the cublas calls from device code are (attempting) to launch child kernels. This means that declarations like this:

int p[3], info[1];

will be problematic if those pointers (e.g. p, info) are passed to a child kernel. The numerical values of the pointers themselves will not be corrupted, but they will not point to anything "meaningful" in the memory space of the child kernel.

There are multiple ways to solve this, but one possible solution is to replace any stack/local allocations of this type with allocations from the "device heap" which can be made via in-kernel malloc.
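
Distilled to just the relevant change, that replacement might look like this (a sketch only; the fully worked program follows):

// Before: stack/local arrays -- their addresses are out of scope in the
// child kernels that the device-side cublas calls launch:
//   int p[3], info[1];
//   float *a[] = {a_i};
// After: equivalent storage on the device heap, which child kernels
// can safely dereference:
int *p    = (int *)malloc(3*sizeof(int));
int *info = (int *)malloc(sizeof(int));
float **a = (float **)malloc(sizeof(float *));
*a = a_i;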

Here is a fully worked code/example that seems to work correctly for me. The output seems to be correct for the inversion of the given sample matrix:

$ cat t605.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define PERR(call) \
  if (call) {\
   fprintf(stderr, "%s:%d Error [%s] on "#call"\n", __FILE__, __LINE__,\
      cudaGetErrorString(cudaGetLastError()));\
   exit(1);\
  }
#define ERRCHECK \
  if (cudaPeekAtLastError()) { \
    fprintf(stderr, "%s:%d Error [%s]\n", __FILE__, __LINE__,\
       cudaGetErrorString(cudaGetLastError()));\
    exit(1);\
  }

__global__ void
inv_kernel(float *a_i, float *c_o, int n)
{
  int *p = (int *)malloc(3*sizeof(int));
  int *info = (int *)malloc(sizeof(int));
  int batch;
  cublasHandle_t hdl;
  cublasStatus_t status = cublasCreate_v2(&hdl);
  printf("handle %d n = %d\n", status, n);

  info[0] = 0;
  batch = 1;
  float **a = (float **)malloc(sizeof(float *));
  *a = a_i;
  const float **aconst = (const float **)a;
  float **c = (float **)malloc(sizeof(float *));
  *c = c_o;
  // See
  // http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
  //http://stackoverflow.com/questions/27094612/cublas-matrix-inversion-from-device
  status = cublasSgetrfBatched(hdl, n, a, n, p, info, batch);
  __syncthreads();
  printf("rf %d info %d\n", status, info[0]);
  status = cublasSgetriBatched(hdl, n, aconst, n, p,
      c, n, info, batch);
  __syncthreads();
  printf("ri %d info %d\n", status, info[0]);
  cublasDestroy_v2(hdl);
  printf("done\n");
}
static void
run_inv(float *in, float *out, int n)
{
  float *a_d, *c_d;

  PERR(cudaMalloc(&a_d, n*n*sizeof(float)));
  PERR(cudaMalloc(&c_d, n*n*sizeof(float)));
  PERR(cudaMemcpy(a_d, in, n*n*sizeof(float), cudaMemcpyHostToDevice));

  inv_kernel<<<1, 1>>>(a_d, c_d, n);

  cudaDeviceSynchronize();
  ERRCHECK;

  PERR(cudaMemcpy(out, c_d, n*n*sizeof(float), cudaMemcpyDeviceToHost));
  PERR(cudaFree(a_d));
  PERR(cudaFree(c_d));
}

int
main(int argc, char **argv)
{
  float c[9];
  float a[] = {
    1,   2,   3,
    0,   4,   5,
    1,   0,   6 };

  run_inv(a, c, 3);
  for (int i = 0; i < 3; i++){
    for (int j = 0; j < 3; j++) printf("%f, ",c[(3*i)+j]);
    printf("\n");}

  return 0;
}
$ nvcc -arch=sm_35 -rdc=true -o t605 t605.cu -lcublas_device -lcudadevrt
nvlink warning : SM Arch ('sm_35') not found in '/shared/apps/cuda/CUDA-v6.5.14/bin/..//lib64/libcublas_device.a:maxwell_sgemm.asm.o'
nvlink warning : SM Arch ('sm_35') not found in '/shared/apps/cuda/CUDA-v6.5.14/bin/..//lib64/libcublas_device.a:maxwell_sm50_sgemm.o'
$ ./t605
handle 0 n = 3
rf 0 info 0
ri 0 info 0
done
1.090909, -0.545455, -0.090909,
0.227273, 0.136364, -0.227273,
-0.181818, 0.090909, 0.181818,
$

For CUDA 10.0 and newer users, I would suggest using ordinary batched cublas functions from host code. In particular for matrix inverse, one option is to use matinvBatched.
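
A minimal host-side sketch of that approach (my own variable names, the sample matrix from the question; note that cublasSmatinvBatched only supports n <= 32, the pointer arrays and the info array must reside in device memory, and error checking is trimmed for brevity):

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
  const int n = 3, batch = 1;
  float a[] = {
    1,   2,   3,
    0,   4,   5,
    1,   0,   6 };
  float c[9];
  float *a_d, *c_d;
  cudaMalloc(&a_d, n*n*sizeof(float));
  cudaMalloc(&c_d, n*n*sizeof(float));
  cudaMemcpy(a_d, a, n*n*sizeof(float), cudaMemcpyHostToDevice);

  // The batched API expects device arrays of device pointers and a
  // device-resident info array, one entry per matrix in the batch.
  float **a_batch, **c_batch;
  int *info_d;
  cudaMalloc(&a_batch, batch*sizeof(float *));
  cudaMalloc(&c_batch, batch*sizeof(float *));
  cudaMalloc(&info_d, batch*sizeof(int));
  cudaMemcpy(a_batch, &a_d, sizeof(float *), cudaMemcpyHostToDevice);
  cudaMemcpy(c_batch, &c_d, sizeof(float *), cudaMemcpyHostToDevice);

  cublasHandle_t hdl;
  cublasCreate(&hdl);
  cublasStatus_t status = cublasSmatinvBatched(hdl, n,
      (const float * const *)a_batch, n, c_batch, n, info_d, batch);
  cudaDeviceSynchronize();

  int info;
  cudaMemcpy(&info, info_d, sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(c, c_d, n*n*sizeof(float), cudaMemcpyDeviceToHost);
  printf("status %d info %d\n", status, info);
  for (int i = 0; i < n; i++){
    for (int j = 0; j < n; j++) printf("%f, ", c[(n*i)+j]);
    printf("\n");}
  cublasDestroy(hdl);
  cudaFree(a_d); cudaFree(c_d);
  cudaFree(a_batch); cudaFree(c_batch); cudaFree(info_d);
  return 0;
}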

Robert Crovella
  • Thanks. This works for me. I had originally allocated the p and info variables, but had not realized that I also needed to allocate the a, aconst, and c variables. After reading the local memory reference section this makes sense. I would imagine that n is allocated to the global memory heap since it is part of the kernel call parameters. The handle variable probably does not apply. – Bob Nov 25 '14 at 23:45
  • Other parameters like `n`, `batch`, etc., are passed by value. Parameters passed by value have no reference back to the calling environment. This is characteristic of C/C++, not a unique CUDA concept. In fact even the pointers are passed "by value". But when these pointer values get dereferenced in the child kernel, bad things happen. For non-pointer parameters, no such dereferencing is done in the child kernel, and everything works fine. In fact, this pass-by-value occurs actually at the cublas function call (and again, later, at the child kernel launch that occurs under the hood.) – Robert Crovella Nov 25 '14 at 23:53

Could it be that some CUDA function you are running is only supported by a different architecture, even though the documentation says everything you use is? I don't get the compiler warnings if I compile with -arch=sm_50, though I don't have an sm_50-capable device to test on...

Furthermore, those warnings suggest that the assembly for some function is not available for your architecture, so it was linked against another architecture's assembly, which your device does not support, and you end up with a weird error. I think you should take this to the NVIDIA developers, who understand better what their compiler is doing.

I have access to a compute capability 3.5 device, but unfortunately only with CUDA 6.0. Using your example (slightly fixed so it compiles: (const float *) -> (float *) on line 42), I don't get any compilation warnings (sadly, the same results though).

Also as mentioned in the comments:

(float**)a_i 

does not make a_i into a float **; you should take its address instead: &a_i
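
A short sketch of the type distinction (names here are illustrative only):

void illustrate(float *a_i)       // a_i points at n*n floats (one matrix)
{
  float **bad  = (float **)a_i;   // reinterprets matrix values as pointers
  float **good = &a_i;            // a genuine pointer to the pointer
  (void)bad; (void)good;          // silence unused-variable warnings
}
// Type-wise, "good" is what the batched API expects: an array of matrix
// pointers, one entry per matrix in the batch. Where that array must live
// (the device heap, for device-side calls) is the separate issue the other
// answer addresses.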

Changing those didn't fix the problem, but they are some leads you can explore.

XapaJIaMnu