Cuda - nvcc - No kernel image is available for execution on the device. What is the problem?

Question

I'm trying to use nvcc with the simplest possible example, but it doesn't work correctly. I'm compiling and executing the example from https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/, but my server can't execute the __global__ function. I modified the code to report errors, and I get the following message: "no kernel image is available for execution on the device".

My GPU is a Quadro 6000 and the CUDA version is 9.0.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  y[i] = 10.0; //a*x[i] + y[i];  
}

int main(int argc, char *argv[])
{
  int N = 120;
  int nDevices;
  float *x, *y, *d_x, *d_y;

  cudaError_t err = cudaGetDeviceCount(&nDevices);
  if (err != cudaSuccess) 
    printf("%s\n", cudaGetErrorString(err));
  else
    printf("Number of devices %d\n", nDevices);

  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float)); 
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Launch the kernel with a single block of one thread
  saxpy<<<1, 1>>>(N, 2.0f, d_x, d_y);
  cudaDeviceSynchronize(); 

  err = cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);  

  printf("%s\n",cudaGetErrorString(err));

  cudaError_t errSync  = cudaGetLastError();
  cudaError_t errAsync = cudaDeviceSynchronize();
  if (errSync != cudaSuccess) 
    printf("Sync kernel error: %s\n", cudaGetErrorString(errSync));
  if (errAsync != cudaSuccess)
    printf("Async kernel error: %s\n", cudaGetErrorString(errAsync)); 


  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}

Execution command

bash-4.1$ nvcc  -o sapx simples_cuda.cu
bash-4.1$ ./sapx
Number of devices 1
no error
Sync kernel error: no kernel image is available for execution on the device

score 7 · Answer 1 · answered Apr 08 '19 at 14:35

GPUs of compute capability less than 2.0 are only supported by CUDA toolkits of version 6.5 and older.

GPUs of compute capability less than 3.0 (but greater than or equal to 2.0) are only supported by CUDA toolkits of version 8.0 and older.

Your Quadro 6000 is a compute capability 2.0 GPU. You can determine this programmatically with the deviceQuery CUDA sample code, or via a Google search. It is not supported by CUDA 9.0.
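
The two support rules above can be checked on a given machine. This is a minimal sketch, assuming device 0 is the GPU of interest. The helper name toolkit_supports is made up for illustration, and it encodes only the two cut-offs stated in this answer, not architectures dropped by later toolkits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: encodes ONLY the two rules from the answer above.
// runtime_version uses the CUDA encoding 1000*major + 10*minor (9.0 -> 9000).
bool toolkit_supports(int cc_major, int runtime_version)
{
  if (cc_major < 2) return runtime_version < 7000;  // cc 1.x: CUDA 6.5 and older
  if (cc_major < 3) return runtime_version < 9000;  // cc 2.x: CUDA 8.0 and older
  return true;  // assumption: newer GPUs are not covered by these two rules
}

int main()
{
  int runtime_version = 0;
  cudaError_t err = cudaRuntimeGetVersion(&runtime_version);
  if (err != cudaSuccess) {
    printf("cudaRuntimeGetVersion: %s\n", cudaGetErrorString(err));
    return 1;
  }

  cudaDeviceProp prop;
  err = cudaGetDeviceProperties(&prop, 0);  // device 0 is an assumption
  if (err != cudaSuccess) {
    printf("cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
    return 1;
  }

  printf("%s: compute capability %d.%d, CUDA runtime %d.%d\n",
         prop.name, prop.major, prop.minor,
         runtime_version / 1000, (runtime_version % 1000) / 10);

  if (!toolkit_supports(prop.major, runtime_version))
    printf("This GPU is too old for the CUDA toolkit in use.\n");
  return 0;
}
```

For a compute capability 2.0 device with CUDA 9.0, the helper returns false, which matches the diagnosis above.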

score 3 · Answer 2 · answered Sep 15 '20 at 17:17

You should pass the compute capability of your video card as a parameter to the nvcc compiler. In my case (Windows/Visual Studio 2017) I set this in the Code Generation field. So, as @einpoklum answered before, add the gencode parameter like this: -gencode arch=${COMPUTE_CAPABILITY},code=${SM_CAPABILITY}, where ${COMPUTE_CAPABILITY} and ${SM_CAPABILITY} are taken from the following pairs (you can add them all, as VS2017 does):

{COMPUTE_CAPABILITY},{SM_CAPABILITY}    

compute_35,sm_35
compute_37,sm_37
compute_50,sm_50
compute_52,sm_52
compute_60,sm_60
compute_61,sm_61
compute_70,sm_70
compute_75,sm_75
compute_80,sm_80

D:\Program Files\nVidia\CUDA Samples\MySamples\IntroToCUDA_1\IntroToCUDA_1>"D:\Program Files\nVidia\GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_35,code=\"sm_35,compute_35\" -gencode=arch=compute_37,code=\"sm_37,compute_37\" -gencode=arch=compute_50,code=\"sm_50,compute_50\" -gencode=arch=compute_52,code=\"sm_52,compute_52\" -gencode=arch=compute_60,code=\"sm_60,compute_60\" -gencode=arch=compute_61,code=\"sm_61,compute_61\" -gencode=arch=compute_70,code=\"sm_70,compute_70\" -gencode=arch=compute_75,code=\"sm_75,compute_75\" -gencode=arch=compute_80,code=\"sm_80,compute_80\" --use-local-env -ccbin "D:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\VC\Tools\MSVC\14.16.27023\bin\HostX86\x64" -x cu   -I"D:\Program Files\nVidia\GPU Computing Toolkit\CUDA\v11.0\include" -I"D:\Program Files\nVidia\GPU Computing Toolkit\CUDA\v11.0\include"  -G   --keep-dir x64\Debug -maxrregcount=0  --machine 64 --compile -cudart static  -g   -D_DEBUG -D_CONSOLE -D_UNICODE -DUNICODE -Xcompiler "/EHsc /W3 /nologo /Od /Fdx64\Debug\vc141.pdb /FS /Zi /RTC1 /MDd " -o x64\Debug\IntroToCUDA_1.cu.obj "D:\Program Files\nVidia\CUDA Samples\MySamples\IntroToCUDA_1\IntroToCUDA_1\IntroToCUDA_1.cu"

You can check the compute capability (CC) of your video card with the deviceQuery example from the CUDA Samples.

einpoklum · Answer 3 · 2021-04-14T10:36:31.773

2

Adding to @RobertCrovella's answer:

When compiling with nvcc, you should always set appropriate flags to generate binary kernel images for the microarchitecture / compute capability you intend to run on. For example: -gencode arch=compute_${COMPUTE_CAPABILITY},code=compute_${COMPUTE_CAPABILITY}, with, say, COMPUTE_CAPABILITY=61.

Read nvcc --help for more information on these flags (although, to be honest, it's a bit of a murky subject).

edited Apr 14 '21 at 10:36

answered Apr 08 '19 at 14:41

einpoklum


Your proposed set of switches doesn't generate a binary kernel image. It generates PTX. – Robert Crovella Apr 08 '19 at 15:59
@RobertCrovella: I didn't mean for those to be the only switches – einpoklum Apr 09 '19 at 11:19
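
The distinction raised in the comments (code=compute_XX embeds PTX, code=sm_XX embeds a binary image) can be sketched as follows. The file name saxpy_check.cu, the helper encode_cc, and the value 61 are assumptions for illustration. The program prints the versions the runtime reports for the kernel and does not claim how they should be interpreted on any particular setup.

```cuda
/* Hypothetical build lines (file name and the value 61 are assumptions):

   PTX only, compiled for the GPU by the driver when the program loads:
     nvcc -gencode arch=compute_61,code=compute_61 -o saxpy_check saxpy_check.cu

   Binary image for sm_61 plus PTX as a fallback for newer GPUs:
     nvcc -gencode arch=compute_61,code=sm_61 -gencode arch=compute_61,code=compute_61 -o saxpy_check saxpy_check.cu
*/
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];  // bounds check guards extra threads
}

// Same encoding cudaFuncAttributes uses for its version fields: major*10 + minor.
int encode_cc(int major, int minor) { return major * 10 + minor; }

int main()
{
  cudaDeviceProp prop;
  cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 is an assumption
  if (err != cudaSuccess) {
    printf("cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("Device compute capability: %d\n", encode_cc(prop.major, prop.minor));

  // Ask the runtime which image it would use for this kernel on this device.
  cudaFuncAttributes attr;
  err = cudaFuncGetAttributes(&attr, saxpy);
  if (err != cudaSuccess) {
    // A failure here suggests the runtime has no usable image for the kernel.
    printf("cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("binaryVersion %d, ptxVersion %d\n", attr.binaryVersion, attr.ptxVersion);
  return 0;
}
```

Neither build line would help with the Quadro 6000 in the question, since, per the first answer, CUDA 9.0 does not support compute capability 2.0 at all.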
