CUDA GPU __global__ function does not complete

Question

```
__global__ void functionA()
{
  printf("functionA");
}

int main()
{
  printf("main1");
  functionA<<<1,1>>>();
  printf("main2");
}
```

I'm trying to run the simple test above, but the program only outputs "main1". It should also output "functionA" and "main2".

Can GPU code "print"? I didn't think GPU cores had access to the console (I thought they were individual cores without access to the rest of the world). — Thomas Matthews, Apr 23 '21 at 16:43

2b-t · Accepted Answer · 2021-04-23T21:45:56.550

There seem to be several possible reasons for this.

First of all, kernel launches are asynchronous, so you need to add
```
cudaDeviceSynchronize();
```
after the kernel launch in order to block the host (`main`) until the device has completed all tasks.
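As a minimal sketch of this first fix, the program from the question could look as follows. The helper name `launchAndWait` is hypothetical and only introduced for illustration, and the `\n` characters are added here so that each message ends up on its own line.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void functionA()
{
    // Device-side printf writes into a buffer that is flushed to the
    // console at synchronization points, not immediately.
    printf("functionA\n");
}

// Hypothetical helper: launches the kernel and waits for it to finish.
cudaError_t launchAndWait()
{
    functionA<<<1, 1>>>();
    // Blocks the host until the device has completed all queued work;
    // this is also a point at which the device printf buffer is flushed.
    return cudaDeviceSynchronize();
}

int main()
{
    printf("main1\n");
    cudaError_t status = launchAndWait();
    printf("main2\n");
    return status == cudaSuccess ? 0 : 1;
}
```

Without the synchronization, `main` can return before the kernel has run, so the device output may never appear.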
Furthermore, this can happen if you set the wrong GPU architecture/compute capability XX when compiling the code:
```
$ nvcc -gencode=arch=compute_XX,code=sm_XX -o my_app my_app.cu
```
In this case it seems that only the host code runs, while the parts meant for the accelerator are skipped. You can find an overview of the number XX for the different hardware generations over here. The K20m you are running has compute capability 3.5, i.e. XX is 35, so it should be
```
$ nvcc -gencode=arch=compute_35,code=sm_35 -o my_app my_app.cu
```
in your case.
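A kernel launch does not return an error code, so an architecture mismatch like this can go unnoticed unless the error state is queried explicitly. The following is a minimal sketch of such a check, assuming the kernel from the question; `checkCuda` is a hypothetical helper, while `cudaGetLastError`, `cudaDeviceSynchronize` and `cudaGetErrorString` are CUDA runtime functions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void functionA()
{
    printf("functionA\n");
}

// Hypothetical helper: reports a failed CUDA call and returns false.
bool checkCuda(cudaError_t status, const char* what)
{
    if (status != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
        return false;
    }
    return true;
}

int main()
{
    functionA<<<1, 1>>>();
    // Errors from the launch itself (for example, no binary for this GPU
    // architecture) are picked up here.
    bool ok = checkCuda(cudaGetLastError(), "kernel launch");
    // Errors that occur while the kernel executes surface at the sync.
    ok = checkCuda(cudaDeviceSynchronize(), "kernel execution") && ok;
    return ok ? 0 : 1;
}
```

If the binary was built for the wrong compute capability, the first check is where I would expect a message to show up; the exact wording depends on the CUDA version.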

This might also occur if you have multiple graphics accelerators in your system and the code is executed on the wrong one. Each graphics card/accelerator is assigned a particular device id. Device 0 is used by default and should normally be assigned to the most powerful device. The first time I compiled the code on my system, which contains a powerful Tesla K80 (architecture 37) and a low-power Quadro P620 (architecture 60), I selected 37 and got the same error as you, whereas with 60 the code ran. I then used the Querying Device Properties example to list the CUDA-capable devices and their device ids, only to find that on my system the Tesla K80 is assigned ids 1 and 2 while the simple Quadro P620 graphics card is assigned id 0. I assume this is because the K80 is deprecated in CUDA 11!
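For reference, a device listing along the lines of the Querying Device Properties example can be sketched as below. This is not the original example, only an approximation under the assumption that the device id, name and compute capability are all that is needed; `archNumber` is a hypothetical helper that turns the capability into the XX used in the `nvcc` flags.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: compute capability 3.5 becomes 35.
int archNumber(int major, int minor)
{
    return major * 10 + minor;
}

int main()
{
    int deviceCount = 0;
    cudaError_t status = cudaGetDeviceCount(&deviceCount);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                cudaGetErrorString(status));
        return 1;
    }

    for (int id = 0; id < deviceCount; ++id) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, id) != cudaSuccess) {
            continue;  // skip devices whose properties cannot be read
        }
        printf("Device id %d: %s, compute capability %d.%d (XX = %d)\n",
               id, prop.name, prop.major, prop.minor,
               archNumber(prop.major, prop.minor));
    }
    return 0;
}
```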

You can select the device inside your code with cudaSetDevice or change it when launching the program with
```
$ CUDA_VISIBLE_DEVICES="1" ./my_app
```
where 1 has to be replaced by the device id you wish to use. Doing so should make your code run without any problems.
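Selecting the device inside the code could look like the following sketch. Reading the id from the first command-line argument and the helper `parseDeviceId` are assumptions made for illustration; `cudaSetDevice` is the runtime call mentioned above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void functionA()
{
    printf("functionA\n");
}

// Hypothetical helper: device id from the first argument, default 0.
int parseDeviceId(int argc, char** argv)
{
    return argc > 1 ? atoi(argv[1]) : 0;
}

int main(int argc, char** argv)
{
    int id = parseDeviceId(argc, argv);

    // Make the chosen device current before launching any kernel.
    cudaError_t status = cudaSetDevice(id);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(%d) failed: %s\n",
                id, cudaGetErrorString(status));
        return 1;
    }

    functionA<<<1, 1>>>();
    status = cudaDeviceSynchronize();
    return status == cudaSuccess ? 0 : 1;
}
```

Note that `CUDA_VISIBLE_DEVICES` renumbers the devices it exposes: when the program is started with `CUDA_VISIBLE_DEVICES="1"`, that single device appears as id 0 inside the program, so the two mechanisms should not be combined with the same id.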

You can also test whether this really is the issue by cloning the GitHub repository of "Learn CUDA Programming", browsing to Chapter01/01_cuda_introduction/01_hello_world/, building it with $ make and finally running it with $ ./hello_world. The makefile automatically compiles for multiple architectures/compute capabilities, so it should run without any issue!

Thanks for the explanation; I made some progress. However, after adding `cudaDeviceSynchronize();` below `functionA<<<1,1>>>();` and doing `CUDA_VISIBLE_DEVICES="1" ./my_app`, the program only outputs "main1" and "main2". It appears that functionA()'s printf is not being executed. — Thomas Contreras, Apr 23 '21 at 21:30
@ThomasContreras Can you run the Querying Device Properties example linked above and tell me which devices it detects? Then I can tell you which compute architecture you have to set for your GPU. — 2b-t, Apr 23 '21 at 21:33
Additionally, you can try the GitHub repository at the bottom to check whether this actually works with your set-up out of the box. — 2b-t, Apr 23 '21 at 21:33
It detects Tesla K20m. Please let me know if you need more information. I'll take a look at the Github. — Thomas Contreras, Apr 23 '21 at 21:36
@ThomasContreras Which id has the Tesla K20m got? The compute capability for the K20m is 35. — 2b-t, Apr 23 '21 at 21:40
So compile it with `nvcc -gencode=arch=compute_35,code=sm_35 -o my_app my_app.cu` and run it with `CUDA_VISIBLE_DEVICES="id" ./my_app` where `id` is the number outputted by the Querying Device Properties. — 2b-t, Apr 23 '21 at 21:41
Sorry, what exactly do you mean by id? From the Querying Device Properties, would it be the Device Number? — Thomas Contreras, Apr 23 '21 at 21:44
Yeah, with the Querying Device Properties example you will get a list of devices and their device ids. My K80 has two computational units, so I have two device ids. Your K20, on the other hand, should only have one. If it is `0` then you do not need to put `CUDA_VISIBLE_DEVICES="1"`; it should work without. — 2b-t, Apr 23 '21 at 21:48
