printf inside CUDA __global__ function

Question

I am currently writing a matrix multiplication kernel for a GPU and would like to debug my code. Since I cannot use printf inside a device function, is there something else I can do to see what is going on inside that function? This is my current function:

__global__ void MatrixMulKernel(Matrix Ad, Matrix Bd, Matrix Xd){

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int bx = blockIdx.x;
    int by = blockIdx.y;

    float sum = 0;

    for( int k = 0; k < Ad.width ; ++k){
        float Melement = Ad.elements[ty * Ad.width + k];
        float Nelement = Bd.elements[k * Bd.width + tx];
        sum += Melement * Nelement;
    }

    Xd.elements[ty * Xd.width + tx] = sum;
}

I would love to know whether Ad and Bd are what I think they are, and whether that function is actually being called.

score 78 · Accepted Answer · edited Oct 25 '17 at 13:37

CUDA now supports printf directly in kernels. For a formal description, see the "Formatted Output" section of the CUDA C Programming Guide (Appendix B.16 at the time of this edit).
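
As a minimal sketch of what this can look like for the question's use case, the kernel below prints the dimensions and contents of a matrix from a single thread. The `Matrix` layout, the `InspectKernel` name, and the `runInspect` helper are assumptions made for illustration, since the question does not show the struct definition.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed layout: the question does not show the real Matrix definition.
struct Matrix {
    int width;
    int height;
    float* elements;  // device pointer, row-major
};

// Hypothetical kernel: one thread reports what the kernel actually received.
__global__ void InspectKernel(Matrix Ad)
{
    // Restrict output to a single thread so the log stays readable.
    if (blockIdx.x == 0 && blockIdx.y == 0 && threadIdx.x == 0 && threadIdx.y == 0) {
        printf("kernel reached: Ad is %d x %d\n", Ad.height, Ad.width);
        for (int i = 0; i < Ad.width * Ad.height; ++i) {
            printf("Ad.elements[%d] = %f\n", i, Ad.elements[i]);
        }
    }
}

// Returns 0 on success, 1 on any CUDA error.
int runInspect()
{
    const int w = 2, h = 2;
    float host[w * h] = {1.0f, 2.0f, 3.0f, 4.0f};

    Matrix Ad;
    Ad.width = w;
    Ad.height = h;
    Ad.elements = nullptr;
    if (cudaMalloc(&Ad.elements, sizeof(host)) != cudaSuccess) return 1;
    cudaError_t err = cudaMemcpy(Ad.elements, host, sizeof(host), cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        InspectKernel<<<dim3(1, 1), dim3(w, h)>>>(Ad);
        // Device-side printf output is buffered and flushed at synchronisation points.
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }
    cudaFree(Ad.elements);
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    return runInspect();
}
```

If nothing is printed at all, checking the error returned by `cudaDeviceSynchronize` is usually the quickest way to find out whether the kernel launched.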

edited Oct 25 '17 at 13:37

psukys

answered Jul 05 '11 at 17:10

M. Tibbits

I think the link is not pointing to the right place anymore. Here is an alternate link: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#formatted-output – cyang Jan 28 '13 at 00:55
Note: "now" means compute capability 2.x or higher. – colgur Feb 22 '13 at 16:08
and you'll therefore need to pass an argument `-arch sm_20` or similar to `nvcc`, otherwise it will not compile in a `__global__` function. – Andre Holzner Aug 10 '13 at 20:41
The relevant section is B.17 now, not B.14. – Steven Lu Sep 14 '16 at 14:18
...and the relevant section now is B.16 – psukys Oct 25 '17 at 12:16
B.19 and counting :D – Hack06 Aug 10 '18 at 20:00
And now it's B.20! – Gumby The Green Sep 07 '19 at 02:53
B.32... Maybe just search for "Formatted Output"? – paleonix Oct 18 '22 at 13:54

Tom · Answer 2 · 2014-10-27T21:19:16.543

EDIT

To avoid misleading people: as M. Tibbits points out, printf is available on any GPU of compute capability 2.0 or higher.

END OF EDIT

You have choices:

Use a GPU debugger, i.e. cuda-gdb on Linux or Nexus on Windows
Use cuprintf, which is available for registered developers (sign up here)
Manually copy the data that you want to see, then dump that buffer on the host after your kernel has completed (remember to synchronise)
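
The third option can be sketched roughly as follows. The kernel, the `debug` buffer, and the `runDebugDump` helper are hypothetical names chosen for illustration; the idea is only that each thread records a value of interest in a device buffer, which the host copies back and prints after synchronising.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: besides its real work, each thread records the
// value it read into a separate debug buffer.
__global__ void SquareKernel(const float* in, float* out, float* debug, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];
        debug[i] = v;    // what this thread actually saw
        out[i] = v * v;  // the real result
    }
}

// Returns 0 if the dumped values match the input, 1 otherwise.
int runDebugDump()
{
    const int n = 8;
    const size_t bytes = n * sizeof(float);
    std::vector<float> hIn(n), hDebug(n, -1.0f);
    for (int i = 0; i < n; ++i) hIn[i] = float(i);

    float *dIn = nullptr, *dOut = nullptr, *dDebug = nullptr;
    cudaError_t err = cudaMalloc(&dIn, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dOut, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dDebug, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        SquareKernel<<<1, n>>>(dIn, dOut, dDebug, n);
        // Wait for the kernel to finish before reading the debug buffer.
        err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess) err = cudaMemcpy(hDebug.data(), dDebug, bytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    if (err == cudaSuccess) {
        for (int i = 0; i < n; ++i) {
            printf("debug[%d] = %f\n", i, hDebug[i]);
            if (hDebug[i] != hIn[i]) ++mismatches;
        }
    } else {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dDebug);
    return (err == cudaSuccess && mismatches == 0) ? 0 : 1;
}

int main()
{
    return runDebugDump();
}
```

Unlike device-side printf, this approach does not depend on the compute capability of the GPU, at the cost of an extra buffer and copy.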

Regarding your code snippet:

Consider passing the Matrix structs in via pointer (i.e. cudaMemcpy them to the device, then pass in the device pointer). Right now you will have no problem, but if the function signature gets very large you may hit the 256-byte limit.
Your reads from Ad are inefficient: each read into Melement results in a 32-byte memory transaction. Consider using shared memory as a staging area (cf. the transposeNew sample in the SDK).
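
One common way to use shared memory as a staging area for matrix multiplication is tiling. The sketch below is not the transposeNew sample and not the asker's kernel: it assumes square, row-major matrices passed as plain pointers, and `TILE`, `TiledMatMul`, and `runTiledMatMul` are names introduced here for illustration. It also derives row and column from blockIdx, which the kernel in the question reads but never uses.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16

// C = A * B for square n x n row-major matrices (an assumption for brevity).
__global__ void TiledMatMul(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Each thread stages one element of A and one of B; pad with zeros at the edges.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();  // everyone done reading before the next load
    }
    if (row < n && col < n) C[row * n + col] = sum;
}

// Returns 0 if the GPU result matches a CPU reference, 1 otherwise.
int runTiledMatMul()
{
    const int n = 33;  // deliberately not a multiple of TILE
    const size_t bytes = size_t(n) * n * sizeof(float);
    std::vector<float> hA(n * n), hB(n * n), hC(n * n);
    for (int i = 0; i < n * n; ++i) {
        hA[i] = 0.5f * float(i % 7);
        hB[i] = float(i % 5) - 2.0f;
    }

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaError_t err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        dim3 block(TILE, TILE);
        dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
        TiledMatMul<<<grid, block>>>(dA, dB, dC, n);
        // A blocking cudaMemcpy on the default stream runs after the kernel completes.
        err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Compare against a straightforward CPU reference.
    float maxErr = 0.0f;
    for (int r = 0; r < n; ++r) {
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            float diff = std::fabs(ref - hC[r * n + c]);
            if (diff > maxErr) maxErr = diff;
        }
    }
    printf("max abs error: %g\n", maxErr);
    return maxErr < 1e-3f ? 0 : 1;
}

int main()
{
    return runTiledMatMul();
}
```

With tiling, each element of A and B is loaded from global memory once per tile rather than once per thread that needs it, which is the main reason for staging through shared memory.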

score 4 · Answer 3 · answered Feb 09 '10 at 00:00

Use cuprintf
Try Nexus: http://developer.nvidia.com/object/nexus.html

By the way:

Use shared memory
Multiply outside of the loop
Look at this: http://www.seas.upenn.edu/~cis665/LECTURES/Lecture11.ppt

answered Feb 09 '10 at 00:00

Juan Leni

score 2 · Answer 4 · answered Oct 29 '13 at 19:47

See the "Formatted Output" section (currently B.17) of the CUDA C Programming Guide.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

answered Oct 29 '13 at 19:47

Andrei Pokrovsky
