
I have been exploring the field of parallel programming and have written basic kernels in CUDA and SYCL. I have encountered situations where I had to print from inside a kernel, and I noticed that std::cout does not work inside the kernel whereas printf does. For example, consider the following SYCL code. This works:

#include <CL/sycl.hpp>
#include <cstdio>

using namespace cl::sycl;

class dummyClass;  // kernel name

void print(float* A, size_t N) {
    buffer<float, 1> Buffer{A, {N}};
    queue Queue{gpu_selector{}};
    Queue.submit([&Buffer, N](handler& Handler) {
        auto accessor = Buffer.get_access<access::mode::read>(Handler);
        Handler.parallel_for<dummyClass>(range<1>{N}, [accessor](id<1> idx) {
            printf("%f", accessor[idx[0]]);
        });
    });
    Queue.wait();  // make sure the kernel (and its output) completes
}

whereas if I replace the printf with std::cout << accessor[idx[0]], it raises a compile-time error saying "Accessing non-const global variable is not allowed within SYCL device code". A similar thing happens with CUDA kernels. This got me thinking about what the difference between printf and std::cout is that causes this behavior.

Also, suppose I wanted to implement a custom print function to be called from the GPU - how should I do it?
TIA

Atharva Dubey

3 Answers


This got me thinking about what the difference between printf and std::cout is that causes this behavior.

Yes, there is a difference. The printf() which runs in your kernel is not the standard C library printf(). A different call is made, to an on-device function (the code of which is closed, if it even exists as CUDA C code). That function uses a hardware mechanism on NVIDIA GPUs - a buffer which kernel threads print into, which gets sent back over to the host side, and the CUDA driver then forwards it to the standard output file descriptor of the process which launched the kernel.
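
For concreteness, here is a minimal CUDA sketch of that mechanism from the host's point of view (it is not part of the question's code): the in-kernel printf() writes into the device-side print FIFO, whose size can be adjusted via cudaDeviceSetLimit(), and the output only shows up on the host's standard output once the buffer is flushed, e.g. at cudaDeviceSynchronize().

#include <cstdio>
#include <cuda_runtime.h>

__global__ void hello_kernel() {
    // Not the C library printf(): this compiles to a device-side call which
    // appends a record to the per-device print FIFO.
    printf("hello from thread %u of block %u\n", threadIdx.x, blockIdx.x);
}

int main() {
    // Optionally enlarge the device-side print FIFO before launching
    // kernels that print a lot.
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 8 * 1024 * 1024);

    hello_kernel<<<2, 4>>>();

    // The FIFO is flushed (and forwarded to this process' stdout) at
    // synchronization points such as this one.
    cudaDeviceSynchronize();
    return 0;
}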

std::cout does not get this sort of compiler-assisted replacement/hijacking - its code is simply irrelevant on the GPU.

A while ago, I implemented an std::cout-like mechanism for use in GPU kernels; see this answer of mine here on SO for more information and links. But I decided I don't really like it, and its compilation is rather expensive, so instead I adapted a printf()-family implementation for the GPU, which is now part of the cuda-kat library (development branch).

That means I've had to answer your second question for myself:

If I wanted to implement a custom print function to be called from the GPU, how should I do it?

Unless you have access to undisclosed NVIDIA internals, the only way to do this is to build on in-kernel printf() calls, rather than on the C standard library or system calls as you would on the host side. You essentially need to layer your entire stream implementation on top of that single low-level primitive I/O facility. It is far from trivial.
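
To illustrate what that layering looks like at its most basic, here is a hypothetical CUDA sketch: a tiny ostream-like wrapper whose operator<< overloads all bottom out in the in-kernel printf(). The name dev_ostream and everything about it are made up for this example; a usable implementation (such as the ones linked above) additionally needs per-thread line buffering, formatting state, width/precision handling and so on.

#include <cstdio>

// Hypothetical, minimal "ostream" usable inside device code. Every insertion
// simply forwards to the device-side printf() primitive.
struct dev_ostream {
    __device__ dev_ostream& operator<<(const char* s) { printf("%s", s); return *this; }
    __device__ dev_ostream& operator<<(int v)         { printf("%d", v); return *this; }
    __device__ dev_ostream& operator<<(float v)       { printf("%f", v); return *this; }
};

__global__ void demo(const float* data) {
    dev_ostream out;  // one per thread, no shared state
    out << "thread " << (int) threadIdx.x << ": " << data[threadIdx.x] << "\n";
}

Note that each operator<< above issues a separate printf() call, so output from different threads can interleave mid-line; that is exactly the kind of problem a real implementation has to solve, e.g. by buffering a whole line per thread and issuing a single printf() for it.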

einpoklum
  • To address why printf() works in the SYCL example code, I'd like to add that this is not standard SYCL, but some implementations happen to support printf() in kernels as an extension. For a cout-like mechanism, the SYCL standard provides the `stream` class, as Rod points out in his answer. – illuhad Feb 02 '21 at 17:03

In SYCL you cannot use std::cout for output from code that is not running on the host, for reasons similar to those listed in the answer above for CUDA code.

This means that if you are running kernel code on a "device" (e.g. a GPU), you need to use the stream class instead, as sketched below. There is more information about this in the SYCL developer guide, in the section called Logging.
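
As a rough sketch of what that looks like, here is the question's kernel rewritten to use the SYCL stream class (assuming a SYCL 1.2.1-style setup like the question's; the two buffer sizes passed to the stream constructor are arbitrary choices for this example):

#include <CL/sycl.hpp>
using namespace cl::sycl;

class streamPrintKernel;  // kernel name

void print(float* A, size_t N) {
    buffer<float, 1> Buffer{A, {N}};
    queue Queue{gpu_selector{}};
    Queue.submit([&Buffer, N](handler& Handler) {
        auto accessor = Buffer.get_access<access::mode::read>(Handler);
        // The stream is tied to the command group handler; the arguments are
        // the total buffer size and the maximum size of a single statement.
        stream Out(1024, 256, Handler);
        Handler.parallel_for<streamPrintKernel>(range<1>{N}, [=](id<1> idx) {
            Out << accessor[idx[0]] << "\n";
        });
    });
    Queue.wait();
}

Usage is close to std::cout, but unlike std::cout the buffered output is only printed once the kernel has finished executing.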

Rod Burns
  • Can you elaborate a little more regarding how the stream class is similar, or dissimilar, from an `std::ostream`? – einpoklum Feb 02 '21 at 17:31
  • The usage is pretty similar, i.e. you call mystream << "This is a line"; but the main difference is in the way the stream object is instantiated e.g. cl::sycl::stream os(1024, 128, handler); – Rod Burns Feb 04 '21 at 15:17

There is no __device__ version of std::cout, so only printf can be used in device code.
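
A minimal CUDA sketch of that point; the std::cout line is the one that would not compile if uncommented, because std::ostream::operator<< is a host-only function:

#include <cstdio>
#include <iostream>

__global__ void kernel(int x) {
    printf("x = %d\n", x);      // fine: maps to the device-side printf
    // std::cout << x << "\n";  // would not compile: calling a host-only
                                // function from device code is not allowed
}

int main() {
    kernel<<<1, 1>>>(42);
    cudaDeviceSynchronize();    // flush the device-side print buffer
    return 0;
}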

Oblivion