Least onerous way to implement generic formatted stream output in CUDA?

Question

I want to be able to write something close to:

std::cout << "Hello" << my_world_string << ", " << std::setprecision(5) << my_double << '\n';

in CUDA device-side code, for debugging templated functions - and for this kind of line of code to result in a single, unbroken, output line (i.e. the equivalent of a single CUDA printf() call - which typically doesn't get mangled with output from other threads).

Of course, that's not possible since there are no files or file descriptors in device-side code, nor is any of the std::ostream code usable in device-side code. Essentially what we have to work with is CUDA's hardware+software hack enabling printf()s. But it is obviously possible to get something like:

stream << "Hello" << my_world_string << ", " << foo::setprecision(5) << my_double << '\n';
stream.flush();

or:

stream << "Hello" << my_world_string << ", " << foo::setprecision(5) << my_double << '\n';
printf("%s", stream.str());

My question is: What should I implement which would allow me to write code as close to the above as possible, minimizing effort / amount of code to write?

Notes:

I used the identifier stream but it doesn't have to be a stream. Nor does the code need to look just like I laid it out. The point is for me to be able to have printing code in a templated device function.
All code will be written in C++11.
Code may assume compilation is performed either with C++11 or a later version of the standard.
I can use existing FOSS code, but only if its license is permissive, e.g. 3-BSD, CC-BY-SA, MIT - but not GPL.

einpoklum · Answer 1 · 2023-03-02T09:48:25.513

Currently, the way I'm thinking of implementing this is:

Implement an std::ostringstream-like class which can take its initial storage from elsewhere (on construction).
With such an object, you can then printf("%s\n", my_gpu_sstream.str()).
Allow the GPU-ostringstream to be constructed with a fixed-sized buffer.
Allow the GPU-ostringstream to allocate variable-size buffers using CUDA's device-side malloc().

and Bob's your uncle.

However, I would really rather avoid implementing a full-blown stringstream myself. Seems like a whole lot of redundant work and code.

Edit: I did actually implement something like this in my cuda-kat library. I used robhz786's strf library, which is a (header-only-if-you-like) string formatting library not based on standard streams. On its basis I implemented an on-device stringstream, kat::stringstream, and on the basis of that, a "printf'ing ostream" class.

However, it made compilation take so incredibly long as to be effectively useless: I would find myself just printf'ing instead, to avoid the endless waiting. So eventually I gave it up in favor of a slightly more flexible printf(). Maybe, at a later point, I will try to implement a simplified std::print().

By the sounds of it, all you really need to do is pick up `basic_stringbuf` as a base and override the `sync()` function to do the internal `printf` — Niall, Oct 16 '19 at 19:49
@Niall: Do you mean properly inherit from it, or copy-paste the code for `basic_stringbuf`? Remember that to run anything as CUDA device-side code, it needs to either be constexpr or decorated with `__device__`. — einpoklum, Oct 16 '19 at 20:04
I see the complication, can’t remember what all is constexpr if anything there, may not be worth the headache then. A stripped out string stream may be easier. — Niall, Oct 16 '19 at 20:09
