
I want to be able to write something close to:

std::cout << "Hello" << my_world_string << ", " << std::setprecision(5) << my_double << '\n';

in CUDA device-side code, for debugging templated functions - and for this kind of line of code to result in a single, unbroken, output line (i.e. the equivalent of a single CUDA printf() call - which typically doesn't get mangled with output from other threads).

Of course, that's not possible since there are no files or file descriptors in device-side code, nor is any of the std::ostream code usable in device-side code. Essentially what we have to work with is CUDA's hardware+software hack enabling printf()s. But it is obviously possible to get something like:

stream << "Hello" << my_world_string << ", " << foo::setprecision(5) << my_double << '\n';
stream.flush();

or:

stream << "Hello" << my_world_string << ", " << foo::setprecision(5) << my_double << '\n';
printf("%s", stream.str());

My question is: What should I implement which would allow me to write code as close to the above as possible, minimizing effort / amount of code to write?

Notes:

  • I used the identifier stream but it doesn't have to be a stream. Nor does the code need to look just like I laid it out. The point is for me to be able to have printing code in a templated device function.
  • All code will be written in C++11.
  • Code may assume compilation is performed either with C++11 or a later version of the standard.
  • I can use existing FOSS code, but only if its license is permissive, e.g. 3-BSD, CC-BY-SA, MIT - but not GPL.
einpoklum

1 Answer


Currently, the way I'm thinking of implementing this is:

  • Implement an std::ostringstream-like class which can take its initial storage from elsewhere (on construction).
  • With such an object, you can then printf("%s\n", my_gpu_sstream.str()).
  • Allow the GPU-ostringstream to be constructed with a fixed-sized buffer.
  • Allow the GPU-ostringstream to allocate variable-size buffers using CUDA's device-side malloc().

and Bob's your uncle.

However, I would really rather avoid implementing a full-blown stringstream myself. Seems like a whole lot of redundant work and code.

Edit: I did actually implement something like this in my cuda-kat library. I used robhz786's strf library, which is a (header-only-if-you-like) string formatting library not based on standard streams. On top of it I implemented an on-device stringstream, kat::stringstream, and on top of that, a "printf'ing ostream" class.

However, it made compilation take sooo incredibly long as to be effectively useless: I would find myself just printf'ing instead, to avoid the endless waiting. So eventually I gave it up in favor of a slightly more flexible printf(). Maybe, at a later point, I will try to implement a simplified std::print().

einpoklum
  • By the sounds of it, all you really need to do is pick up `basic_stringbuf` as a base and override the `sync()` function to do the internal `printf` – Niall Oct 16 '19 at 19:49
  • @Niall: Do you mean properly inherit from it, or copy-paste the code for `basic_stringbuf`? Remember that to run anything as CUDA device-side code, it needs to either be constexpr or decorated with `__device__`. – einpoklum Oct 16 '19 at 20:04
  • I see the complication, can’t remember what all is constexpr if anything there, may not be worth the headache then. A stripped out string stream may be easier. – Niall Oct 16 '19 at 20:09