
Consider the following program:

#include <cuda/api_wrappers.hpp>

namespace kernels {
template <typename T>
__global__ void print_stuff()
{
        printf("This is a plain printf() call.\n");
}
} // namespace kernels

int main()
{
        auto launch_config { cuda::make_launch_config(2,2) };
        cuda::launch(::kernels::print_stuff<int>, launch_config);
        cuda::outstanding_error::ensure_none();
}

(It uses the cuda-api-wrappers library.)

The program compiles and runs. However, if I run it in a terminal, it prints nothing, while if I run it via nvvp, the console shows me:

This is a plain printf() call.
This is a plain printf() call.
This is a plain printf() call.
This is a plain printf() call.

... as expected (2 blocks x 2 threads = 4 lines).

Why am I not getting the four lines printed on the terminal as well?

Notes:

  • I realize the fault may theoretically be with the library, of which I am the author. So "it has to be the library" is a legitimate answer, but you need to explain why it can't be anything else.
  • No warnings when compiling with nvcc -Xcompiler -Wall -Xcompiler -Wextra.
  • I use Devuan GNU/Linux 3 (beowulf; equivalent of Debian Buster).
  • My hardware: An AMD64 Intel CPU; a GTX 1050 Ti card.
  • NVIDIA driver version: 430.50; CUDA version: 10.1.105.
  • cuda-memcheck does not complain about the program.
einpoklum
  • put a `cudaDeviceSynchronize()` at the end of your host code – Robert Crovella Oct 23 '19 at 21:54
  • @RobertCrovella: You mean, `cuda::device::current::get().synchronize()` :-) ... and yes, that works. – einpoklum Oct 23 '19 at 21:57
  • well, that is why you're not getting printout. Everything from your kernel launch to the end of your host is asynchronous. Therefore it all runs without forcing the kernel to complete. By the time you get to the closing curly-brace of main, the kernel has not completed and gotten a chance to dump its output into the host console pipe. The output pipe gets disconnected from your app at app teardown time. Therefore nothing gets printed out. And yes, the behavior can be different if you are running it under a tool. – Robert Crovella Oct 23 '19 at 22:02
  • @RobertCrovella: Made that into an answer. – einpoklum Oct 23 '19 at 22:04
  • I'm sure this is a duplicate of other questions here on the `cuda` tag. – Robert Crovella Oct 23 '19 at 22:05
  • @RobertCrovella: That's possible, I suppose. Actually, when I first created this question I thought it was a doctest instrumentation issue. – einpoklum Oct 23 '19 at 22:06
  • @tera: The underlying problem is the same, but the question isn't, because it's about the different "symptoms" running stand-alone and in a profiler. Also it's a question about whether the wrapper library is at fault. So IMHO not quite a dupe. – einpoklum Oct 24 '19 at 06:52

1 Answer


You are implicitly, and mistakenly, assuming a certain order of events by the time main() is done. Specifically, you're assuming that because the default stream is synchronous, everything having to do with your kernel is over and done with by the time the next line of host code after the kernel launch gets executed. That is not the case, as @RobertCrovella points out: the kernel launch is asynchronous with respect to the host, and it's not guaranteed that the device's printf() buffer will be ferried back into host memory and dumped into the standard output stream before control returns to your program, or even before main() returns.

You will need to synchronize the (default, current) CUDA device with the host, i.e. execute:

cuda::device::current::get().synchronize();

or at least synchronize the device's default stream:

cuda::device::current::get().default_stream().synchronize();

and this will ensure the printf() results make it to standard output.
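
For readers not using the wrapper library, here is a minimal sketch of the same fix with the plain CUDA runtime API. It assumes that `cudaDeviceSynchronize()` is the raw-API counterpart of the wrapper call above (as the comment under the question suggests); the error-handling style is just one illustrative choice.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void print_stuff()
{
    printf("This is a plain printf() call.\n");
}

int main()
{
    // The launch is asynchronous: control returns to the host right away,
    // possibly before the kernel has even started running.
    print_stuff<<<2, 2>>>();

    // Catches errors in the launch itself (e.g. a bad configuration).
    cudaError_t status = cudaGetLastError();
    if (status != cudaSuccess) {
        fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(status));
        return EXIT_FAILURE;
    }

    // Blocks until the kernel has finished. Synchronization is also one of
    // the points at which the device-side printf() buffer is flushed to the
    // host's standard output.
    status = cudaDeviceSynchronize();
    if (status != cudaSuccess) {
        fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(status));
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}
```

Without the `cudaDeviceSynchronize()` call, this version can show the same symptom as the program in the question: main() may return before the printf() buffer has been flushed.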

Now, nvvp instruments your execution in some way (probably just by having the profiler running; nvprof, for its part, instruments the execution by hooking the CUDA runtime API calls). So the behavior can be different when you run your program that way.


Somewhat-related question: The behavior of stream 0 (default) and other streams.

einpoklum