
I have a kernel that runs a semi-infinite loop (a while loop that runs as long as a global sentinel value is true) in the background. Here's something close to a minimal example, with some lines and error-checking omitted for brevity. Assume the device supports the maximal set of asynchronous capabilities (engine count, etc.) that exists on any current CUDA-enabled device:

int main() {
  [Create streams, declare variables, etc.]
  ...
  cudaMalloc(&running_d, sizeof(bool));      // Global device memory
  cudaMalloc(&output_d, sizeof(bool));       // Global device memory
  cudaMallocHost(&running_h, sizeof(bool));  // Pinned host memory
  cudaMallocHost(&output_h, sizeof(bool));   // Pinned host memory

  // Set sentinel true
  *running_h = true;
  cudaMemcpy(running_d, running_h, sizeof(bool), cudaMemcpyHostToDevice);

  // Launch kernel
  kernel<<<1, 1, 0, stream1>>>(running_d, output_d);

  // Copy memory concurrently with kernel execution until output is false
  *output_h = true;
  while (*output_h) {
    cudaMemcpyAsync(output_h, output_d, sizeof(bool), cudaMemcpyDeviceToHost, stream2);
    cudaStreamSynchronize(stream2);
  }

  return 0;
}

__global__
void kernel(bool* running_ptr, bool* output_ptr) {
  while (*running_ptr) {
    *output_ptr = liveFunction();  // Some function which eventually will always return false
  }
}

My question effectively amounts to whether the host loop will ever exit. Under a naive model of the system, the writes from *output_ptr = liveFunction() would eventually become visible to the host, and the host loop would exit.

However, in my testing (CUDA 12, RTX 4090), the call to cudaMemcpyAsync(output_h, output_d, sizeof(bool), cudaMemcpyDeviceToHost, stream2) executes asynchronously while the kernel is running, but ends up copying a stale value of true, even after I have confirmed with a device-side printf("%d\n", *output_ptr) that the value has been set to false. This persists seemingly forever (i.e. for at least several minutes). If I stop the kernel by concurrently setting running_d to false from the host, the kernel exits and the host loop then copies the updated value of false and exits.
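For reference, the concurrent stop described above looks roughly like this (a sketch reusing the names from the snippet; the copy is issued on stream2 so it can overlap the kernel running on stream1):

// Write false into the pinned host sentinel, then push it to the device.
*running_h = false;
cudaMemcpyAsync(running_d, running_h, sizeof(bool), cudaMemcpyHostToDevice, stream2);
cudaStreamSynchronize(stream2);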

I have also tried using atomicCAS to set output_ptr rather than the assignment operator, but I still get the same result. I'm not sure where the false value written through output_ptr is being stored, as the atomicCAS seemingly makes the new value visible to the other device threads without making it visible to the host.
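For completeness, the atomicCAS variant looks roughly like this (a sketch; it assumes the output is stored as an int, since atomicCAS has no bool overload, and liveFunction() is the same placeholder as above):

__global__
void kernel(bool* running_ptr, int* output_ptr) {
  while (*running_ptr) {
    int next = liveFunction() ? 1 : 0;      // value to publish
    atomicCAS(output_ptr, 1 - next, next);  // swaps only when the stored value differs from next
  }
}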

As for the purpose of the infinite kernel loop: I am well aware that this is generally considered something to avoid in device code, but the kernel is acting as an infinite generator from which I want to pull intermediate results and potentially allow user intervention. There are two advantages to the infinite loop that, to my knowledge, cannot be obtained any other way:

  • Local kernel variables can be kept in registers. If the kernel were continually stopped and relaunched, the local data would need to be copied to and from global memory at every relaunch.
  • Warps can diverge (from each other, while keeping their lanes coherent) without consequence. If the kernel were stopped and relaunched, each stop would create a period of waiting on the most delayed warps to finish, potentially at low occupancy.

I am aware of an answer from 2013 that claims there were, at the time, no guarantees about memory coherence, and another answer which recommended this sentinel mechanism, albeit solely for the purpose of tricking the compiler.

By replacing the host code with something that uses unified memory and atomics:

#include <cuda/atomic>  // cuda::atomic, cuda::thread_scope_system

cuda::atomic<bool, cuda::thread_scope_system>* running_ptr;
cuda::atomic<bool, cuda::thread_scope_system>* output_ptr;
cudaMallocManaged(&running_ptr, sizeof(cuda::atomic<bool, cuda::thread_scope_system>));
cudaMallocManaged(&output_ptr, sizeof(cuda::atomic<bool, cuda::thread_scope_system>));
*running_ptr = true;  // atomic store via operator=
*output_ptr = true;
kernel<<<1, 1, 0, stream1>>>(running_ptr, output_ptr);
while (*output_ptr);  // spin on an atomic load until the kernel publishes false

I am able to get the desired result, although I am unsure whether this is guaranteed behavior, and whether the atomics are necessary or this is just a quirk of unified memory.
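A matching kernel, for reference, would look roughly like this (a sketch under the same assumptions; liveFunction() is the placeholder from above, the cuda::atomic types come from the <cuda/atomic> include shown in the snippet, and the loads/stores use the default sequentially consistent ordering):

__global__
void kernel(cuda::atomic<bool, cuda::thread_scope_system>* running_ptr,
            cuda::atomic<bool, cuda::thread_scope_system>* output_ptr) {
  while (running_ptr->load()) {         // system-scope atomic load of the sentinel
    output_ptr->store(liveFunction());  // system-scope atomic store, visible to the host
  }
}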

  • What you're trying to do is cross-thread communication. The compiler is free to assume that no such thing happens and probably optimizes a whole lot of stuff away. You either need [`cuda::atomic`](https://nvidia.github.io/libcudacxx/extended_api/synchronization_primitives/atomic.html) or plain old `volatile`. – Homer512 Jul 29 '23 at 17:18
  • Using `volatile`, I still get the same behavior, although I'm having to cast `(void*)output_ptr` in the `cudaMemcpyAsync` to get proper compilation. – Hello Jul 29 '23 at 18:25
  • (In reference to a deleted comment) I had qualified both `output_d` (host) and `output_ptr` (kernel) as `volatile`. Also had a typo in my previous comment, the cast was `(void*)output_d` (since the `cudaMemcpyAsync` was on the host). – Hello Jul 29 '23 at 20:02
  • There are C++ synchronization primitives for this kind of thing nowadays. E.g. [`cuda::binary_semaphore`](https://nvidia.github.io/libcudacxx/extended_api/synchronization_primitives/binary_semaphore.html) with [`cuda::thread_scope_system`](https://nvidia.github.io/libcudacxx/extended_api/memory_model.html#thread-scopes). Note the [restrictions](https://nvidia.github.io/libcudacxx/extended_api/synchronization_primitives/binary_semaphore.html#concurrency-restrictions) though. – paleonix Jul 29 '23 at 20:03
  • Does `binary_semaphore` affect the memory model? This seems like less of a race condition (since the problem persists for a very long time even after the device should only be writing `false` to `output_ptr`) and more of a host/device memory coherence issue (which I have struggled to find good documentation about). – Hello Jul 29 '23 at 20:07
  • Is the suggestion with `binary_semaphore` that I would use the semaphore itself as the sentinel value, and the fact that it has `thread_scope_system` adds some guarantee of host/device coherence that doesn't exist with typical global memory? – Hello Jul 29 '23 at 20:11
  • The suggestion by Homer512 to use `cuda::atomic` with the suggestion by paleonix to use `cuda::thread_scope_system` works in my tests, although I'm still worried about the large body of older answers and documentation that imply host/device coherence is undefined. – Hello Jul 29 '23 at 20:48
  • To quote [cppreference.com](https://en.cppreference.com/w/cpp/thread/counting_semaphore): "Semaphores are also often used for the semantics of signaling/notifying rather than mutual exclusion, by initializing the semaphore with ​0​ and thus blocking the receiver(s) that try to `acquire()`, until the notifier 'signals' by invoking `release(n)`. In this respect semaphores can be considered alternatives to `std::condition_variables`, often with better performance." See also the example given on cppreference.com – paleonix Jul 29 '23 at 20:49
  • As Nvidia gives explicit instructions about using their semaphores with `cuda::thread_scope_system`, i.e. for synchronization between host and device, I would assume that it will work (inside the given restrictions). I'm not quite sure how you combined it with atomics though. Maybe add the working version to your question? – paleonix Jul 29 '23 at 20:54
  • I've added a code snippet, it also uses unified memory (as required by `cuda::atomic`) which may be the actual thing that fixed the problem. Still not sure about whether it's guaranteed though. – Hello Jul 29 '23 at 23:35
  • I would expect there to be performance penalties for busy waiting on that atomic bool on the host. A semaphore variant might internally implement something more efficient like exponential backoff. – paleonix Jul 30 '23 at 08:46
  • [Unified memory data usage hints](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#data-usage-hints) might also improve performance. – paleonix Jul 30 '23 at 08:59

0 Answers