I have a kernel that runs a semi-infinite loop (while loop which runs as long as a global sentinel value is true) in the background. Here's something close to a minimal example, with some lines and error-checking omitted for brevity. Assume the device supports the maximal set of asynchronous behavior (engine count, etc.) which exists on any current CUDA-enabled device:
int main() {
[Create streams, declare variables, etc.]
...
cudaMalloc(&running_d, sizeof(bool)); // Global device memory
cudaMalloc(&output_d, sizeof(bool)); // Global device memory
cudaMallocHost(&running_h, sizeof(bool)); // Pinned host memory
cudaMallocHost(&output_h, sizeof(bool)); // Pinned host memory
// Set sentinel true
bool running_h = true;
cudaMemcpy(running_d, running_h, sizeof(bool), cudaMemcpyHostToDevice);
// Launch kernel
kernel<<<1, 1, 0, stream1>>>(running_d, output_d);
// Copy memory concurrently with kernel execution until output is false
*output_h = true;
while (*output_h) {
cudaMemcpyAsync(output_h, output_d, sizeof(bool), cudaMemcpyDeviceToHost, stream2);
cudaStreamSynchronize(stream2);
}
return 0;
}
__global__
void kernel(bool* running_ptr, bool* output_ptr) {
while (*running_ptr) {
*output_ptr = liveFunction(); // Some function which eventually will always return false
}
}
My question effectively amounts to whether the host loop will ever be exited. In a naively assumed model of the system, the writes from *output_ptr = liveFunction()
would eventually be visible to the host, and the host loop would exit.
However, in my testing (CUDA 12, RTX 4090), the call to cudaMemcpyAsync(output_h, output_d, sizeof(bool), cudaMemcpyDeviceToHost, stream2)
executes asynchronously while the kernel is running, but ends up coping a stale value of true
, even after I have confirmed with a device-side printf("%d\n", *output_ptr)
that the value has been set to false
. This persists seemingly forever (i.e. for at least several minutes). If I stop the kernel by concurrently setting running_d
to false
from the host, the kernel exits and the host loop then copies the updated value of false
and exits.
I have also tried using atomicCAS
to set output_ptr
rather than the assignment operator, but I still get the same result. I'm not sure where the output_ptr
value of false
is being stored, as the atomicCAS
seemingly implies that the new value is visible to the other device threads without being visible to the host.
As far as the purpose for the infinite kernel loop, as I am well aware this is considered something to avoid in device code, the kernel is acting as an infinite generator from which I want to pull intermediate results and potentially have user intervention. There are two advantages I am aware of to using the infinite loop that to my knowledge cannot be obtained any other way:
- Local kernel variables can be kept in registers. If the kernel was being continually stopped and relaunched, the local data would need to be copied to and from global memory every time the kernel was stopped and relaunched.
- Warps can diverge (from each other, while keeping their lanes coherent) without consequence. If the kernel was being stopped and relaunched, each stop would create a period of waiting on the most delayed warps to finish, potentially at low occupancy.
I am aware of an answer from 2013 that claims there were, at the time, no guarantees about memory coherence, and another answer which recommended this sentinel mechanism, albeit solely for the purpose of tricking the compiler.
By replacing the host code with something that uses unified memory and atomics:
cuda::atomic<bool, cuda::thread_scope_system>* running_ptr;
cuda::atomic<bool, cuda::thread_scope_system>* output_ptr;
cudaMallocManaged(&running_ptr, sizeof(cuda::atomic<bool, cuda::thread_scope_system>));
cudaMallocManaged(&output_ptr, sizeof(cuda::atomic<bool, cuda::thread_scope_system>));
*running_ptr = true;
*output_ptr = true;
kernel<<<1, 1, 0, stream1>>>(running_ptr, output_ptr);
while (*output_ptr);
I am able to get the desired result, although I am unsure of whether this is guaranteed behavior and whether the atomics are necessary or if this is a quirk of unified memory.