How to debug error code 77: cudaErrorIllegalAddress when cuda-memcheck finds no issues?

Question

I've got a complex CUDA C++ application that runs fine until I build the CUDA code with `-g -G`. When built for debug, it crashes with `cudaErrorIllegalAddress`, so I ran `cuda-memcheck`, but that found no issues. I also ran valgrind, which found no issues either.

When `cuda-memcheck` doesn't find anything, what are some strategies to figure out where the illegal memory read or write is occurring in the CUDA code? Is there a way with `cuda-gdb` to get to the line of code with the invalid memory access?

If you are getting `cudaErrorIllegalAddress` and `cuda-memcheck` doesn't report anything, I would say that is a bug in `cuda-memcheck`. I would recommend filing a bug report at developer.nvidia.com with a complete demonstration of how to observe that. — Robert Crovella, Apr 26 '19 at 19:45
@RobertCrovella I think my bug was a race condition, so the invalid memory access may have gone away when run with cuda-memcheck due to a timing change in the code. — WilliamKF, Apr 29 '19 at 19:08
Perhaps, if the access pattern was data-driven and the race condition exposed data that would otherwise not be seen. That would be a fairly unusual combination, but possible I suppose. In that case I would expect the illegal address error report to also disappear when run under `cuda-memcheck`. So I will revise my previous statement: if you run a code under `cuda-memcheck` and you get an illegal address error but no identification of the actual kernel fault, I would consider filing a bug for that. — Robert Crovella, Apr 29 '19 at 19:12
Also, errors disappearing when run under `cuda-memcheck` are generally an indication of race conditions. Although it doesn't trap every kind of race condition, `cuda-memcheck` has a subtool that can report shared memory race conditions. — Robert Crovella, Apr 29 '19 at 19:22
@RobertCrovella That's what I had here, a `__shared__` memory race condition that was fixed with `__syncthreads()`. — WilliamKF, Apr 29 '19 at 19:54
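
To illustrate the kind of bug described in the comments above, here is a minimal sketch of a data-driven `__shared__` memory race. It is not the asker's code: the kernel `gatherViaNeighbor`, the host helper `gatherMatchesHost`, and the block size `kBlockSize` are all hypothetical names chosen for this example. Each thread publishes an index in shared memory and then uses the index published by a neighbouring thread to read global memory. Without the barrier, a thread can read a slot that has not been written yet and use garbage as an index, which is one way a race can surface as an illegal address.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // hypothetical block size for this sketch

// Each thread publishes the global index it owns, then reads global memory
// through the index published by the next thread in the block.
// Assumes blockDim.x == kBlockSize and n is a multiple of kBlockSize.
__global__ void gatherViaNeighbor(const int* data, int* out, int n)
{
    __shared__ int offsets[kBlockSize];
    const int lane = threadIdx.x;
    const int base = blockIdx.x * blockDim.x;

    offsets[lane] = base + lane;

    // Removing this barrier introduces the race: a thread may read a slot
    // that its neighbour (possibly in another warp) has not written yet.
    __syncthreads();

    const int idx = offsets[(lane + 1) % blockDim.x];
    if (base + lane < n) {
        out[base + lane] = data[idx];
    }
}

// Hypothetical host helper: runs the kernel and compares against a CPU result.
bool gatherMatchesHost(int numBlocks)
{
    const int n = numBlocks * kBlockSize;
    const size_t bytes = static_cast<size_t>(n) * sizeof(int);
    std::vector<int> host(n), result(n, -1);
    for (int i = 0; i < n; ++i) host[i] = 3 * i + 1;

    int* dIn = nullptr;
    int* dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice);
    gatherViaNeighbor<<<numBlocks, kBlockSize>>>(dIn, dOut, n);
    const cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess) {
        cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        const int block = i / kBlockSize;
        const int lane = i % kBlockSize;
        if (result[i] != host[block * kBlockSize + (lane + 1) % kBlockSize]) return false;
    }
    return true;
}
```

Calling `gatherMatchesHost(4)` from a `main` function exercises the kernel. The shared memory race subtool mentioned above is selected with `cuda-memcheck --tool racecheck`, which may flag this kind of hazard if the barrier is removed.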

score 1 · Answer 1 · answered Apr 26 '19 at 19:31

I ran my application under `cuda-gdb` and the debugger stopped at the invalid memory access:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x50225260 (gpu_core.h:275)

Thread 1 "preprocess" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 617, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 2, lane 0]
0x0000000050225270 in thrust::cuda_cub::cub::BlockRadixSort<int, 256, 19, int, 6, true, (thrust::cuda_cub::cub::BlockScanAlgorithm)2, (cudaSharedMemConfig)1, 1, 1, 700>::BlockRadixSort (this=0x0, this=0x0, temp_storage=0xffffb17cffff159c, keys=0x1300001d1300001d, values=0x1300001d1300001d, begin_bit=<optimized out>, end_bit=<optimized out>, is_descending=..., is_keys_only=..., num_items=<optimized out>) at /home/user/git/infra/libgpu/src/gpu_core.h:275
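
When a debugger session is not practical, a complementary way to narrow down which launch faults is to check the error state and synchronize after each kernel launch, so that an asynchronous error is reported next to the kernel that caused it. The following is only a sketch of that idea: `cudaOk`, `CUDA_OK`, `fillKernel`, and `launchAndCheck` are hypothetical names and not part of the application above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true on cudaSuccess, otherwise prints where
// the failing call was made and the runtime's error string.
inline bool cudaOk(cudaError_t err, const char* what, const char* file, int line)
{
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s:%d: %s failed: %s\n",
                 file, line, what, cudaGetErrorString(err));
    return false;
}
#define CUDA_OK(call) cudaOk((call), #call, __FILE__, __LINE__)

// Trivial stand-in kernel; a real application would launch its own kernels.
__global__ void fillKernel(int* out, int n, int value)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Launches the kernel and checks both launch-time and execution-time errors.
bool launchAndCheck(int n)
{
    int* dOut = nullptr;
    if (!CUDA_OK(cudaMalloc(&dOut, static_cast<size_t>(n) * sizeof(int)))) return false;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    fillKernel<<<blocks, threads>>>(dOut, n, 7);

    // cudaGetLastError reports launch failures; cudaDeviceSynchronize waits
    // for the kernel and reports errors raised while it was running.
    const bool ok = CUDA_OK(cudaGetLastError()) && CUDA_OK(cudaDeviceSynchronize());

    CUDA_OK(cudaFree(dOut));
    return ok;
}
```

Synchronizing after every launch changes timing, so for a race condition like the one discussed in the comments it may hide the failure rather than locate it. It is best treated as a debug-build aid.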
