
Anyone know likely avenues of investigation for kernel launch failures that disappear when run under cuda-gdb? Memory assignments are within spec, launches fail on the same run of the same kernel every time, and (so far) it hasn't failed within the debugger.

Oh Great SO Gurus, What now?

Framester
Bolster
  • Back in emu days, I ran it through valgrind in emu mode sometimes to find bugs. I also resorted to kernel printfs recently to solve one of these. – jmilloy Apr 20 '11 at 21:15

2 Answers


cuda-gdb spills all shared memory and registers to local memory. So when something runs OK when built for debugging and fails otherwise, it usually means an out-of-bounds shared memory access. cuda-memcheck might help, depending on what sort of card you are using; Fermi is better than older cards in that respect.
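As an illustration of that pattern (a minimal, made-up kernel, not code from the question): a 64-element shared array where the last thread writes one element past the end. Built with -G for cuda-gdb, the spilled copy in local memory may mask the overrun; built normally, it can corrupt neighbouring shared memory or kill the launch.

```cuda
// Hypothetical illustration of an out-of-bounds shared memory write;
// not code from the original question.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void oob_shared(float *out)
{
    __shared__ float tile[64];
    int tid = threadIdx.x;

    if (tid == 0)
        tile[0] = 0.0f;
    tile[tid + 1] = (float)tid;   // BUG: thread 63 writes tile[64]
    __syncthreads();

    out[tid] = tile[tid];
}

int main()
{
    float *d_out;
    cudaMalloc((void **)&d_out, 64 * sizeof(float));

    oob_shared<<<1, 64>>>(d_out);

    // Check both the launch itself and the kernel's execution.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```

Running a normal (non `-G`) build under `cuda-memcheck ./a.out` should flag the invalid shared write, subject to the hardware caveat above.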

EDIT: Casting my mind back to the bad old days, I remember having an ornery GT9500 which used to throw similar NV13 errors and have random code failures when running very memory intensive kernels with a lot of shared memory activity. Never when debugging. I put it down to bad hardware and moved on to a GT200, never to see a similar error since. One possibility might be bad hardware. Is this a G92 (9800GT or similar)?

talonmies
  • Cuda-memcheck also appears to 'fix' the problem. Note: this is with pycuda.debug imported, and the documentation doesn't say anything about it (other than "Use It"). – Bolster Apr 20 '11 at 20:45
  • pycuda.debug is really just a wrapper script which sets driver.set_debugging(), making all of the JIT compilation happen with the right flags to work with cuda-gdb. I am not sure what cuda-memcheck will do on an older card with shared memory. There might not be the hardware support to do all of the shared memory diagnostics that it can do on Fermi. – talonmies Apr 21 '11 at 11:22
  • @talonmies, what is the `NV13` error? Searching the web did not bring up anything useful. – Framester Feb 08 '12 at 15:27
  • @Framester: it is a class of Linux driver error, reported to the kernel ring buffer. – talonmies Feb 08 '12 at 15:28

cuda-gdb can make some of the CUDA operations synchronous, which changes where and when a launch failure gets reported.

  • Are you reading from memory after it has been initialized?
  • Are you using streams?
  • Are you launching more than one kernel?
  • Where and how does it fail? (A per-launch error-checking sketch follows this list.)
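A minimal sketch of that per-launch checking, assuming the CUDA runtime API (the macro and kernel names are made up for illustration): calling cudaGetLastError() and synchronizing after every launch makes an asynchronous failure surface at the kernel that caused it, rather than at some later, unrelated call.

```cuda
// Hypothetical helper for narrowing down which launch fails;
// not from the original answer.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(label)                                                     \
    do {                                                                      \
        cudaError_t e = cudaGetLastError();                                   \
        if (e == cudaSuccess) e = cudaDeviceSynchronize();                    \
        if (e != cudaSuccess) {                                               \
            fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(e)); \
            exit(1);                                                          \
        }                                                                     \
    } while (0)

__global__ void kernel_a(int *p) { p[threadIdx.x] = threadIdx.x; }
__global__ void kernel_b(int *p) { p[threadIdx.x] *= 2; }

int main()
{
    int *d;
    cudaMalloc((void **)&d, 32 * sizeof(int));

    kernel_a<<<1, 32>>>(d);
    CUDA_CHECK("kernel_a");   // reports here if kernel_a is the culprit

    kernel_b<<<1, 32>>>(d);
    CUDA_CHECK("kernel_b");

    cudaFree(d);
    return 0;
}
```

If streams are involved, cudaDeviceSynchronize() waits on all of them, which is blunt but useful for pinning down which launch is actually failing.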
fabrizioM
  • More detailed question just for you @fabrizioM http://stackoverflow.com/questions/5827219/pycuda-cuda-causes-of-non-deterministic-launch-failures – Bolster Apr 29 '11 at 09:42