I'm pretty new to CUDA and flying a bit by the seat of my pants here...

I'm trying to debug my CUDA program on a remote machine where I don't have admin rights. I compile the program with `nvcc -g -G` and then try to debug it with cuda-gdb. However, as soon as gdb hits a call to a kernel (it doesn't even have to step into it, and the problem doesn't occur in host code), I get:

(cuda-gdb) run
Starting program: /path/to/my/binary/cuda_clustered_tree 
[Thread debugging using libthread_db enabled]

[1]+  Stopped                 cuda-gdb cuda_clustered_tree

cuda-gdb then dumps me back to my terminal. If I try to run cuda-gdb again, I get:

An instance of cuda-gdb (pid 4065) is already using device 0. If you believe
you are seeing this message in error, try deleting /tmp/cuda-dbg/cuda-gdb.lock.

The only way to recover is to `kill -9` both cuda-gdb and cuda_clustered_ (I assume the latter is my binary, with its name cut off at 15 characters in the process list).
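
The recovery steps above can be sketched as a small shell helper. This is only a rough sketch, not an NVIDIA-provided procedure: `clear_stale_lock` is a hypothetical name, the lock path in the usage comment is the one quoted in the error message, and the sketch assumes `pgrep` is available on the machine.

```shell
# Hypothetical helper: remove a debugger lock file only when no process
# with the given name (exact match, owned by the current user) is alive.
# $1 = lock file path, $2 = process name to look for
clear_stale_lock() {
    lock=$1
    name=$2
    if pgrep -x -u "$(id -u)" "$name" >/dev/null 2>&1; then
        echo "still running: $name (kill it first)" >&2
        return 1
    fi
    rm -f -- "$lock"
}

# Intended use, after killing the stuck debugger and the debugged program:
#   kill -9 <pid of cuda-gdb> <pid of cuda_clustered_>
#   clear_stale_lock /tmp/cuda-dbg/cuda-gdb.lock cuda-gdb
```

Checking for a live process first avoids deleting the lock out from under a debugger session that is still holding the device.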

This machine has two GPUs, is running CUDA 4.1 (I believe; several versions were installed, but that's the one my PATH and LD_LIBRARY_PATH point to), and compiles and runs deviceQuery and bandwidthTest fine.

I can provide more info if need be. I've searched everywhere I could online and found nothing that helps with this.

valiano
int3h
  • Try issuing the `fg` command in the terminal after getting the `[1]+ Stopped` message. It looks like the debugger process has been suspended for some reason (to suspend a process yourself, press `Ctrl-Z`); `fg` would resume it and bring it to the foreground. – aland May 06 '12 at 18:53
  • No go. It just echoes the lines `cuda-gdb cuda_clustered_tree [New Thread 0x7ffff1c09700 (LWP 6001)]` and `[Context Create of context 0x692010 on Device 0]`. It seems otherwise dead: I can type text, but nothing at all happens when I hit enter. Ctrl+C and Ctrl+D don't do anything either. Only Ctrl+Z does anything: it sends the process to the background and leaves it a zombie like before. – int3h May 06 '12 at 20:06
  • Try debugging some of the SDK programs. If `cuda-gdb` works fine for them, then this is just a random bug that you should report to the NVIDIA developers. In my experience, it's not rare for `cuda-gdb` to crash with a segfault and the like... – aland May 07 '12 at 18:31
  • Hmm, no, it seems to crash with the demo apps as well. Not sure what that indicates. – int3h May 08 '12 at 04:27

1 Answer

Figured it out! Turns out, cuda-gdb hates csh.

If you are running csh, cuda-gdb exhibits the anomalous behavior described above. Even when I ran bash from within csh and then ran cuda-gdb, I still saw it. Your shell needs to be started as bash, and only bash.

On this machine, the default shell was csh, but I use bash. I wasn't allowed to change it directly, so I had added `exec /bin/bash --login` to my .login script.

So even though I was running bash, cuda-gdb still misbehaved because bash had been started by csh. Removing the `exec` command, so that I was running csh directly with nothing on top, didn't help either.

In the end, I had to get IT to change my shell to bash directly (after much patient troubleshooting on their part). Now it works as intended.
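
For anyone checking whether they are in the same situation, here is a rough sketch of how to see which shell the account is really configured with, as opposed to the one currently at the prompt. `login_shell_of` is a hypothetical helper name, and the sketch assumes `getent` and `ps` exist on the machine. Stock GDB launches the debugged program through the shell named in `$SHELL`; whether that is the mechanism behind the cuda-gdb hang here is an assumption I have not confirmed.

```shell
# Hypothetical helper: print the login shell (7th field) for a user,
# reading passwd-format lines from stdin.
login_shell_of() {
    awk -F: -v user="$1" '$1 == user { print $7 }'
}

me=$(id -un)

# Login shell as recorded in the account database.
getent passwd "$me" 2>/dev/null | login_shell_of "$me"

# bash keeps an inherited SHELL value, so this can still name csh
# even when the prompt is bash.
echo "SHELL=${SHELL:-unset}"

# The shell actually running this script or session.
ps -o comm= -p $$ 2>/dev/null
```

If the first two lines still report csh while the last one reports bash, the setup matches the `exec /bin/bash --login` arrangement described above.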

int3h