CUDA debugging procedure for non-deterministic output

Question

I'm debugging my CUDA 4.0/Thrust-based image reconstruction code on an Ubuntu 10.10 64-bit system, and I'm trying to track down a run-time error in which my output images appear to contain random "noise." There is no random number generation in my code, so I expect the output to be consistent between runs, even if it's wrong. However, it's not...

I was wondering if anyone has a general procedure for debugging CUDA runtime errors such as these. I'm not using any shared memory in my CUDA kernels. I've taken pains to avoid any race conditions involving global memory, but I could have missed something.

I've tried using GPU Ocelot, but it has problems recognizing some of my CUDA and CUSPARSE function calls.

Also, my code generally works. I only get these non-deterministic results when I change one particular setting. I've checked all the code associated with that setting, but I can't figure out what I'm doing wrong. If I can distill it to something small enough to post here, I might do that, but at this point it's too complicated.

Completely off-topic, have you managed to get Ocelot working with Thrust, and if yes, how? :-) – Kerrek SB Jul 20 '11 at 22:11

Steve Fallows · Accepted Answer · 2011-07-25T18:46:10.670

2

Are you sure all of your kernels have proper block size/remainder handling? The one case where we saw non-deterministic results was when data elements at the end of the array were not being processed.

Our kernels were originally intended for data whose length was known to be an integer multiple of 256 elements. So we used a block size of 256 and a simple integer division to get the number of blocks. When the data was later allowed to be any length, the leftover elements (up to 255 of them) never got processed. Those spots in the output were never written, so they held whatever was already in that memory, which showed up as random data.
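
The failure mode described above, and the usual fix, can be sketched as follows. This is only a minimal illustration, not the code from either project: the kernel `scale`, the helpers `blocksFor` and `launchScale`, and the element-wise operation are all hypothetical names chosen for the example. The idea is to round the block count up instead of truncating, and to guard each thread with a bounds check so the extra threads in the last block do nothing.

```cuda
#include <cuda_runtime.h>

// Hypothetical element-wise kernel (illustration only): out[i] = 2 * in[i].
__global__ void scale(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // guard: the last block may extend past n
        out[i] = 2.0f * in[i];
    }
}

// Number of blocks needed to cover n elements, rounded up.
// A plain n / blockSize truncates and leaves the last
// n % blockSize elements of the output unwritten.
inline int blocksFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Launch over device pointers d_in and d_out, each holding n floats.
inline cudaError_t launchScale(const float* d_in, float* d_out, int n)
{
    const int blockSize = 256;
    if (n <= 0) return cudaSuccess;   // nothing to do; a zero-block launch is invalid
    scale<<<blocksFor(n, blockSize), blockSize>>>(d_in, d_out, n);
    return cudaGetLastError();        // reports launch configuration errors
}
```

For example, with n = 1000 and a block size of 256, truncating division gives 3 blocks (768 threads) and leaves 232 output elements untouched, while `blocksFor(1000, 256)` gives 4 blocks and the guard disables the 24 surplus threads.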

edited Jul 25 '11 at 18:46

answered Jul 21 '11 at 00:21

Steve Fallows

Thanks for the feedback! It turns out that it did have something to do with my blocksize, as you suggested. I had assigned more than enough blocks for the data I wanted to process, which is okay as long as you check to make sure that you only process threads within proper index bounds for your data. Turns out I was using the wrong bounds. Fixing it solved the problem :) – Fares Jul 21 '11 at 20:35
@Fares - yeah, there are generally two ways to handle the "leftover" data in this situation. Lots of examples show your way. We've found it's usually slightly faster to run the kernel a second time with one block of just the remaining number of threads. – Steve Fallows Jul 25 '11 at 18:49
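
The second approach mentioned in the last comment might look roughly like the sketch below. It assumes a hypothetical unguarded kernel `scaleNoGuard` that takes a starting offset; that name, `splitLaunch`, and `launchScaleTwoPass` are invented for this example and do not come from the thread. Full blocks are launched first, then a single block sized to the remainder handles the tail. Whether this is actually faster than a bounds check depends on the kernel and hardware, so it is worth measuring.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel with no bounds check (illustration only):
// every launched thread must map to a valid element.
__global__ void scaleNoGuard(const float* in, float* out, int offset)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * in[i];
}

// How n elements split into full blocks plus a leftover tail.
struct LaunchSplit {
    int fullBlocks;   // blocks that are completely filled
    int remainder;    // elements left over (0 .. blockSize - 1)
};

inline LaunchSplit splitLaunch(int n, int blockSize)
{
    LaunchSplit s;
    s.fullBlocks = n / blockSize;
    s.remainder  = n % blockSize;
    return s;
}

// Two launches over device pointers d_in and d_out, each holding n floats.
inline cudaError_t launchScaleTwoPass(const float* d_in, float* d_out, int n)
{
    const int blockSize = 256;
    const LaunchSplit s = splitLaunch(n, blockSize);

    if (s.fullBlocks > 0) {
        // First pass: only completely filled blocks, starting at element 0.
        scaleNoGuard<<<s.fullBlocks, blockSize>>>(d_in, d_out, 0);
    }
    if (s.remainder > 0) {
        // Second pass: one block with exactly the leftover thread count.
        scaleNoGuard<<<1, s.remainder>>>(d_in, d_out, s.fullBlocks * blockSize);
    }
    return cudaGetLastError();
}
```

With n = 1000 this launches 3 full blocks of 256 threads, then one block of 232 threads starting at element 768.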
