0

For some reason, the breakpoints I set in a specific kernel are completely ignored... I have checked the error status with cudaGetLastError(), which told me that everything ran fine so I am quite sure this should mean that the kernel has executed. Placing printf statements also yields no extra information, as nothing is printed. Even in a kernel that is entered in debug mode, the printf calls have no effect. What could go wrong here?!

We are running Cuda 4.2 on a Tesla M2075 (driver version 295.41). Output when debugging:

(cuda-gdb) break cudaCalcBeamIntersect
Breakpoint 1 at 0x401cfb: file cudacalcbeamintersect.cu, line 109.
(cuda-gdb) r
Starting program: /home/heit/cuda/vfind/vfind singleevent.txt 1 1 1 
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5dd5700 (LWP 20241)]
[Context Create of context 0x634220 on Device 0]
[Launch of CUDA Kernel 0 (memset32_post<<<(64,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 1 (memset32_post<<<(8,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 2 (memset32_post<<<(64,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 3 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 4 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 5 (memset32_post<<<(8,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 6 (cudaInitializeGlobals<<<(256,1,1),(128,1,1)>>>) on Device 0]
no error
[Launch of CUDA Kernel 7 (cudaCalcBeamIntersect<<<(256,1,1),(128,1,1)>>>) on Device 0]
no error
Elapsed time: 0.876842 seconds.
[Thread 0x7ffff5dd5700 (LWP 20241) exited]
[Termination of CUDA Kernel 6 (cudaInitializeGlobals<<<(256,1,1),(128,1,1)>>>) on Device 0]

Program exited normally.

The "no error" prints are printed outside the kernels by calling cout << cudaGetErrorString(cudaGetLastError()) << '\n';, and indicate that both cudaInitializeGlobals() (which can be stepped through in cuda-gdb) and cudaCalcBeamIntersect() are executed without problems. The latter however, cannot be debugged.

The kernel in question is still a preliminary one, and calculates some values to be stored in (static) global memory. Nothing else is done with these values, so could it be that the compiler optimizes this call away completely? If so, why??!! And how to prevent this behavior?? (-O0 has no effect)

Cheers!

Edit - The code:

** Code calling the kernels **

    uint const nEvents = events.size();     // total number of events

    /* Not important ... */

// Allocate memory to hold the events
    Track *dev_events;                      
    cudaMalloc(&dev_events, linearEvents.size() * sizeof(Track));

// Copy all events to the GPU
    cudaMemcpy(dev_events, &linearEvents[0], linearEvents.size() * sizeof(Track), cudaMemcpyHostToDevice);

// Initialize the global data, like the histogram and the array of z-values
    cudaInitializeGlobals <<< tpb, bpg >>> ();
    cout << cudaGetErrorString(cudaGetLastError()) << '\n';

    cout << "Processing " << nEvents << " event(s)\n";
    uint linearIdx = 0;
    for (uint event = 0; event != nEvents; ++event)
    {
        uint nTracks = events[event].size();

        if (nTracks > MAX_NUMBER_OF_TRACKS)
        {
            cout << "Number of tracks in event " << event << " exceeds maximum number of tracks.\n";
            exit(1);
        }

        cudaCalcBeamIntersect <<< tpb, bpg >>> (dev_events + linearIdx, nTracks, bipThresh, binWidth);
        cout << cudaGetErrorString(cudaGetLastError()) << '\n';

    // Update linear index
        linearIdx += nTracks;
    }

cudacalcbeamintersect.cu

#include "vfind.cuh"

__device__ float    dev_zMin;
__device__ float    dev_zMax;
__device__ float    dev_zValues[MAX_NUMBER_OF_TRACKS];
__device__ uint     dev_histogram[MAX_NUMBER_OF_BINS];

__constant__ Track dev_beam = 
{
    {0, 0, 1},
    {0, 0, 0}
};

__global__ void cudaInitializeGlobals()
{
    uint const tid = threadIdx.x + blockIdx.x * blockDim.x;
    uint const nThreads = blockDim.x * gridDim.x;

    if (tid == 0)
    {
        dev_zMin = 1e6;
        dev_zMax = -1e6;
    }

    uint idx = tid;
    while (idx < MAX_NUMBER_OF_BINS || idx < MAX_NUMBER_OF_TRACKS)          
    {
        if (idx < MAX_NUMBER_OF_BINS)
            dev_histogram[idx] = 0;

        if (idx < MAX_NUMBER_OF_TRACKS)
            dev_zValues[idx] = 0;

        idx += nThreads;
    }
}

__device__ float dot(float const v1[3], float const v2[3])
{
    // Stuff
}

__device__ float distance(Track const &t1, Track const &t2)
{
    // Even more boring unimportant stuff
}

__device__ Vertex vertex(Track const &t1, Track const &t2)
{
    // Yet even more boring unimportant stuff
}

__global__ void cudaCalcBeamIntersect(Track const *tracks, uint nTracks, float bipTresh, float binWidth)
{
    uint const tid = threadIdx.x + blockIdx.x * blockDim.x;
    uint const nThreads = blockDim.x * gridDim.x;

    uint idx = tid;
    while (idx < nTracks)
    {
        float dist = distance(tracks[idx], dev_beam);
        if (dist < bipTresh)
        {
            float z = vertex(tracks[idx], dev_beam).z;

            if (z < dev_zMin)
                atomicExch(&dev_zMin, z);

            if (z > dev_zMax)
                atomicExch(&dev_zMax, z);

            dev_zValues[idx] = z;
        }

        idx += nThreads;
    }

    __syncthreads();

    // To be continued here
}
JorenHeit
  • 3,877
  • 2
  • 22
  • 28
  • Could you please post the code? – Vitality Feb 18 '13 at 11:00
  • @JackOLantern Sure I can, but it would make the post so dreadfully long... Anyhow, coming up! – JorenHeit Feb 18 '13 at 11:03
  • Have you checked that `cudaInitializeGlobals` writes the correct value in the global variables? For example, which value is written in `dev_zMin` ? – Vitality Feb 18 '13 at 11:53
  • @JackOLantern I have just checked that these values are correctly initialized. But I don't see how this would influence the call to the other kernel, even if they were not initialized correctly... – JorenHeit Feb 18 '13 at 11:58
  • In your post you are involving also `cudaInitializeGlobals`. As long as I understand, it is returning the results you expect since it correctly initializes `dev_zValues` and also the other globals, so the problem should be localized in `cudaCalcBeamIntersect`, or one of the functions called by itself. What is the content of `dev_zValues` returned by `cudaCalcBeamIntersect`? – Vitality Feb 18 '13 at 13:48
  • are you compiling with `-g -G` switches? It might be instructive to put a printf statement inside the kernel in question. This will conclusively determine whether the kernel is executing and may shed some light on what is going on. Also, if `tpb` means threads per block and `bpg` means blocks per grid, then you've got those parameters reversed in your launch parameters. Blocks per grid is the first variable, threads per block is the second. – Robert Crovella Feb 18 '13 at 15:24
  • @JackOLantern I have no idea to what values `dev_zValues` is set, as I have no means of reading out that data... This brings me to the comment of Robert, who suggests that I place a printf statement in the kernel. I have tried this, but it prints nothing (not even in the first kernel, from which I know it works). Yes, I have compiled with `-g -G` and yes, I did accidentally reverse these parameters, apologies! – JorenHeit Feb 18 '13 at 15:48
  • @JorenHeit Sorry, I do not understand why you cannot read `dev_zValues`. Just allocate a temporary array on the CPU and then move the content of `dev_zValues` from the GPU to the CPU by `cudaMemcpy` and `printf` it. Could you do that? – Vitality Feb 18 '13 at 16:25
  • Try putting a `cudaDeviceSynchronize();` in between the `cudaCalcBeamIntersect` kernel call and the `cout` statement that immediately follows it. Also, are you including `` for `printf` ? – Robert Crovella Feb 18 '13 at 22:55

1 Answers1

1

@JorenHeit Your kernel cudaCalcBeamIntersect has global memory side effects and should not be getting optimized out. Based on the posted cuda-gdb output, it looks like the host thread that had launched the work is not waiting on the work to complete (via a cudaDeviceSynchronize() call or via a cudaMemcpy from device to host). As a result, the host thread is exiting before the cudaCalcBeamIntersect kernel could be executed on the GPU. Please try adding a cudaDeviceSynchronize() call after every kernel launch in your application.

Vyas
  • 499
  • 2
  • 4