
I created a CUDA function for calculating the sum of an image using its histogram.

I'm trying to compile the kernel and the wrapper function for multiple compute capabilities.

Kernel:

__global__ void calc_hist(unsigned char* pSrc, int* hist, int width, int height, int pitch)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

#if __CUDA_ARCH__ > 110   //Shared Memory For Devices Above Compute 1.1
    __shared__ int shared_hist[256];
#endif

    int global_tid = yIndex * pitch + xIndex;

    int block_tid = threadIdx.y * blockDim.x + threadIdx.x;

    /* No early return here: every thread must reach the __syncthreads() calls
       below, so out-of-bounds threads are masked with a flag instead. */
    bool in_bounds = (xIndex < width && yIndex < height);

#if __CUDA_ARCH__ == 110 //Calculate Histogram In Global Memory For Compute 1.1

    if(in_bounds)
        atomicAdd(&hist[pSrc[global_tid]],1);   /*< Atomic Add In Global Memory */

#elif __CUDA_ARCH__ > 110   //Calculate Histogram In Shared Memory For Compute Above 1.1

    shared_hist[block_tid] = 0;   /*< Clear Shared Memory; A 16x16 Block Gives One Thread Per Bin */
    __syncthreads();

    if(in_bounds)
        atomicAdd(&shared_hist[pSrc[global_tid]],1);    /*< Atomic Add In Shared Memory */
    __syncthreads();

    if(shared_hist[block_tid] > 0)  /* Only Write Non Zero Bins Into Global Memory */
        atomicAdd(&(hist[block_tid]),shared_hist[block_tid]);
#else
    return;     //Do Nothing For Devices Of Compute Capability 1.0
#endif
}

Wrapper Function:

int sum_8u_c1(unsigned char* pSrc, double* sum, int width, int height, int pitch, cudaStream_t stream = NULL)
{

#if __CUDA_ARCH__ == 100
    printf("Compute Capability Not Supported\n");
    return 0;

#else
    int *hHist,*dHist;
    cudaMalloc(&dHist,256*sizeof(int));
    cudaHostAlloc(&hHist,256 * sizeof(int),cudaHostAllocDefault);

    cudaMemsetAsync(dHist,0,256 * sizeof(int),stream);

    dim3 Block(16,16);
    dim3 Grid;

    Grid.x = (width + Block.x - 1)/Block.x;
    Grid.y = (height + Block.y - 1)/Block.y;

    calc_hist<<<Grid,Block,0,stream>>>(pSrc,dHist,width,height,pitch);

    cudaMemcpyAsync(hHist,dHist,256 * sizeof(int),cudaMemcpyDeviceToHost,stream);

    cudaStreamSynchronize(stream);

    (*sum) = 0.0;
    for(int i=1; i<256; i++)
        (*sum) += (hHist[i] * i);

    printf("sum = %f\n",(*sum));

    cudaFree(dHist);
    cudaFreeHost(hHist);

    return 1;
#endif

}

Question 1:

When compiling for sm_10, the wrapper and the kernel shouldn't execute. But that is not what happens: the whole wrapper function executes, and the output shows sum = 0.0.

I expected the output to be Compute Capability Not Supported, as I have added the printf statement at the start of the wrapper function.

How can I prevent the wrapper function from executing on sm_10? I don't want to add any run-time checks like if statements, etc. Can it be achieved through template meta-programming?

Question 2:

When compiling for greater than sm_10, the program executes correctly only if I add cudaStreamSynchronize after the kernel call. But if I do not synchronize, the output is sum = 0.0. Why is this happening? I want the function to be as asynchronous with respect to the host as possible. Is it possible to move the only loop in the function (the final summation over the histogram) into the kernel?

I am using a GTX 460M, CUDA 5.0, and Visual Studio 2008 on Windows 8.

sgarizvi
  • For question 1, [`__CUDA_ARCH__` is always undefined when compiling host code](http://stackoverflow.com/questions/8796369/cuda-and-nvcc-using-the-preprocessor-to-choose-between-float-or-double). I don't have an answer for how to do it with template meta-programming, or without using an if statement. Obviously you could query the compute capability at runtime, but that would be using an if statement. For question 2, use a cudaMemcpy instead of cudaMemcpyAsync. The kernel execution will already be asynchronous to the host, and you want the cudaMemcpy to wait for completion of the kernel. – Robert Crovella Nov 17 '12 at 18:27
  • Agree on __CUDA_ARCH__ usage. The use of cudaMemcpyAsync seems fine to me, as it will wait for the completion of a preceding kernel in the same stream. All operations within a particular stream occur in order. However, since the operations in the stream execute asynchronously to the host, there needs to be synchronization with the stream before the host can retrieve results produced by the GPU. – njuffa Nov 17 '12 at 21:24
  • Agreed. My reasoning was incorrect. However, I think a cudaMemcpy in lieu of cudaMemcpyAsync will fix it, without the need for a stream synchronize. This prevents the cudaMemcpy operation from returning control to the CPU until the copy is completed. And as you say (my error), the copy will not begin until the kernel is complete. – Robert Crovella Nov 17 '12 at 21:54
  • The simple cudaMemcpy will be executed on the default CUDA stream. I read in one of the SO answers that calling a CUDA function on the default stream between two async calls will revoke the overlapping of the user-created streams. – sgarizvi Nov 17 '12 at 22:16
  • Agreed. But you didn't appear to be overlapping anything at that point in your application. You're simply waiting for the results before you can sum. But the async memcpy issued to the stream followed by the stream synchronize works too, and will be more convenient for overlapping in other scenarios. – Robert Crovella Nov 17 '12 at 22:51
  • Thank you, I got the idea. :) – sgarizvi Nov 18 '12 at 10:11

1 Answer


Ad. Question 1

As Robert already explained in the comments - __CUDA_ARCH__ is defined only when compiling device code. To clarify: when you invoke nvcc, the code is parsed and compiled twice - once for the CPU and once for the GPU. The existence of __CUDA_ARCH__ can be used to check which of those two passes is occurring, and then, for the device code - as you do in the kernel - to check which GPU you are targeting.
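
To make the two passes concrete, here is a minimal sketch (the function name is just for illustration):

__host__ __device__ int current_arch()
{
#ifdef __CUDA_ARCH__
    return __CUDA_ARCH__;   // device pass: e.g. 110 or 200, depending on -arch
#else
    return 0;               // host pass: __CUDA_ARCH__ is never defined here
#endif
}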

However, on the host side all is not lost. While you don't have __CUDA_ARCH__, you can call the API function cudaGetDeviceProperties, which returns lots of information about your GPU. In particular, you will be interested in the fields major and minor, which indicate the Compute Capability. Note that this happens at run-time, not at the preprocessing stage, so the same CPU code will work on all GPUs.
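
A minimal sketch of such a run-time gate at the top of your wrapper (assuming device 0 and omitting error checking) could look like this - it is a run-time if, which you said you wanted to avoid, but the host pass simply has no __CUDA_ARCH__ to branch on at compile time:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // query the properties of device 0

if(prop.major == 1 && prop.minor == 0)   // compute capability 1.0 (sm_10) has no atomics
{
    printf("Compute Capability Not Supported\n");
    return 0;
}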

Ad. Question 2

Kernel calls and cudaMemcpyAsync are asynchronous with respect to the host. This means that if you don't call cudaStreamSynchronize (or the like), the follow-up CPU code will continue running even if your GPU hasn't finished its work. In particular, the data you copy from dHist to hHist might not be there yet when you begin operating on hHist in the loop. If you want to work on the output of a kernel, you have to wait until the kernel (and the copy after it) finishes.

Note that cudaMemcpy (without Async) has an implicit synchronization inside.
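
Following Robert's suggestion from the comments, here is a sketch of that blocking variant (with the legacy default stream of CUDA 5.0, the plain cudaMemcpy will not start until the preceding kernel launch has finished, and it returns only after the copy completes, so the explicit cudaStreamSynchronize can be dropped):

calc_hist<<<Grid,Block,0,stream>>>(pSrc,dHist,width,height,pitch);

//Blocking copy: control returns to the host only after the kernel has
//finished and the 256 bins have arrived in hHist.
cudaMemcpy(hHist,dHist,256 * sizeof(int),cudaMemcpyDeviceToHost);

(*sum) = 0.0;
for(int i=1; i<256; i++)
    (*sum) += (hHist[i] * i);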

CygnusX1