I have been trying to figure out how to make what I thought would be a simple kernel to take the average of the values in a 2d matrix, but I am having some issues getting my thought process straight on it.
According to my deviceQuery output, my GPU has 16 MPs with 32 cores/MP, a max block dimension of 1024x1024x64, and a max threads/block = 1024.
So, I am working on processing some large images, maybe 5000px x 3500px or something like that. One of my kernels is taking an average of some values across all pixels in the image.
The existing code has the images stored as a 2D array [rows][cols]. So that kernel, in C, looks like you'd expect, with a loop over rows and a loop over cols, with the calculation in the middle.
So how do I set up the dimension calculation portion of this code in CUDA? I have looked at the reduction code in the SDK, but that is for a single-dimension array. It doesn't have any mention of how to set up the number of blocks and threads when you have something 2D.
I am thinking I'd actually need to set it up like so, and this is where I'd like someone to chime in and help:
num_threads=1024;
blocksX = num_cols/sqrt(num_threads);
blocksY = num_rows/sqrt(num_threads);
num_blocks = (num_rows*num_cols)/(blocksX*blocksY);
dim3 dimBlock(blocksX, blocksY, 1);
dim3 dimGrid(num_blocks, 1, 1);
Does this seem to make sense for the setup?
And then in the kernel, to work on a particular row or column, I'd have to use

rowidx = (blockIdx.x*blockDim.x) + threadIdx.x;
colidx = (blockIdx.y*blockDim.y) + threadIdx.y;
At least I think that would work for getting a row and column.
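Putting that together, I'm imagining the per-thread side looking something like the sketch below. This is just my understanding, not working code: the bounds check is there because the image size won't divide evenly into blocks, and the partialSums argument is a placeholder for whatever reduction scheme I end up using:

```cuda
__global__ void averageKernel(const float* devPtr, size_t pitch,
                              int width, int height, float* partialSums)
{
    // One thread per pixel; both block and grid are 2D.
    int colidx = blockIdx.x * blockDim.x + threadIdx.x;
    int rowidx = blockIdx.y * blockDim.y + threadIdx.y;

    // Guard against threads that fall outside the image.
    if (rowidx >= height || colidx >= width)
        return;

    // Pitched access: step rowidx * pitch BYTES down, then index within the row.
    const float* row = (const float*)((const char*)devPtr + rowidx * pitch);
    float element = row[colidx];
    // ... accumulate element into a block-level reduction here ...
}
```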
How would I then access that particular row r and column c in the kernel? In the CUDA programming guide I found the following code:
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r)
    {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c)
        {
            float element = row[c];
        }
    }
}
Which looks similar to how you'd use malloc in C to declare a 2D array, but it doesn't have any mention of accessing that array in your own kernel. I guess in my code, I will use that cudaMallocPitch call, and then perform a memcpy to get my data into the 2D array on the device?
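For that copy, my guess is that cudaMemcpy2D is the matching call, since it takes the source and destination pitches separately. The sketch below assumes my host image is stored contiguously row-by-row (so its pitch is just width * sizeof(float)); hostImage here is a stand-in for my actual image buffer:

```cuda
float* devPtr;
size_t pitch;
int width = 5000, height = 3500;  // cols, rows from my images

// Stand-in for my actual host-side image data, rows*cols floats.
float* hostImage = (float*)malloc(width * height * sizeof(float));

// Allocate pitched device memory: each row is padded out to 'pitch' bytes.
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);

// Copy row-by-row, letting the runtime handle the differing pitches:
// dst, dst pitch, src, src pitch, row width in bytes, number of rows, kind.
cudaMemcpy2D(devPtr, pitch,
             hostImage, width * sizeof(float),
             width * sizeof(float), height,
             cudaMemcpyHostToDevice);
```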
Any tips appreciated! Thanks!