I have been trying to figure out how to make what I thought would be a simple kernel to take the average of the values in a 2d matrix, but I am having some issues getting my thought process straight on it.
According to my deviceQuery output, my GPU has 16 MPs with 32 cores/MP, a max block dimension of 1024x1024x64, and a max threads/block = 1024.
So, I am working on processing some large images, maybe 5000px x 3500px or something like that. One of my kernels is taking an average of some values across all pixels in the image.
The existing code has the images stored as a 2D array [rows][cols]. So that kernel, in C, looks like you'd expect, with a loop over rows and a loop over cols, with the calculation in the middle.
So how do I set up the dimension calculation portion of this code in CUDA? I have looked at the reduction code in the SDK, but that is for a single-dimension array. It doesn't have any mention of how to set up the number of blocks and threads when you have something 2D.
I am thinking I'd actually need to set it up like so, and this is where I'd like someone to chime in and help:
num_threads=1024;
blocksX = num_cols/sqrt(num_threads);
blocksY = num_rows/sqrt(num_threads);
num_blocks = (num_rows*num_cols)/(blocksX*blocksY);
dim3 dimBlock(blocksX, blocksY, 1);
dim3 dimGrid(num_blocks, 1, 1);
Does this seem to make sense for the setup?
And then in the kernel, to work on a particular row or column, I'd have to use

rowidx = (blockIdx.x*blockDim.x) + threadIdx.x;
colidx = (blockIdx.y*blockDim.y) + threadIdx.y;
At least I think that would work for getting a row and column.
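Putting that together, I'm imagining the per-thread side looking something like the sketch below. This is just my understanding, not working code: the bounds check is there because the image size won't divide evenly into blocks, and the partialSums argument is a placeholder for whatever reduction scheme I end up using:

```cuda
__global__ void averageKernel(const float* devPtr, size_t pitch,
                              int width, int height, float* partialSums)
{
    // One thread per pixel; both block and grid are 2D.
    int colidx = blockIdx.x * blockDim.x + threadIdx.x;
    int rowidx = blockIdx.y * blockDim.y + threadIdx.y;

    // Guard against threads that fall outside the image.
    if (rowidx >= height || colidx >= width)
        return;

    // Pitched access: step rowidx * pitch BYTES down, then index within the row.
    const float* row = (const float*)((const char*)devPtr + rowidx * pitch);
    float element = row[colidx];
    // ... accumulate element into a block-level reduction here ...
}
```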
How would I then access that particular row r and column c in the kernel? In the CUDA programming guide I found the following code:
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r)
    {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c)
        {
            float element = row[c];
        }
    }
}
Which looks similar to how you'd use malloc in C to declare a 2D array, but it doesn't have any mention of accessing that array in your own kernel. I guess in my code, I will use that cudaMallocPitch call, and then perform a memcpy to get my data into the 2D array on the device?
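For that copy, my guess is that cudaMemcpy2D is the matching call, since it takes the source and destination pitches separately. The sketch below assumes my host image is stored contiguously row-by-row (so its pitch is just width * sizeof(float)); hostImage here is a stand-in for my actual image buffer:

```cuda
float* devPtr;
size_t pitch;
int width = 5000, height = 3500;  // cols, rows from my images

// Stand-in for my actual host-side image data, rows*cols floats.
float* hostImage = (float*)malloc(width * height * sizeof(float));

// Allocate pitched device memory: each row is padded out to 'pitch' bytes.
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);

// Copy row-by-row, letting the runtime handle the differing pitches:
// dst, dst pitch, src, src pitch, row width in bytes, number of rows, kind.
cudaMemcpy2D(devPtr, pitch,
             hostImage, width * sizeof(float),
             width * sizeof(float), height,
             cudaMemcpyHostToDevice);
```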
Any tips appreciated! Thanks!