After all the research and reading I've done, I am still not 100% clear on how to do this, so I have to ask. I am trying to get something like the following to run on a GPU, and I am using Cudafy.Net to generate the CUDA C equivalent. I want it to run as fast as possible.
If I have a function (simplified) such as:
    Transform()
    {
        for (lgDY = 0; lgDY < lgeHeight; lgDY++)
        {
            for (lgDX = 0; lgDX < lgeWidth; lgDX++)
            {
                // do a lot of stuff with lgDY and lgDX, like filling a matrix
            }
        }
    }
I am invoking this with the Launch() function as follows:
    gpu.Launch(blocksize, threadsize, "Transform", args...)
I am familiar with the GThread passed as the first argument, and with its blockIdx.x, blockDim.x and threadIdx.x members (and the y and z counterparts). What I am having a hard time understanding is whether the for statements go away and get replaced with a test, something like:
    if ( y < lgeHeight )
        if ( x < lgeWidth )
            ...
But then I have no idea how to tie each iteration to an incremented lgDY and lgDX.
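To make the question concrete, here is a minimal sketch of the mapping I think is intended, written as a CPU-only emulation in plain C rather than real Cudafy or CUDA code. The Dim2 struct and the functions transform_thread and emulate_launch are made up for illustration, and the visits array just stands in for "a lot of stuff". The only part taken from the usual CUDA pattern is the index arithmetic: block index times block size plus thread index.

```c
/* Hypothetical stand-in for the x/y parts of blockIdx, blockDim, threadIdx. */
typedef struct { int x; int y; } Dim2;

/* What one GPU thread would do: no loops, just derive its own lgDX/lgDY. */
static void transform_thread(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx,
                             int lgeWidth, int lgeHeight, int *visits)
{
    int lgDX = blockIdx.x * blockDim.x + threadIdx.x;
    int lgDY = blockIdx.y * blockDim.y + threadIdx.y;

    /* The grid is rounded up, so threads past the edge must do nothing. */
    if (lgDY < lgeHeight && lgDX < lgeWidth) {
        visits[lgDY * lgeWidth + lgDX] += 1; /* placeholder for the real work */
    }
}

/* CPU emulation of a launch: runs every (block, thread) pair in sequence.
   Returns the number of emulated threads that were started. */
int emulate_launch(int lgeWidth, int lgeHeight, Dim2 blockDim, int *visits)
{
    Dim2 gridDim, blockIdx, threadIdx;
    int launched = 0;

    /* Round up so the grid covers the whole width x height area. */
    gridDim.x = (lgeWidth + blockDim.x - 1) / blockDim.x;
    gridDim.y = (lgeHeight + blockDim.y - 1) / blockDim.y;

    for (blockIdx.y = 0; blockIdx.y < gridDim.y; blockIdx.y++)
        for (blockIdx.x = 0; blockIdx.x < gridDim.x; blockIdx.x++)
            for (threadIdx.y = 0; threadIdx.y < blockDim.y; threadIdx.y++)
                for (threadIdx.x = 0; threadIdx.x < blockDim.x; threadIdx.x++) {
                    transform_thread(blockIdx, blockDim, threadIdx,
                                     lgeWidth, lgeHeight, visits);
                    launched++;
                }
    return launched;
}
```

With a 5 x 3 area and 4 x 2 blocks, this starts 32 emulated threads, 15 of which pass the bounds test, and each of those lands on a different (lgDX, lgDY) pair. Is that the right way to think about what replaces the two for loops?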
I apologize if this is something blatantly obvious, or if I haven't described what I am trying to do accurately. I am just confused about how to convert the nested loop correctly. I appreciate any and all help to get me moving in the right direction.