
I have the following code in Java:

float in[][] = new float[2][2];

float B[] = new float[2];
float A[] = new float[2];
float res[] = new float[A.length];

for (int i = 0; i < res.length; i++) {
    res[i] = 0;
}

for (int i = 0; i < A.length; i++) {
    for (int j = 0; j < B.length; j++) {
        res[i] += A[j] * in[j][i];
    }
}

I simplified it as much as possible, so you should not look for any real logic in there :).

I have been struggling for some hours to convert this to CUDA, because of the += statement in the loop.

I started with something like this:

extern "C"
__global__ void filter(float* in, float* A, float* B, float* res, const int in_width){

    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    res[x] = A[y] * in[x + y * in_width];

}

but I quickly realized it couldn't work, because all the threads with the same x try to write to the same element of res.

I read the example of the dot product in this presentation, but I don't really see how to adapt it to my case, where in has two dimensions.

I don't ask for a complete solution, but any direction would definitely be appreciated.

Thx,

jlengrand

2 Answers


Too much CUDA killed my head.

I found a partial solution by unrolling one of the loops inside my kernel. Here is what it looks like right now:

extern "C"
__global__ void filter(float* in, float* A, float* res, const int in_width, const int sizeB){
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;

    for(int i = 0; i < sizeB; i++){
        res[x] += A[i] * in[i + x * in_width];
    }
}

I am sure I can find better, but I think I'll stick with this for today :)

jlengrand
  • You are still incrementing res with all the threads; this shouldn't work either. If it's just floats, you can use atomics (likely slow, though), or you can implement a proper binary tree reduction for every block and then sum up the block sums on the CPU. – Reguj Sep 25 '12 at 14:23
  • It does work. I am incrementing an element in the loop for the same thread. In this case, each x corresponds to a single thread, so the loop is safe. The downside is that each thread has to perform a loop, so I have only half optimized the process. – jlengrand Sep 25 '12 at 18:12
  • This looks like how I would first parallelize it. You need to initialize the res array to zero, which I assume you are doing outside this kernel. It would be faster to just put res[x] = 0; before the loop inside the kernel. Next comes optimization... – harrism Sep 25 '12 at 22:51
  • Thanks for the feedback. Good to know that others would do the same as I did :) – jlengrand Sep 26 '12 at 07:02

You can split the multiplication job A[j] * in[j][i] across A.length * B.length threads, and then sum up the partial products with a shared-memory reduction, like the reduction sample in the NVIDIA SDK.

rps