
I have the following code in Java:

float in[][] = new float[2][2];

float B[] = new float[2];
float A[] = new float[2];
float res[] = new float[A.length];

for (int i = 0; i < res.length; i++) {
    res[i] = 0;
}

for (int i = 0; i < A.length; i++) {
    for (int j = 0; j < B.length; j++) {
        res[i] += A[j] * in[j][i];
    }
}

I simplified it as much as possible, so you should not look for any real logic in there :).

I have been struggling for some hours to convert this to CUDA, because of the += statement in the loop.

I started with something like this:

extern "C"
__global__ void filter(float* in, float* A, float* B, float* res, const int in_width){

    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    res[x] = A[y] * in[x + y * in_width];

}

but I quickly realized it couldn't work, because all the threads with the same x try to write to the same element of res.

I read the example of the dot product in this presentation, but I don't really see how to adapt it to my case, where in has two dimensions.

I don't ask for a complete solution, but any direction would definitely be appreciated.

Thx,

jlengrand

2 Answers


Too much CUDA killed my head.

I found a partial solution by unrolling one of the loops inside my kernel. Here is what it looks like right now:

extern "C"
__global__ void filter(float* in, float* A, float* res, const int in_width, const int sizeB){
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;

    for(int i = 0; i < sizeB; i++){
        res[x] += A[i] * in[i + x * in_width];
    }
}

I am sure I can find better, but I think I'll stick with this for today :)

jlengrand
  • You are still incrementing res with all the threads; this shouldn't work either. If it's just floats, you can use atomics (likely slow, though), or you can implement a proper binary tree reduction for every block and then sum up the block sums on the CPU. – Reguj Sep 25 '12 at 14:23
  • It does work. I am incrementing an element in the loop for the same thread. In this case, each x corresponds to a single thread, so the loop is safe. The downside is that each thread has to perform a loop, so I have only half optimized the process. – jlengrand Sep 25 '12 at 18:12
  • This looks like how I would first parallelize it. You need to initialize the res array to zero, which I assume you are doing outside this kernel. It would be faster to just put res[x] = 0; before the loop inside the kernel. Next comes optimization... – harrism Sep 25 '12 at 22:51
  • Thanks for the feedback. Good to know that others would do the same as I did :) – jlengrand Sep 26 '12 at 07:02

You can split the multiplication job A[j] * in[j][i] across A.length * B.length threads, and then sum up the partial products with a shared-memory reduction, like the reduction sample in the NVIDIA SDK.

rps