Finding least number in a set with CUDA

Question

Suppose you had a function that would take in a vector, a set of vectors, and find which vector in the set of vectors was closest to the original vector. It may be useful if I included some code:

int findBMU(float * inputVector, float * weights){


    int count = 0;
    float currentDistance = 0;
    int winner = 0;
    float leastDistance = 99999;

    for(int i = 0; i<10; i++){
        for(int j = 0;j<10; j++){
            for(int k = 0; k<10; k++){

                int offset = (i*100+j*10+k)*644;
                for(int i = offset; i<offset+644; i++){
                    currentDistance += abs((inputVector[count]-weights[i]))*abs((inputVector[count]-weights[i]));
                    count++;
                }
                currentDistance = sqrt(currentDistance);

                count = 0;
                if(currentDistance<leastDistance){
                    winner = offset;

                    leastDistance = currentDistance;

                }
                currentDistance = 0;
            }
        }
    }
    return winner;
}

In this example, weights is a single dimensional array, with a block of 644 elements corresponding to one vector. inputVector is the vector that's being compared, and it also has 644 elements.

To speed up my program, I decided to take a look at the CUDA framework provided by NVIDIA. This is what my code looked like once I changed it to fit CUDA's specifications.

__global__ void findBMU(float * inputVector, float * weights, int * winner, float * leastDistance){




    int i = threadIdx.x+(blockIdx.x*blockDim.x);

    if(i<1000){

        int offset = i*644;
        int count = 0;
        float currentDistance = 0;
        for(int w = offset; w<offset+644; w++){
            currentDistance += abs((inputVector[count]-weights[w]))*abs((inputVector[count]-weights[w]));

            count++;
        }


        currentDistance = sqrt(currentDistance);

        count = 0;
        if(currentDistance<*leastDistance){
            *winner = offset;

            *leastDistance = currentDistance;

        }
        currentDistance = 0;
    }

}

To call the function, I used : findBMU<<<20, 50>>>(d_data, d_weights, d_winner, d_least);

But, when I would call the function, sometimes it would give me the right answer, and sometimes it wouldn't. After doing some research, I found that CUDA has some issues with reduction problems like these, but I couldn't find how to fix it. How can I modify my program to make it work with CUDA?

BTW, do a little bit more research about CUDA's execution model. It is almost always the best practice to pick your threads per block to be multiples of 32. Because threads are grouped into warps(of size 32), and threads in the same warp execute the same instruction at a given point in time. — Engin Kayraklioglu, Mar 08 '14 at 20:03

Roger Dahl · Accepted Answer · 2014-03-08T16:26:36.573

The issue is that threads that run concurrently will see the same leastDistance and overwrite each other's results. There are two values that are shared between threads; leastDistance and winner. You have two basic options. You can write out the results from all the threads and then do a second pass over the data with a parallel reduction to determine which vector had the best match or you can implement this with a custom atomic operation using atomicCAS().

The first method is the easiest. My guess is that it will also give you the best performance, though it does add a dependency for the the free Thrust library. You would use thrust::min_element().

The method using atomicCAS() uses the fact that atomicCAS() has a 64-bit mode, in which you can assign any semantics that you wish to a 64-bit value. In your case, you would use 32 bits to store leastDistance and 32 bits to store winner. To use this method, adapt this example in the CUDA C Programming Guide that implements a double precision floating point atomicAdd().

__device__ double atomicAdd(double* address, double val)
{
  unsigned long long int* address_as_ull =
  (unsigned long long int*)address;
  unsigned long long int old = *address_as_ull, assumed;
  do {
    assumed = old;
    old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed)));
  } while (assumed != old);
  return __longlong_as_double(old);
}

Isn't atomicMin() only overloaded to work with integers? – user1231745 Mar 08 '14 at 16:04 — user1231745, Mar 08 '14 at 16:04

Finding least number in a set with CUDA

1 Answers1