Suppose you had a function that would take in a vector, a set of vectors, and find which vector in the set of vectors was closest to the original vector. It may be useful if I included some code:
int findBMU(float * inputVector, float * weights){
int count = 0;
float currentDistance = 0;
int winner = 0;
float leastDistance = 99999;
for(int i = 0; i<10; i++){
for(int j = 0;j<10; j++){
for(int k = 0; k<10; k++){
int offset = (i*100+j*10+k)*644;
for(int i = offset; i<offset+644; i++){
currentDistance += abs((inputVector[count]-weights[i]))*abs((inputVector[count]-weights[i]));
count++;
}
currentDistance = sqrt(currentDistance);
count = 0;
if(currentDistance<leastDistance){
winner = offset;
leastDistance = currentDistance;
}
currentDistance = 0;
}
}
}
return winner;
}
In this example, weights
is a single dimensional array, with a block of 644 elements corresponding to one vector. inputVector
is the vector that's being compared, and it also has 644 elements.
To speed up my program, I decided to take a look at the CUDA framework provided by NVIDIA. This is what my code looked like once I changed it to fit CUDA's specifications.
__global__ void findBMU(float * inputVector, float * weights, int * winner, float * leastDistance){
int i = threadIdx.x+(blockIdx.x*blockDim.x);
if(i<1000){
int offset = i*644;
int count = 0;
float currentDistance = 0;
for(int w = offset; w<offset+644; w++){
currentDistance += abs((inputVector[count]-weights[w]))*abs((inputVector[count]-weights[w]));
count++;
}
currentDistance = sqrt(currentDistance);
count = 0;
if(currentDistance<*leastDistance){
*winner = offset;
*leastDistance = currentDistance;
}
currentDistance = 0;
}
}
To call the function, I used : findBMU<<<20, 50>>>(d_data, d_weights, d_winner, d_least);
But, when I would call the function, sometimes it would give me the right answer, and sometimes it wouldn't. After doing some research, I found that CUDA has some issues with reduction problems like these, but I couldn't find how to fix it. How can I modify my program to make it work with CUDA?