I'm using an implementation of parallel reduction in CUDA based on the new Kepler shuffle instructions, similar to the one described here: http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
I am computing the minimum of each row of a given matrix, and at the end of the kernel I had the following code:
my_register = min(my_register, __shfl_down(my_register,8,16));
my_register = min(my_register, __shfl_down(my_register,4,16));
my_register = min(my_register, __shfl_down(my_register,2,16));
my_register = min(my_register, __shfl_down(my_register,1,16));
My blocks are 16*16, so everything worked fine: with that code I was getting the minima of two sub-rows in the very same kernel.
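For reference, here is a small host-side C++ simulation (not actual CUDA; `shfl_down16` and `subrow_min` are names I made up) of how I understand the semantics of the four-step reduction above, assuming `__shfl_down(v, delta, 16)` returns the caller's own value when the source lane falls outside the 16-lane segment:

```cpp
#include <algorithm>
#include <vector>

// Host-side stand-in for one __shfl_down(v, delta, 16) step across 16 lanes:
// lane i reads lane i+delta if that lane is inside the segment, otherwise
// it keeps its own value.
std::vector<int> shfl_down16(const std::vector<int>& v, int delta) {
    std::vector<int> out(v.size());
    for (size_t i = 0; i < v.size(); ++i)
        out[i] = (i + static_cast<size_t>(delta) < v.size()) ? v[i + delta] : v[i];
    return out;
}

// The four-step min reduction from the kernel, applied to one 16-lane sub-row.
int subrow_min(std::vector<int> lanes) {
    for (int delta : {8, 4, 2, 1}) {
        std::vector<int> other = shfl_down16(lanes, delta);
        for (size_t i = 0; i < lanes.size(); ++i)
            lanes[i] = std::min(lanes[i], other[i]);
    }
    return lanes[0];  // lane 0 ends up holding the minimum of all 16 lanes
}
```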
Now I also need to return the index of the smallest element in every row of my matrix, so I was going to replace "min" with an "if" statement and handle the indices in a similar fashion. I got stuck at this code:
if (my_reg > __shfl_down(my_reg,8,16)){my_reg = __shfl_down(my_reg,8,16);};
if (my_reg > __shfl_down(my_reg,4,16)){my_reg = __shfl_down(my_reg,4,16);};
if (my_reg > __shfl_down(my_reg,2,16)){my_reg = __shfl_down(my_reg,2,16);};
if (my_reg > __shfl_down(my_reg,1,16)){my_reg = __shfl_down(my_reg,1,16);};
There are no cudaErrors whatsoever, but the kernel now returns trash. Nevertheless, I have a fix for that:
myreg_tmp = __shfl_down(myreg,8,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,4,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,2,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,1,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
So, allocating a new tmp variable to peek into neighboring registers fixes everything for me. Now the question: are the Kepler shuffle instructions destructive, in the sense that invoking the same instruction twice doesn't produce the same result? I haven't assigned anything to those registers by writing "my_reg > __shfl_down(my_reg,8,16)", which adds to my confusion. Can anyone explain what the problem is with invoking shuffle twice? I'm pretty much a newbie in CUDA, so a detailed explanation for dummies is welcome.