Basically, it is a materialized version of this post. Suppose a warp need to process 4 objects(say, pixels in image), each 8 lanes are grouped together to process one object:
Now I need do internal shuffle operations during processing one object(i.e. among 8 lanes of this object), it worked for each object just setting
mask
as 0xff
:
uint32_t mask = 0xff;
__shfl_up_sync(mask,val,1);
However, to my understanding, set mask
as 0xff
will force the lane0:lane7
of object0(or object3? also stuck on this point) to participate, but I ensure that above usage applies to each object after a mass of trials. So, my question is whether __shfl_up_sync
call can adapt argument mask
to force corresponding lanes participating?
Update
Actually, this problem came from codes of libSGM that I tried to parse. In particular, it solves minimal cost path with dynamic programming in a decently parallel way. Once program reaches this line after launching the kernel aggregate_vertical_path_kernel
with execution configuration:
//MAX_DISPARITY is 128 and BLOCK_SIZE is 256
//Basically, each block serves to process 32 pixels in which each warp serves to process 4.
const int gdim = (width + PATHS_PER_BLOCK - 1) / PATHS_PER_BLOCK;
const int bdim = BLOCK_SIZE;
aggregate_vertical_path_kernel<1, MAX_DISPARITY><<<gdim, bdim, 0, stream>>>(...)
An object dp
is instantiated from DynamicProgramming<DP_BLOCK_SIZE, SUBGROUP_SIZE>
:
static constexpr unsigned int DP_BLOCK_SIZE = 16u;
...
//MAX_DISPARITY is 128
static const unsigned int SUBGROUP_SIZE = MAX_DISPARITY / DP_BLOCK_SIZE;
...
DynamicProgramming<DP_BLOCK_SIZE, SUBGROUP_SIZE> dp;
Keep following the program, dp.updata()
will be invoked in which __shfl_up_sync
is used to access the last element of previous DP_BLOCK
and __shfl_down_sync
is used to access the first element of the rear DP_BLOCK
. Besides, each 8 lanes in one warp are grouped together:
//So each 8 threads are grouped together to process one pixel in which each lane is contributed to one DP_BLOCK for corresponding pixel.
const unsigned int lane_id = threadIdx.x % SUBGROUP_SIZE;
Here it comes, once program reaches this line:
//mask is specified as 0xff(255)
const uint32_t prev =__shfl_up_sync(mask, dp[DP_BLOCK_SIZE - 1], 1);
each lane in one warp does shuffle with the same mask 0xff
, which causes my above question.