I’m working on some task related to graph traversal (Viterbi algorithm).
Each time step I have a compacted set of active states, some job is done in each state, and than results are propagated through outgoing arcs to each arc’s destination state and so new active set of states is built.
The problem is that number of outgoing arcs varies very heavily, from two or three to several thousands. So compute threads are loaded very ineffectively.
I try to share the job through shared local memory queue
int tx = threadIdx.x;
extern __shared__ int smem[];
int *stateSet_s = smem; //new active set
int *arcSet_s = &(smem[Q_LEN]); //local shared queue
float *scores_s = (float*)&(smem[2*Q_LEN]);
__shared__ int arcCnt;
__shared__ int stateCnt;
if ( tx == 0 )
{
arcCnt = 0;
stateCnt = 0;
}
__syncthreads();
//load state index from compacted list of state indexes
int stateId = activeSetIn_g[gtx];
float srcCost = scores_g[ stateId ];
int startId = outputArcStartIds_g[stateId];
int nArcs = outputArcCounts_g[stateId]; //number of outgoing arcs to be propagated (2-3 to thousands)
/////////////////////////////////////////////
/// prepare arc set
/// !!!! that is the troubled code I think !!!!
/// bank conflicts? uncoalesced access?
int myPos = atomicAdd ( &arcCnt, nArcs );
while ( nArcs > 0 ) && ( myPos < Q_LEN ) )
{
scores_s[myPos] = srcCost;
arcSet_s[myPos] = startId + nArcs - 1;
myPos++;
nArcs--;
}
__syncthreads();
//////////////////////////////////////
/// parallel propagate arc set
if ( arcSet_s[tx] > 0 )
{
FstArc arc = arcs_g[ arcSet_s[tx] ];
float srcCost_ = scores_s[tx];
DoSomeJob ( &srcCost_ );
int *dst = &(transitionData_g[arc.dst]);
int old = atomicMax( dst, FloatToInt ( srcCost_ ) );
////////////////////////////////
//// new active set
if ( old == ILZERO )
{
int pos = atomicAdd ( &stateCnt, 1 );
stateSet_s[ pos ] = arc.dst;
}
}
/////////////////////////////////////////////
/// transfer new active set from smem to gmem
__syncthreads();
__shared__ int gPos;
if ( tx == 0 )
{
gPos = atomicAdd ( activeSetOutSz_g, stateCnt );
}
__syncthreads();
if ( tx < stateCnt )
{
activeSetOut_g[gPos + tx] = stateSet_s[tx];
}
__syncthreads();
But it runs very slow, I mean slower then if no active set is used (active set = all states), though active set is 10 – 15 percent of all states. Register pressure raised heavily, occupancy is low, but I don’t think anything can be done about it.
May be there are more effective ways of job sharing among threads? A think about warp-shuffle ops on 3.0, but I have to use 2.x devices.