I have been trying to speed up a binary tree evaluation algorithm using AVX2. I'm actually using Agner Fog's VCL library, since the difference between hand-coded intrinsics and VCL was small and VCL is much more readable.
I have a list of trees that need to be evaluated. I put the list of indexes in an array 'vTreeIndexes'. This is the static scheduling part. As soon as a lane is done, I collect its result using a mask, then I load new indexes from vTreeIndexes:
// detect finished lanes
const vcl::Vec8ib mask = (nodeIndex < zero); // nodeIndex holds 8 indexes to some nodes from 8 different trees. When an index is negative, the tree has been evaluated
// get compact mask
const __mmask8 k = vcl::to_bits(mask);
// load next 8 indexes from static scheduling array
vcl::Vec8i nextIndexes;
nextIndexes.load(vTreeIndexes + nextTreeIndex);
// load shuffle mask
const vcl::Vec8i shuffleVector(ShuffleHelper::loadMasks[k]);
// shuffle (1)
nextIndexes = vcl::lookup8(shuffleVector, nextIndexes);
// replace finished lanes
nodeIndex = vcl::select(mask, nextIndexes, nodeIndex);
(1): 'nextIndexes' holds 8 tree indexes T0, T1, ..., T7. Some lanes are finished (say lanes 2 and 5), so we want T0 shuffled into lane 2 and T1 shuffled into lane 5. We don't care where the other indexes end up, because the select only replaces the finished lanes 2 and 5.
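To make (1) concrete, here is a minimal scalar sketch of how a table like ShuffleHelper::loadMasks could be built. This is only an assumption about its layout (the real table is not shown above), and `buildLoadMasks` is a hypothetical name: for each 8-bit mask k, the n-th finished lane, counting from lane 0, receives element n of the freshly loaded vector.

```cpp
#include <array>
#include <cstdint>

// Hypothetical reconstruction of a table like ShuffleHelper::loadMasks.
// Entry [k][lane] is the index of the freshly loaded element that 'lane'
// should receive when k is the bitmask of finished lanes.
// Unfinished lanes keep index 0: they are "don't care" entries, since the
// later select only overwrites finished lanes.
std::array<std::array<int32_t, 8>, 256> buildLoadMasks() {
    std::array<std::array<int32_t, 8>, 256> table{};  // zero-initialized
    for (int k = 0; k < 256; ++k) {
        int rank = 0;  // how many finished lanes we have seen so far
        for (int lane = 0; lane < 8; ++lane) {
            if (k & (1 << lane)) {
                table[k][lane] = rank++;  // n-th finished lane gets element n
            }
        }
    }
    return table;
}
```

With lanes 2 and 5 finished, k is 0b00100100 (36), so entry 36 maps lane 2 to element 0 (T0) and lane 5 to element 1 (T1), matching the example above.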
This part works well. Building on this approach, I now want to introduce "dynamic scheduling", i.e., when a tree finishes its evaluation, it may indicate the need to evaluate 2 more "conditional trees". So I need to "blend" static and dynamic scheduling, and this is where things are not great.
My first working solution uses another C array to hold temporary "pending" trees:
// part1 - load continuation trees and store them in pendingFifo
//
// load "continuation trees"
vcl::Vec8i continuationTrees = vcl::lookup<16384>(prevNodeIndex, vNodeContinuationTrees);
// check which ones are set
const vcl::Vec8ib continuationMask = continuationTrees > zero;
const __mmask8 continuationK = vcl::to_bits(continuationMask);
const int nbContinuations = __builtin_popcount(continuationK);
// "pack left": move the non-zero entries to the left and the zero entries to the right
const vcl::Vec8i packLeftShuffleVector(ShuffleHelper::packLeftMasks[continuationK]);
continuationTrees = vcl::lookup8(packLeftShuffleVector, continuationTrees);
// each index holds a left continuation
const vcl::Vec8i leftContinuationTrees = continuationTrees & 0x0FFFF;
// and a right continuation
const vcl::Vec8i rightContinuationTrees = continuationTrees >> 16;
// concat all those indexes in the pendingFifo array
leftContinuationTrees.store(pendingFifo.data() + pendingTail);
rightContinuationTrees.store(pendingFifo.data() + pendingTail + nbContinuations);
pendingTail += nbContinuations * 2;
// part 2 - priority 1: load from pendingFifo array, priority 2: complete from static scheduling array
// see how many elements we're going to load from each array
const int availableInPending = pendingTail - pendingHead;
const int loadedFromPending = std::min(availableInPending, nbFinishedTrees);
const int loadedFromNextIndexes = nbFinishedTrees - loadedFromPending; // the complement
// prepare some shift / mask
const vcl::Vec8i shift = id - loadedFromPending; // id is the vector {0, 1, 2, 3, 4, 5, 6, 7}
const vcl::Vec8ib blendMask = (shift < zero);
// load from pending
vcl::Vec8i nextIndexesFromPending;
nextIndexesFromPending.load(pendingFifo.data() + pendingHead);
// we keep only the first 'loadedFromPending' elements
nextIndexesFromPending = vcl::select(blendMask, nextIndexesFromPending, zero);
// load from statically scheduled
vcl::Vec8i nextIndexesFromScheduled;
nextIndexesFromScheduled.load(vScheduledTreeIndexes + nextTreeIndex);
// we right-shift the vector by loadedFromPending and zero out the first loadedFromPending elements
nextIndexesFromScheduled = vcl::select(~blendMask, vcl::lookup8(shift, nextIndexesFromScheduled), zero);
// that way we can concat both vectors together, having
// - loadedFromPending elements from pending on the left, and
// - 8-loadedFromPending elements from scheduled on the right
vcl::Vec8i nextIndexes = nextIndexesFromPending | nextIndexesFromScheduled;
// and as before we shuffle this vector to replace the finished lanes
const vcl::Vec8i loadShuffleVector(ShuffleHelper::loadMasks[k]);
nextIndexes = vcl::lookup8(loadShuffleVector, nextIndexes);
nodeIndex = vcl::select(mask, nextIndexes, nodeIndex);
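For reference, this is what I expect part 2 to compute, written as a plain scalar loop. It is only a sketch under my own assumptions: `RefillState` and `refillLanes` are illustrative names, `pendingFifo.size()` plays the role of pendingTail, and advancing the two read positions is assumed to happen here even though it is not shown in the snippet above. It can serve as an oracle when testing a vectorized version.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of "part 2" (illustrative names, not the real code).
struct RefillState {
    std::vector<int32_t> pendingFifo;  // continuation trees waiting to run
    std::size_t pendingHead = 0;       // next unread slot in pendingFifo
    std::vector<int32_t> scheduled;    // statically scheduled tree indexes
    std::size_t nextTreeIndex = 0;     // next unread slot in scheduled
};

// Refill every finished lane (bit set in finishedMask), in lane order:
// priority 1 is the pending FIFO, priority 2 is the static schedule.
void refillLanes(std::array<int32_t, 8>& nodeIndex,
                 unsigned finishedMask,
                 RefillState& s) {
    for (int lane = 0; lane < 8; ++lane) {
        if (!(finishedMask & (1u << lane))) {
            continue;  // lane is still evaluating its tree
        }
        if (s.pendingHead < s.pendingFifo.size()) {
            nodeIndex[lane] = s.pendingFifo[s.pendingHead++];
        } else if (s.nextTreeIndex < s.scheduled.size()) {
            nodeIndex[lane] = s.scheduled[s.nextTreeIndex++];
        }
        // otherwise no work is left and the lane keeps its finished value
    }
}
```

For example, with lanes 2, 5 and 7 finished, two pending trees and a non-empty schedule, lanes 2 and 5 take the two pending trees and lane 7 takes the first scheduled tree.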
This code works, but I'm pretty sure there is a better solution out there. I tried replacing the pendingFifo array with a vector variable (which could perhaps stay in a register), but that was slower. The way I "blend" the indexes together is probably not good. I'm pretty new to vector programming, so any hint on how to improve the above solution is welcome!
PS: direct use of Intel intrinsics is welcome as well.