AVX2/VCL : static/dynamic lane scheduling

Question

I have been trying to speed up a binary tree evaluation algorithm using AVX2. I'm actually using Agner Fog's VCL library, since the performance difference between hand-coding the algorithm with intrinsics and using VCL was small, and VCL is much more readable.

I have a list of trees that need to be evaluated. I put the list of tree indexes in an array 'vTreeIndexes'; this is the static scheduling part. As soon as a lane is done, I collect its result using a mask, then load new indexes from vTreeIndexes.

// detect finished lanes
// nodeIndex holds 8 indexes to nodes from 8 different trees;
// a negative index means that tree has been fully evaluated
const vcl::Vec8ib mask = (nodeIndex < zero);
// get compact mask
const __mmask8    k    = vcl::to_bits(mask);
// load next 8 indexes from static scheduling array
vcl::Vec8i nextIndexes;
nextIndexes.load(vTreeIndexes + nextTreeIndex);
// load shuffle mask
const vcl::Vec8i shuffleVector(ShuffleHelper::loadMasks[k]);
// shuffle (1)
nextIndexes = vcl::lookup8(shuffleVector, nextIndexes);
// replace finished lanes
nodeIndex = vcl::select(mask, nextIndexes, nodeIndex);

(1): 'nextIndexes' holds 8 tree indexes T0, T1, ..., T7. Say lanes 2 and 5 are finished: we want T0 shuffled to lane 2 and T1 shuffled to lane 5. We don't care where the other indexes end up, because the select with the mask only replaces the finished lanes 2 and 5.
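To make the shuffle in (1) concrete, here is a minimal sketch of how a table like ShuffleHelper::loadMasks could be generated. The helper itself is not shown in the question, so the function name makeLoadMasks, the LaneTable type and the choice of index 0 for unfinished lanes are assumptions made for illustration only.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for ShuffleHelper::loadMasks (the real helper is not shown).
// table[k][lane] is the element of the freshly loaded vector that should land in
// 'lane' when the finished-lanes bitmask is k.
using LaneTable = std::array<std::array<int32_t, 8>, 256>;

inline LaneTable makeLoadMasks() {
    LaneTable table{};  // value-initialized: every entry starts at 0
    for (int k = 0; k < 256; ++k) {
        int next = 0;  // next fresh element to hand out (T0, T1, ...)
        for (int lane = 0; lane < 8; ++lane) {
            if (k & (1 << lane)) {
                table[k][lane] = next++;  // n-th finished lane receives element n
            }
            // unfinished lanes keep index 0: the later select() discards them
        }
    }
    return table;
}
```

With lanes 2 and 5 finished, k is 0b00100100 (36), so table[36][2] is 0 and table[36][5] is 1, which is the T0 to lane 2, T1 to lane 5 mapping described above.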

This part works well. Building on this approach, I now want to introduce "dynamic scheduling", i.e. when a tree finishes its evaluation, it may indicate the need to evaluate 2 more "conditional trees". So I need to blend static and dynamic scheduling, and this is where things are not great.

My first working solution uses another C array to hold temporary "pending" trees:

// part1 - load continuation trees and store them in pendingFifo
//
// load "continuation trees"
vcl::Vec8i        continuationTrees = vcl::lookup<16384>(prevNodeIndex, vNodeContinuationTrees);
// check which ones are set
const vcl::Vec8ib continuationMask  = continuationTrees > zero;
const __mmask8    continuationK     = vcl::to_bits(continuationMask);
const int         nbContinuations   = __builtin_popcount(continuationK);
// "pack left": move the non-zero ones to the left, the zero ones to the right
const vcl::Vec8i  packLeftShuffleVector(ShuffleHelper::packLeftMasks[continuationK]);
continuationTrees                       = vcl::lookup8(packLeftShuffleVector, continuationTrees);
// each index holds a left continuation
const vcl::Vec8i leftContinuationTrees  = continuationTrees & 0x0FFFF;
// and a right continuation
const vcl::Vec8i rightContinuationTrees = continuationTrees >> 16;
// concat all those indexes in the pendingFifo array
leftContinuationTrees.store(pendingFifo.data() + pendingTail);
rightContinuationTrees.store(pendingFifo.data() + pendingTail + nbContinuations);
pendingTail += nbContinuations * 2;
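For readers unfamiliar with the "pack left" step, the following is a minimal sketch of how a table like ShuffleHelper::packLeftMasks could be built. The real helper is not shown in the question, so makePackLeftMasks, LaneTable and the ordering of the unset lanes are assumptions for illustration.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for ShuffleHelper::packLeftMasks (the real helper is not shown).
// table[k][j] is the source lane whose value ends up in output lane j, so that
// lookup8(table[k], v) moves the lanes selected by k to the front, in order.
using LaneTable = std::array<std::array<int32_t, 8>, 256>;

inline LaneTable makePackLeftMasks() {
    LaneTable table{};
    for (int k = 0; k < 256; ++k) {
        int out = 0;
        // first the lanes whose bit is set in k (the non-zero continuations)
        for (int lane = 0; lane < 8; ++lane) {
            if (k & (1 << lane)) table[k][out++] = lane;
        }
        // then the remaining lanes, so that the result is a full permutation
        for (int lane = 0; lane < 8; ++lane) {
            if (!(k & (1 << lane))) table[k][out++] = lane;
        }
    }
    return table;
}
```

For example, with continuations set in lanes 2 and 5 (k = 36), the first two entries are 2 and 5, so those two values are packed into lanes 0 and 1.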

// part 2: priority 1: load from pendingFifo array, priority 2: complete from static scheduling array
// work out how many elements we're going to load from each array
const int availableInPending    = pendingTail - pendingHead;
const int loadedFromPending     = std::min(availableInPending, nbFinishedTrees);
const int loadedFromNextIndexes = nbFinishedTrees - loadedFromPending; // the complement

// prepare some shift / mask
const vcl::Vec8i  shift     = id - loadedFromPending; // id is the vector {0, 1, 2, 3, 4, 5, 6, 7}
const vcl::Vec8ib blendMask = (shift < zero);

// load from pending
vcl::Vec8i nextIndexesFromPending;
nextIndexesFromPending.load(pendingFifo.data() + pendingHead);
// we keep only the first 'loadedFromPending' elements
nextIndexesFromPending = vcl::select(blendMask, nextIndexesFromPending, zero);

// load from statically scheduled
vcl::Vec8i nextIndexesFromScheduled;
nextIndexesFromScheduled.load(vScheduledTreeIndexes + nextTreeIndex);
// we right-shift the vector by loadedFromPending and zero out the first loadedFromPending elements
nextIndexesFromScheduled = vcl::select(~blendMask, vcl::lookup8(shift, nextIndexesFromScheduled), zero);

// that way we can concat both vectors together, having
// - loadedFromPending elements from pending on the left, and 
// - 8-loadedFromPending elements from scheduled on the right
vcl::Vec8i nextIndexes = nextIndexesFromPending | nextIndexesFromScheduled;

// and as before we shuffle this vector to replace the finished lanes
const vcl::Vec8i loadShuffleVector(ShuffleHelper::loadMasks[k]);
nextIndexes = vcl::lookup8(loadShuffleVector, nextIndexes);
nodeIndex = vcl::select(mask, nextIndexes, nodeIndex);
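As a reference for what the blend in part 2 is meant to compute before the final lane shuffle, here is a scalar sketch. It is not the vectorized code, and the function name blendNextIndexes and its parameters are hypothetical. It can be useful as a test oracle when experimenting with alternative vector formulations.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Scalar model of the "pending first, then scheduled" blend (illustrative only).
// 'pending' points at pendingFifo.data() + pendingHead,
// 'scheduled' points at vScheduledTreeIndexes + nextTreeIndex.
inline std::array<int32_t, 8> blendNextIndexes(const int32_t* pending,
                                               int availableInPending,
                                               const int32_t* scheduled,
                                               int nbFinishedTrees) {
    const int loadedFromPending = std::min(availableInPending, nbFinishedTrees);
    std::array<int32_t, 8> next{};
    for (int lane = 0; lane < 8; ++lane) {
        // the first loadedFromPending lanes come from the pending FIFO,
        // the rest come from the scheduled array, shifted by loadedFromPending
        next[lane] = (lane < loadedFromPending)
                         ? pending[lane]
                         : scheduled[lane - loadedFromPending];
    }
    return next;
}
```

With 2 trees available in the pending FIFO and 5 finished lanes, the result is 2 pending indexes followed by the first 6 scheduled indexes; only the first 5 are consumed once the load shuffle and select are applied.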

This code works, but I'm pretty sure there is a better solution. I tried replacing the pendingFifo array with a vector variable (which might stay in a register), but that was slower. The way I "blend" the indexes together is probably the weak point. I'm pretty new to vector programming, so any hints on how to improve the above solution are welcome!

PS: direct use of Intel intrinsics is welcome as well.

At first glance seems like a similar problem to SIMD Mandelbrot, except in that case you can keep iterating an "escaped" point without harm, no risk of segfaults like your gather loads. If you're already using `__mmask8`, can you use other AVX-512 masking features to mask your loads, maybe only checking every few iterations if it's time to replace some elements that have reached the end of their chain of pointers? — Peter Cordes, Sep 13 '22 at 22:12
Are you referring to https://github.com/skeeto/mandel-simd? I'm not sure what you're pointing to in that code. Also, maybe my question is not clear enough. What I think is inefficient is: (1) the use of a C array to hold the pending "continuation trees", which I should be able to replace with a Vec32s or a Vec16s; (2) the way I fill the pending C array using 2 overlapping stores; (3) the way I then create nextIndexes by blending elements from the pending array and the scheduled array (shift + mask). This code is approximately twice as slow as the "static scheduling only" version. — David Jobet, Sep 14 '22 at 07:18
I think I guessed wrong about how your code works; I didn't take the time to really understand it, just kind of guessed how it might work from skimming the text, picturing tree traversal via pointers. Sorry about that. — Peter Cordes, Sep 14 '22 at 07:32
