I have a distributed array of rankings, of total size N
irregularly distributed amongst NP
processors, from which I need to extract the K
largest elements. In the limit that K << N
, K
is smaller than any of the local buffer lengths, and K
is relatively small in general (such that it can e.g. fit in reasonable MPI buffers), the following algorithm seems to work well
- Perform a local top-K search to determine the largest
K
values in each local array segment - Perform a custom Allreduce which performs binary top-K reductions between buffers of size
K
coming from different processes.
This can be done in a semi-communication-optimal way given the communication patterns underlying MPI_Allreduce
.
I'm unclear how this can be done efficiently without the above assumptions about the size of K
relative to N
and the local buffer sizes. In particular, I'm trying to determine an optimal (or reasonably scaling) algorithm that is compatible with the following:
K
can be larger than some or all of the local buffer dimensionsK
can be so large is to be impractical to communicate entirely (e.g. trying to determine the top billion elements of a 10 billon element array)
The full array nor the top-K
elements need to be sorted on completion.
For arrays which reside on a single processing element, the following questions are related:
Collect top K elements from multiple sorted arrays
Average time complexity of finding top-k elements
Optimal algorithm for returning top k values from an array of length N