I want to balance a set of matrix blocks between processes. The blocks have different sizes; typically a single block dominates, being comparable in size to, or even larger than, all the other blocks combined. The number of processes may be anywhere from much smaller to much larger than the number of blocks. Each block can be stored either on a single process or distributed as a ScaLAPACK array. The balancing should qualitatively fulfill the following conditions:
- No process should receive many more matrix elements than target_load = sum(size(blocks[:])) / n_procs
- No block should be distributed over many more processes than size(block) / target_load. MPI communicators may be split off from MPI_COMM_WORLD
- MPI communicators must not partially overlap (blocks 1 and 2 both being distributed over processes 0:4 is fine, but block 1 being distributed over processes 0:3 and block 2 over processes 2:5 is not; undistributed blocks may be stacked arbitrarily on top of distributed blocks)
I am aware that any such distribution will depend on how strongly, and in which order of priority, the first two conditions are applied (the third condition should hold strictly). Nonetheless, is there an algorithm that facilitates some interpretation of these conditions?
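To make one possible interpretation of the conditions concrete, here is a hedged greedy sketch (the function name `distribute` and the fallback policy are my own assumptions, not a proposed final answer): large blocks each get a dedicated, disjoint, contiguous range of processes sized by condition 2, and remaining blocks are kept undistributed and stacked on the currently least-loaded process, which condition 3 explicitly permits.

```python
import math

def distribute(block_sizes, n_procs):
    """Greedy sketch: blocks larger than target_load get disjoint contiguous
    process ranges; the rest are stacked on the least-loaded single process.
    This is one interpretation of the conditions, not a definitive algorithm."""
    total = sum(block_sizes)
    target = total / n_procs          # condition 1: per-process target load
    load = [0.0] * n_procs            # matrix elements currently per process
    assignment = {}                   # block index -> list of process ranks
    next_free = 0                     # next rank without a distributed block
    # Place the dominant blocks first (largest-first greedy order).
    order = sorted(range(len(block_sizes)), key=lambda i: -block_sizes[i])
    for i in order:
        s = block_sizes[i]
        # Condition 2: use at most ~size/target_load processes per block.
        n_p = min(max(1, math.floor(s / target)), n_procs)
        if n_p > 1 and next_free + n_p <= n_procs:
            # Disjoint contiguous range -> communicators never partially
            # overlap (condition 3), and can be split from MPI_COMM_WORLD.
            ranks = list(range(next_free, next_free + n_p))
            next_free += n_p
        else:
            # Undistributed block: stack on the least-loaded process,
            # which may also hold part of a distributed block (allowed).
            ranks = [min(range(n_procs), key=lambda r: load[r])]
        share = s / len(ranks)
        for r in ranks:
            load[r] += share
        assignment[i] = ranks
    return assignment, load
```

For example, `distribute([100, 10, 10, 10], 4)` gives the dominant block a three-process range `[0, 1, 2]` and stacks the three small blocks on rank 3, so no process ends up far above `target_load = 32.5`. The priority between conditions 1 and 2 enters through the `floor` in `n_p`; replacing it with `ceil` would favor condition 1 at the cost of spreading blocks thinner.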