I need to synchronize intermediate solutions of an optimization problem that is solved in a distributed fashion across a number of worker processes. The solution vector is known to be sparse.
I have noticed that MPI_Allreduce performs well compared to my own allreduce implementation.
However, I believe the performance could be improved further if the allreduce communicated only the nonzero entries of the solution vector. I could not find any such implementation.
Any ideas?
It seems that MPI_Type_indexed cannot be used, because the indices of the nonzero entries are not known in advance.
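To make the idea concrete, here is a minimal sketch of the two local steps such a sparse allreduce would need, assuming the reduction is a sum. The function names pack_nonzeros and accumulate_pairs are hypothetical, not part of MPI. Between the two steps, each rank would presumably share its nonzero count with MPI_Allgather and then the (index, value) pairs with MPI_Allgatherv; that exchange is left out here so the block stays self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pack the nonzero entries of x[0..n-1] into the
   parallel arrays idx/val. Both output arrays must have room for n
   entries. Returns the number of nonzeros written. */
size_t pack_nonzeros(const double *x, size_t n, int *idx, double *val)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (x[i] != 0.0) {
            idx[count] = (int)i;
            val[count] = x[i];
            ++count;
        }
    }
    return count;
}

/* Hypothetical helper: add `count` (index, value) pairs into the dense
   accumulator acc[0..n-1]. Calling this once per rank's gathered pairs
   yields the element-wise sum (assumed reduction operation). */
void accumulate_pairs(double *acc, size_t n,
                      const int *idx, const double *val, size_t count)
{
    for (size_t k = 0; k < count; ++k) {
        assert(idx[k] >= 0 && (size_t)idx[k] < n);
        acc[idx[k]] += val[k];
    }
}
```

Whether this beats a dense MPI_Allreduce likely depends on how sparse the vectors are, since every rank receives the nonzeros of all other ranks.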