I'm working with a C program that is parallelized with MPI. The computational grid is partitioned using a standard 2D domain decomposition. The process layout is as follows (simplified):
///////////////////////////////
/         /         /         /
/    1    /    2    /    3    /
/         /         /         /
///////////////////////////////
/         /         /         /
/    4    /    5    /    6    /
/         /         /         /
///////////////////////////////
/         /         /         /
/    7    /    8    /    9    /
/         /         /         /
///////////////////////////////
At one point in the code, I have to solve a series of equations that have x-dependencies only. With the topology in its current form, this step can only be parallelized with 3 processes at a time, because of the x-dependency. This leads to my question: is there a convenient and efficient way to remap the current topology, within the code, to one that allows full parallelization, i.e. using all 9 processes? For example, something like this:
///////////////////////////////
/              1              /
///////////////////////////////
/              2              /
///////////////////////////////
/              3              /
///////////////////////////////
/              4              /
///////////////////////////////
/              5              /
///////////////////////////////
/              6              /
///////////////////////////////
/              7              /
///////////////////////////////
/              8              /
///////////////////////////////
/              9              /
///////////////////////////////
One might ask why I don't start with this layout. The 2D domain decomposition is much more efficient for the overall problem, and I also have y-dependencies later, where I need to do something similar with the topology; in that case the layout above would be transposed.
So I need to map the 2D topology to the 1D topology on the fly within the code, using some communication routines, to enable full parallelization with 9 processes. I'm not sure whether there is an efficient and effective way of doing this compared with running the x-dependent solve with only 3 processes in parallel. Any suggestions would be helpful. Thanks!!
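To make the remapping concrete, here is a minimal sketch of the bookkeeping I think it would need. It is plain C with no MPI calls, and everything in it is an assumption on my part: the helper names `block_range` and `row_transpose_counts` are hypothetical, the grid is assumed to be split into contiguous, nearly equal blocks, and `ny_row` stands for the number of y-lines owned by one process row. The idea is that processes 1, 2, 3 already hold the same y-range between them, so going from the 2D blocks to x-complete strips would only require an exchange inside each process row. The function computes how many cells one process would send to, and receive from, each peer in its row.

```c
#include <assert.h>

/* Split n items into p contiguous, nearly equal parts.
   Part i covers [*start, *start + *count). Hypothetical helper. */
static void block_range(int n, int p, int i, int *start, int *count)
{
    int base = n / p;
    int rem  = n % p;
    *count = base + (i < rem ? 1 : 0);
    *start = i * base + (i < rem ? i : rem);
}

/* For process `me` (0-based) in a process row of `px` processes:
   sendcounts[q] = cells of my 2D block that peer q owns after the remap,
   recvcounts[q] = cells of my x-complete strip that currently live on peer q.
   nx     : global number of points in x (assumption)
   ny_row : number of y-lines owned by this process row (assumption) */
void row_transpose_counts(int nx, int ny_row, int px, int me,
                          int *sendcounts, int *recvcounts)
{
    int xs, xc, ys, yc;

    assert(px > 0 && me >= 0 && me < px);

    /* My x-range in the 2D layout; I keep all ny_row lines of it for now. */
    block_range(nx, px, me, &xs, &xc);
    for (int q = 0; q < px; q++) {
        /* y-lines that peer q owns in the strip layout */
        block_range(ny_row, px, q, &ys, &yc);
        sendcounts[q] = xc * yc;
    }

    /* My y-lines in the strip layout; I collect every peer's x-range of them. */
    block_range(ny_row, px, me, &ys, &yc);
    for (int q = 0; q < px; q++) {
        block_range(nx, px, q, &xs, &xc);
        recvcounts[q] = xc * yc;
    }
}
```

If this is the right idea, I assume these counts (plus matching displacements and a packing step) could be passed to something like `MPI_Alltoallv` on a per-row communicator, obtained for example from `MPI_Cart_sub` or `MPI_Comm_split`, but I have not tested that, and I don't know whether the communication cost would outweigh the gain from using all 9 processes.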