
I have recently been trying to parallelize my image smoothing program.

The smoothing algorithm is easy to understand:

#define MAX_SMOOTH_LEVEL 1000
Color rgb[IMG_HEIGHT][IMG_WIDTH], new_rgb[IMG_HEIGHT][IMG_WIDTH];

For level = 0 to MAX_SMOOTH_LEVEL
    For each pixel (i, j) in rgb
        new_rgb[i][j] = (rgb[i][j] + rgb[i-1][j] + rgb[i+1][j] + rgb[i][j+1] + rgb[i][j-1]) / 5;
    Next
    Swap rgb and new_rgb
Next

If rgb[i-1][j] is out of bounds (i-1 < 0), then the color of rgb[IMG_HEIGHT-1][j] is used instead, i.e. the indices wrap around, and similarly for the other edges.
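
For clarity, here is a minimal serial sketch of one smoothing pass with the wrap-around boundaries described above. The image size, the Color type, and the helper names wrap and smooth_pass are just illustrative assumptions, not my actual code:

#define IMG_HEIGHT 512   /* assumed size, for illustration only */
#define IMG_WIDTH  512

typedef struct { float r, g, b; } Color;   /* assumed pixel type */

/* Wrap an index into [0, n): -1 maps to n-1 and n maps to 0. */
static int wrap(int i, int n) { return (i + n) % n; }

/* One smoothing pass: read from rgb, write the averaged result to new_rgb. */
static void smooth_pass(Color rgb[IMG_HEIGHT][IMG_WIDTH],
                        Color new_rgb[IMG_HEIGHT][IMG_WIDTH])
{
    for (int i = 0; i < IMG_HEIGHT; i++) {
        for (int j = 0; j < IMG_WIDTH; j++) {
            Color c     = rgb[i][j];
            Color up    = rgb[wrap(i - 1, IMG_HEIGHT)][j];
            Color down  = rgb[wrap(i + 1, IMG_HEIGHT)][j];
            Color left  = rgb[i][wrap(j - 1, IMG_WIDTH)];
            Color right = rgb[i][wrap(j + 1, IMG_WIDTH)];
            new_rgb[i][j].r = (c.r + up.r + down.r + left.r + right.r) / 5.0f;
            new_rgb[i][j].g = (c.g + up.g + down.g + left.g + right.g) / 5.0f;
            new_rgb[i][j].b = (c.b + up.b + down.b + left.b + right.b) / 5.0f;
        }
    }
}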

The following picture shows my idea for parallelizing the computation; I think it is clear enough.

(picture: the image is split row-wise into horizontal strips, one strip per task; after each smoothing level, all tasks wait for one another before starting the next level)

I would like to ask whether my idea is reasonable. Do we really need to wait for all tasks to finish their work before the next computation?
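
To make the question concrete, here is a rough sketch of the idea in the picture, assuming MPI, a single float channel instead of Color, and that IMG_HEIGHT is divisible by the number of processes (the function name smooth_strip is made up). The blocking boundary-row exchange at the top of the loop is exactly the "wait" I am asking about:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define IMG_WIDTH 512             /* assumed size, for illustration only */
#define MAX_SMOOTH_LEVEL 1000

/* local has local_rows owned rows (indices 1..local_rows) plus two halo rows:
   index 0 holds a copy of the upper neighbour's last row and index
   local_rows + 1 a copy of the lower neighbour's first row. */
void smooth_strip(float local[][IMG_WIDTH], int local_rows,
                  int rank, int size, MPI_Comm comm)
{
    int up   = (rank - 1 + size) % size;   /* wrap-around neighbours, matching */
    int down = (rank + 1) % size;          /* the wrap-around image boundary   */
    float (*next)[IMG_WIDTH] = malloc((size_t)(local_rows + 2) * sizeof *next);

    for (int level = 0; level < MAX_SMOOTH_LEVEL; level++) {
        /* Exchange boundary rows with both neighbours. These blocking calls
           are the synchronization point: no task starts the next level until
           its neighbours have finished the current one. */
        MPI_Sendrecv(local[1],              IMG_WIDTH, MPI_FLOAT, up,   0,
                     local[local_rows + 1], IMG_WIDTH, MPI_FLOAT, down, 0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(local[local_rows],     IMG_WIDTH, MPI_FLOAT, down, 1,
                     local[0],              IMG_WIDTH, MPI_FLOAT, up,   1,
                     comm, MPI_STATUS_IGNORE);

        /* Smooth the owned rows using the freshly received halo rows. */
        for (int i = 1; i <= local_rows; i++)
            for (int j = 0; j < IMG_WIDTH; j++)
                next[i][j] = (local[i][j] + local[i - 1][j] + local[i + 1][j]
                            + local[i][(j - 1 + IMG_WIDTH) % IMG_WIDTH]
                            + local[i][(j + 1) % IMG_WIDTH]) / 5.0f;

        memcpy(&local[1][0], &next[1][0],
               (size_t)local_rows * IMG_WIDTH * sizeof(float));
    }
    free(next);
}

In this sketch the blocking MPI_Sendrecv only synchronizes each strip with its two neighbours rather than making all tasks wait for each other; I am not sure whether that is enough or whether a full barrier is needed.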

sorry for my bad English. I've tried hard to fix grammar errors

Tim Hsu
    Congratulations, you've just reinvented [halo swapping](http://stackoverflow.com/a/17591795/1374437) :) – Hristo Iliev Apr 02 '14 at 14:47
  • I would not parallelize this computation as you suggest. Instead, you should apply a 2D checkerboard domain decomposition. This will lead to much better parallel performance. – Massimo Cafaro Apr 03 '14 at 08:03
  • @MassimoCafaro, could you please justify how decreasing the subdomain size and increasing the number of communications necessary will lead to much better parallel performance? – Hristo Iliev Apr 03 '14 at 13:24
  • @HristoIliev, I am saying exactly the contrary. I am suggesting to agglomerate in two dimensions instead of using the 1D row-wise decomposition proposed in the question. Quoting from the book by Ian Foster: The communication requirements of a task are proportional to the surface of the subdomain on which it operates, while the computation requirements are proportional to the subdomain volume. In a two-dimensional problem, the surface scales with the problem size while the volume scales as the problem size squared. Hence, the communication/computation ratio decreases as task size increases. – Massimo Cafaro Apr 03 '14 at 17:08
  • If the number of communication partners per task is small, we can often reduce both the number of communication operations and the total communication volume by increasing the granularity of our partition, that is, by agglomerating several tasks into one. The reduction in communication costs is due to a surface-to-volume effect. – Massimo Cafaro Apr 03 '14 at 17:09
  • A consequence of surface-to-volume effects is that higher-dimensional decompositions are typically the most efficient, other things being equal, because they reduce the surface area (communication) required for a given volume (computation). Hence, from the viewpoint of efficiency it is usually best to increase granularity by agglomerating tasks in all dimensions rather than reducing the dimension of the decomposition – Massimo Cafaro Apr 03 '14 at 17:10
  • So, since this is a trivial stencil update operation that is always parallelized by agglomerating in 2D or 3D (depending on the application), I really do not see any reason to stick with a 1D row-wise domain decomposition when one can instead use a 2D checkerboard decomposition (since this is an image processing application). – Massimo Cafaro Apr 03 '14 at 17:14
  • The theory in the book is correct for zero-latency networks. Reality is different. The transfer time of small messages is dominated by the MPI setup time and the network latency. Going from 1D to 2D in that particular case leads to the same amount of work per process, but now instead of sending 2 messages of size `W`, each process sends 2 messages of size `W/n` and 2 messages of size `H/m`, where `m x n` is the size of the process grid. Unless sending `W - [W/n + H/m]` grid elements takes more time than the end-to-end MPI latency, going from 1D to 2D will actually slow things down. – Hristo Iliev Apr 03 '14 at 22:27
  • For example, with a square image (`W` = `H`) and 4 processes, going from 1D to 2D with square process grid will result in the same amount of data exchanged with the neighbours, but with twice the latency of the 1D case. Unless the author intends to run his program on hundreds of CPUs with very low-latency network, 2D decomposition will bring unnecessary complexity for such a simple task. – Hristo Iliev Apr 03 '14 at 22:39
  • Indeed, I assume a very large input to be processed in parallel using a real supercomputer with a very low-latency network, such as the ones I use. I am NOT talking of self-assembled beowulf cluster connected by low-cost ethernet switches. No off-the-shelf components are involved in my reasoning ;-) – Massimo Cafaro Apr 04 '14 at 07:17
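
Spelling out the comparison from the last two comments of Hristo Iliev, assuming a square image with W = H and 4 processes:

1D decomposition (4 strips):   each process sends 2 messages of W elements          -> volume 2W, 2 message latencies per level
2D decomposition (2 x 2 grid): each process sends 2 messages of W/2 and 2 of H/2    -> volume 2W, 4 message latencies per level

The exchanged volume is identical, but the 2D layout pays roughly twice the per-level latency, which is why the choice of decomposition mostly matters at large process counts or on very low-latency networks.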

0 Answers