I am looking for a Python library that extends the functionality of numpy to operations on a distributed-memory cluster: i.e. "a parallel programming model in which the programmer views an array as a single global array rather than multiple, independent arrays located on different processors."
For Matlab, MIT's Lincoln Laboratory has created pMatlab, which allows one to do matrix algebra on a cluster without worrying too much about the details of the parallel programming aspect. (This is the origin of the quote above.)
For disk-based storage, PyTables exists for Python. However, it does not optimise how calculations are distributed across a cluster, but rather how calculations are "distributed" with respect to data too large to fit in memory. That is reasonably similar in spirit, but still missing a crucial aspect.
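For comparison, this is roughly what the out-of-core style looks like in PyTables, which evaluates expressions blockwise via numexpr (a minimal sketch; the file and array names are made up):

```python
import numpy as np
import tables

# Two large arrays that live on disk rather than in memory.
f = tables.open_file("data.h5", mode="w")
a = f.create_array(f.root, "a", np.random.rand(1_000_000))
b = f.create_array(f.root, "b", np.random.rand(1_000_000))

# tables.Expr evaluates the expression block by block (via numexpr),
# so the full operands never have to be loaded into memory at once.
expr = tables.Expr("2 * a + b")
result = expr.eval()  # returns a numpy array with the result

f.close()
```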
The aim is not to squeeze the last bit of performance out of a cluster, but to do scientific calculations (semi-interactively) that are too large for a single machine.
Does something similar exist for Python? My wishlist would be:
- actively maintained
- drop-in replacement for numpy (see the sketch after this list for the kind of usage I have in mind)
- alternatively, usage similar to numexpr
- a high level of abstraction over the parallel programming part: i.e. no need for the user to use MPI explicitly
- support for data-locality in distributed memory clusters
- support for multi-core machines in the cluster
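To make the wishlist concrete, here is a purely hypothetical sketch of the usage I am hoping for; the module `distarray_lib` and its entire API are made up for illustration, not a real package:

```python
# Hypothetical: "distarray_lib" is a made-up stand-in for the kind of
# package I am looking for; none of these names exist anywhere.
import distarray_lib as dnp

# The array is declared once and transparently partitioned across the
# nodes of the cluster; the user never touches MPI directly.
a = dnp.ones((100_000, 100_000))
b = dnp.random.rand(100_000, 100_000)

# Operations look exactly like numpy, but each node computes only on
# the blocks it owns (data locality), using all of its cores.
c = dnp.dot(a, b) + 2 * a

# Only the scalar reduction result travels back to the interactive session.
print(c.sum())
```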
This is probably a bit like believing in the tooth fairy, but one never knows...
What I have found so far:
There is (or used to be) a Python interface for Global Arrays by the Pacific Northwest National Laboratory. See the links under the topic "High Performance Parallel Computing in Python using NumPy and the Global Arrays Toolkit" (especially "GA_SciPy2011_Tutorial.pdf"). However, this seems to have disappeared again.
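From what I remember of that tutorial, the Python bindings also shipped a numpy-compatible layer called GAiN ("Global Arrays in NumPy"), whose intended usage looked roughly like the following; I am reconstructing the import path `ga.gain` from memory, so treat every identifier here as an assumption:

```python
# Reconstructed from memory of the GA_SciPy2011 tutorial; the import
# path "ga.gain" is an assumption and may not match the real package.
import ga.gain as np  # drop-in numpy replacement backed by Global Arrays

# The script would be launched under MPI, e.g. "mpiexec -np 16 python demo.py",
# but the code itself contains no explicit MPI calls.
a = np.ones((10_000, 10_000))
b = a * 3 + 1   # computed in parallel on the distributed blocks
print(b.sum())  # reduction across the whole cluster
```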
DistNumPy: described in more detail in this paper. However, the project appears to have been abandoned.
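As far as I understand the paper, DistNumPy kept the numpy API and added a flag on array creation that marks an array as distributed; a sketch of that idea follows (the `dist=True` keyword is taken from my reading of the paper and may be inexact):

```python
# Sketch of the DistNumPy model as I read it in the paper: creation
# routines take an extra dist=True flag; everything else is plain numpy.
import numpy as np  # DistNumPy was a modified build of numpy itself

a = np.ones((10_000, 10_000), dist=True)    # block-distributed across nodes
b = np.zeros((10_000, 10_000), dist=True)

# Ordinary numpy expressions; the runtime handles all communication.
c = a + b * 2
```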
If you know of any such package, or have used either of the two above, please describe your experiences with them.