I am looking for a Python library that extends the functionality of numpy to operations on a distributed-memory cluster: i.e. "a parallel programming model in which the programmer views an array as a single global array rather than multiple, independent arrays located on different processors."
For Matlab, MIT's Lincoln Laboratory has created pMatlab, which allows one to do matrix algebra on a cluster without worrying too much about the details of the parallel programming aspect. (This is the origin of the quote above.)
For disk-based storage, PyTables exists for Python. However, it does not optimise how calculations are distributed across a cluster, but rather how calculations are "distributed" with respect to data too large to fit in memory. That is reasonably similar in spirit, but still missing a crucial aspect.
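For comparison, this is roughly what the out-of-core style looks like in PyTables, which evaluates expressions blockwise via numexpr (a minimal sketch; the file and array names are made up):

```python
import numpy as np
import tables

# Two large arrays that live on disk rather than in memory.
f = tables.open_file("data.h5", mode="w")
a = f.create_array(f.root, "a", np.random.rand(1_000_000))
b = f.create_array(f.root, "b", np.random.rand(1_000_000))

# tables.Expr evaluates the expression block by block (via numexpr),
# so the full operands never have to be loaded into memory at once.
expr = tables.Expr("2 * a + b")
result = expr.eval()  # returns a numpy array with the result

f.close()
```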
The aim is not to squeeze the last bit of performance out of a cluster, but to do scientific calculations (semi-interactively) that are too large for a single machine.
Does something similar exist for Python? My wishlist would be:
- actively maintained
- drop-in replacement for numpy (see the sketch after this list for the kind of usage I have in mind)
- alternatively, usage similar to numexpr
- a high level of abstraction over the parallel programming part: i.e. no need for the user to use MPI explicitly
- support for data-locality in distributed memory clusters
- support for multi-core machines in the cluster
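To make the wishlist concrete, here is a purely hypothetical sketch of the usage I am hoping for; the module `distarray_lib` and its entire API are made up for illustration, not a real package:

```python
# Hypothetical: "distarray_lib" is a made-up stand-in for the kind of
# package I am looking for; none of these names exist anywhere.
import distarray_lib as dnp

# The array is declared once and transparently partitioned across the
# nodes of the cluster; the user never touches MPI directly.
a = dnp.ones((100_000, 100_000))
b = dnp.random.rand(100_000, 100_000)

# Operations look exactly like numpy, but each node computes only on
# the blocks it owns (data locality), using all of its cores.
c = dnp.dot(a, b) + 2 * a

# Only the scalar reduction result travels back to the interactive session.
print(c.sum())
```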
This is probably a bit like believing in the tooth fairy, but one never knows...
What I have found so far:
There is (or used to be) a Python interface for Global Arrays by the Pacific Northwest National Laboratory. See the links under the topic "High Performance Parallel Computing in Python using NumPy and the Global Arrays Toolkit" (especially "GA_SciPy2011_Tutorial.pdf"). However, this seems to have disappeared again.
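From what I remember of that tutorial, the Python bindings also shipped a numpy-compatible layer called GAiN ("Global Arrays in NumPy"), whose intended usage looked roughly like the following; I am reconstructing the import path `ga.gain` from memory, so treat every identifier here as an assumption:

```python
# Reconstructed from memory of the GA_SciPy2011 tutorial; the import
# path "ga.gain" is an assumption and may not match the real package.
import ga.gain as np  # drop-in numpy replacement backed by Global Arrays

# The script would be launched under MPI, e.g. "mpiexec -np 16 python demo.py",
# but the code itself contains no explicit MPI calls.
a = np.ones((10_000, 10_000))
b = a * 3 + 1   # computed in parallel on the distributed blocks
print(b.sum())  # reduction across the whole cluster
```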
DistNumPy: described in more detail in this paper. However, the project appears to have been abandoned.
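As far as I understand the paper, DistNumPy kept the numpy API and added a flag on array creation that marks an array as distributed; a sketch of that idea follows (the `dist=True` keyword is taken from my reading of the paper and may be inexact):

```python
# Sketch of the DistNumPy model as I read it in the paper: creation
# routines take an extra dist=True flag; everything else is plain numpy.
import numpy as np  # DistNumPy was a modified build of numpy itself

a = np.ones((10_000, 10_000), dist=True)    # block-distributed across nodes
b = np.zeros((10_000, 10_000), dist=True)

# Ordinary numpy expressions; the runtime handles all communication.
c = a + b * 2
```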
If you know of any such package, or have used either of the two above, please describe your experiences with them.