What is the correct workflow for parallelization: running on a cluster or with multiple processes?

Question

I want to provide a function along the lines of parallelize.map(function, args) that returns a list of results while hiding from the user how the work is actually distributed. One of the functions I want to parallelize uses subprocess to call another Unix program that itself benefits from multiple cores.
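To make the multi-core subprocess case concrete, here is a minimal local-only sketch. The names run_tool and local_map are hypothetical, the external program is replaced by a stand-in Python one-liner, and passing the core count through OMP_NUM_THREADS is an assumption that only holds for tools that honor that variable. The point is that if each call uses several cores, the number of concurrent workers should shrink accordingly.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_tool(arg, cores_per_job=1):
    """Run one external command and return its stripped stdout."""
    # Stand-in for the real Unix program (hypothetical): upper-cases its argument.
    cmd = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", arg]
    # Assumption: the real tool reads its thread count from OMP_NUM_THREADS.
    env = dict(os.environ, OMP_NUM_THREADS=str(cores_per_job))
    result = subprocess.run(cmd, capture_output=True, text=True,
                            check=True, env=env)
    return result.stdout.strip()


def local_map(function, args, cores_per_job=1):
    """Map function over args, limiting workers by cores used per call."""
    total_cores = os.cpu_count() or 1
    workers = max(1, total_cores // cores_per_job)
    # Threads are enough here: the heavy work happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda a: function(a, cores_per_job), args))
```

Calling local_map(run_tool, ["a", "b"], cores_per_job=2) would run at most cpu_count // 2 commands at once and return the results in input order.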

I first tried ipython-cluster-helper. This works well with my setup, but I ran into problems installing it on several other machines. I also have to ask for names of clusters during setup. I haven't seen other programs start jobs on clusters for you, so I don't know if that is accepted practice.

joblib seems to be the standard for parallelization, but it can only use a single computer (or a single cluster node) at a time. It works too, but it is significantly slower because it does not use the whole cluster.

Also, the server I run this code on complains when a program runs for too long, to make sure people use the cluster instead. If I used joblib, would I have to write a separate script that runs this program only on our cluster?

For now, I have added special parameters to setup.py that record the cluster names and install ipython-cluster-helper if necessary. When map is called, it first checks whether ipython-cluster-helper and the cluster names are available; if so, it uses them, and otherwise it falls back to joblib.
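A minimal sketch of that fallback logic, assuming the cluster settings arrive as plain arguments rather than from setup.py. The function name parallel_map and its parameters are hypothetical; cluster_view is the context manager shown in the ipython-cluster-helper README, and the serial branch is an extra assumption so that the call still works when neither package is installed.

```python
def parallel_map(function, args, scheduler=None, queue=None, num_jobs=1):
    """Map function over args on a cluster if configured, else locally."""
    args = list(args)

    # Try the cluster only when both settings were supplied.
    if scheduler and queue:
        try:
            from cluster_helper.cluster import cluster_view
        except ImportError:
            cluster_view = None  # package missing: fall through to joblib
        if cluster_view is not None:
            with cluster_view(scheduler=scheduler, queue=queue,
                              num_jobs=num_jobs) as view:
                return list(view.map(function, args))

    # Local fallback: joblib on the current machine.
    try:
        from joblib import Parallel, delayed
    except ImportError:
        # Neither backend is available, so run serially.
        return [function(a) for a in args]
    return Parallel(n_jobs=num_jobs)(delayed(function)(a) for a in args)
```

With this shape the caller writes parallel_map(function, args) everywhere, and only the optional scheduler and queue values decide whether the work goes to a cluster.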

What are other ways of achieving this? I'm looking for a standard approach that works on most machines, with or without a cluster, so that I can release the code and make it easy to use.

Thanks.
