I have an existing bit of Python code that runs in parallel across the cores in my machine. The job it completes is basically: open an input file, read the contents, perform some fairly heavy maths, write the results to an output file, then take the next file in the loop and do it again. To make this parallel across many cores I use the Pool class from the multiprocessing library. As a quick example:

import multiprocessing
import time

data = (
    ['a', '2'], ['b', '4'], ['c', '6'], ['d', '8'],
    ['e', '1'], ['f', '3'], ['g', '5'], ['h', '7']
)

def mp_worker(args):
    # each item of data is a [label, delay] pair
    inputs, the_time = args
    print(" Process %s\tWaiting %s seconds" % (inputs, the_time))
    time.sleep(int(the_time))
    print(" Process %s\tDONE" % inputs)

def mp_handler():
    # pool of 8 worker processes; map blocks until every item is done
    p = multiprocessing.Pool(8)
    p.map(mp_worker, data)

if __name__ == '__main__':
    mp_handler()

This example simply shows how I've implemented multiprocessing.Pool across 8 cores. In reality the mp_worker function in my code is much more complex, but you get my drift.
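
For a little more context, the real job is shaped roughly like this (do_heavy_maths, the inputs/*.dat pattern, and the .out suffix are just placeholders):

import multiprocessing
import glob

def do_heavy_maths(contents):
    # stand-in for the real number crunching
    return len(contents)

def process_file(in_path):
    # one unit of work: read a file, compute, write the result
    with open(in_path) as f:
        result = do_heavy_maths(f.read())
    with open(in_path + '.out', 'w') as f:
        f.write(str(result))

if __name__ == '__main__':
    p = multiprocessing.Pool(8)
    p.map(process_file, sorted(glob.glob('inputs/*.dat')))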

I've come to realise that the network I'm working on has several machines sitting idle for 99% of their time. I therefore wondered if there is a way to make use of their cores as well as my local cores in this code.

In pseudocode, it could become something like:

def mp_handler():
    p = multiprocessing.Pool(servers=['localhost', '192.168.0.1', '192.168.0.2'], ncores=[8, 8, 4])
    p.map(mp_worker, data)

Here I could specify both my local machine and other IP addresses as servers, together with the number of cores I'd like to use on each machine.

Since the other machines on my network are owned by me and are not internet connected, I'm not fussed about using SSH for security purposes.

Googling around, I've noticed that the pathos and scoop libraries may be able to help me with this. It looks like pathos has commands very similar to the multiprocessing library, which really appeals to me. However, in neither case can I find a simple example showing how to convert my local parallel job into a distributed one. I'm keen to stay as close to the Pool/map functionality of the multiprocessing library as possible.

Any help or examples would be much appreciated!

Mark

1 Answer


The example below, from pathos, is pretty much like your pseudocode.

from pathos.parallel import stats
from pathos.parallel import ParallelPool as Pool

pool = Pool()

def host(id):
    # imports live inside the function so they are available on remote workers
    import socket
    import time
    time.sleep(1.0)
    return "Rank: %d -- %s" % (id, socket.gethostname())

print("Evaluate 10 items on 2 cpus")
pool.ncpus = 2                          # local worker processes
pool.servers = ('localhost:5653',)      # remote (or tunneled) ppservers
res5 = pool.map(host, range(10))
print(pool)
print('\n'.join(res5))
print(stats())
print('')

Above, you could have set ncpus and servers as keywords when initializing the Pool instance.
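
That is, the following would have been equivalent:

pool = Pool(ncpus=2, servers=('localhost:5653',))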

The results look like this:

Evaluate 10 items on 2 cpus
<pool ParallelPool(ncpus=2, servers=('localhost:5653',))>
Rank: 0 -- hilbert.local
Rank: 1 -- hilbert.local
Rank: 2 -- hilbert.local
Rank: 3 -- hilbert.local
Rank: 4 -- hilbert.local
Rank: 5 -- hilbert.local
Rank: 6 -- hilbert.local
Rank: 7 -- hilbert.local
Rank: 8 -- hilbert.local
Rank: 9 -- hilbert.local
Job execution statistics:
 job count | % of all jobs | job time sum | time per job | job server
        10 |        100.00 |      10.0459 |     1.004588 | local
Time elapsed since server creation 5.0402431488
0 active tasks, 2 cores

If you have more than one server, potentially remote, you just add more entries to the servers tuple. Admittedly, that's not a perfect example, as it doesn't show how to get a server going on another machine. But note that if you do plan to use a ssh tunnel, you don't point pathos at the remote machine directly; you point it at localhost with the tunneled port, and the tunnel connects you to the remote machine.
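
As a minimal sketch of that client side, assuming a tunnel such as ssh -N -L 5653:localhost:5653 user@192.168.0.1 is already running (the host, port, and username here are placeholders):

from pathos.parallel import ParallelPool as Pool

# pathos talks only to localhost:5653; the ssh tunnel carries the
# traffic to the ppserver running on the remote machine
pool = Pool(ncpus=2, servers=('localhost:5653',))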

Since pathos uses ppft (a fork of pp), you can look at the pp examples for how to set up a remote server. Basically, you can do something like this with a shell script:

for i in $nodes
do
    ssh -f $i /home/username/bin/ppserver.py -p $portnum -w 2 -t 30 &
done

Here the loop runs over a list of node names ($nodes). For each node, ssh -f starts a ppserver with a specified port (-p), two workers (-w), and a timeout after 30 seconds of idle (-t). See the pp documentation (http://www.parallelpython.com/content/view/15/30). With pathos, you only really need to start a ppserver and specify the port. Then you add each hostname and port to the servers tuple in the first block of code.
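
For instance, once a ppserver is listening on port 35000 on each of the two machines from your question (the addresses and port here are illustrative, and the worker count per machine is set with -w when each ppserver starts), the client side is simply:

from pathos.parallel import ParallelPool as Pool

def work(x):
    return x * x   # stand-in for the real per-item job

pool = Pool()
pool.ncpus = 8                           # workers on the local machine
pool.servers = ('192.168.0.1:35000',     # one entry per remote ppserver,
                '192.168.0.2:35000')     # e.g. started with: ppserver.py -p 35000
print(pool.map(work, range(16)))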

However, if you are averse to setting things up manually, pathos provides scripts that set up a tunnel and a ppserver for you. Using a script is a little less flexible than doing it manually, and a bit more difficult to diagnose when things go wrong… but nonetheless, see the scripts here: https://github.com/uqfoundation/pathos/tree/master/scripts.

Mike McKerns
  • A few more things: (1) I'm the `pathos` author, (2) distributed computing is fairly fragile, so be forewarned that things will fail at some point, and leave a mess to clean up, (3) the cost of the function you are going to distribute has to be higher than the cost of making the connection to the distributed cluster, starting up a python instance, and tunneling over the objects, and (4) you have to have the same version of `ppft` installed on all machines or you get an error. – Mike McKerns Mar 23 '16 at 13:48