analyze text file in parallel with mpi4py

Question

I have a tab-separated input text file:

I analyze it in plain Python as:

with open("songs.tsv") as f:
    lines = f.readlines()

def extract_hotness(line):
    return float(line.split()[1])

songs_hotness = map(extract_hotness, lines)
max_hotness = max(songs_hotness)

How do I perform the same operation in parallel using mpi4py? I started implementing this with scatter, but that does not work directly, because scatter requires the list on the root to have exactly as many elements as there are processes.

score 0 · Answer 1 · edited May 23 '17 at 11:51

Processing a text file in parallel is difficult. Where do you split the file? Are you even reading from a parallel file system? You might consider MPI-IO if you have a large enough input file. If you go that route, these answers, provided in a C context, describe the challenges that still hold in mpi4py: https://stackoverflow.com/a/31726730/1024740 and https://stackoverflow.com/a/12942718/1024740

Another approach is not to scatter the data but to read it all in on rank 0 and broadcast to everyone else. This approach requires enough memory to stage all the input data at once, or a master-worker scheme where only some data is read in one shot.
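A minimal sketch of that read-and-broadcast approach, assuming the whole file fits in memory on every rank. The function name `max_hotness_bcast` and the strided split are illustrative choices, not something from the question; `extract_hotness` is copied from it. The communicator is passed in as a parameter, so the function itself only relies on `Get_rank`, `Get_size`, `bcast` and `gather`.

```python
def extract_hotness(line):
    # Same parsing as in the question: second whitespace-separated field.
    return float(line.split()[1])


def max_hotness_bcast(comm, path):
    """Read the file on rank 0, broadcast the lines, reduce the maxima on rank 0."""
    rank = comm.Get_rank()
    size = comm.Get_size()

    lines = None
    if rank == 0:
        with open(path) as f:
            lines = f.readlines()

    # Every rank receives the full list of lines (pickled by mpi4py).
    lines = comm.bcast(lines, root=0)

    # Rank r handles lines r, r + size, r + 2 * size, ...,
    # so no line is processed by two ranks.
    my_lines = lines[rank::size]
    local_max = max((extract_hotness(l) for l in my_lines), default=None)

    # Collect the per-rank maxima on rank 0; ranks without lines send None.
    partial = comm.gather(local_max, root=0)
    if rank == 0:
        found = [m for m in partial if m is not None]
        return max(found) if found else None
    return None


# Usage sketch; launch with something like: mpiexec -n 4 python max_hotness.py
#     from mpi4py import MPI
#     result = max_hotness_bcast(MPI.COMM_WORLD, "songs.tsv")
#     if MPI.COMM_WORLD.Get_rank() == 0:
#         print(result)
```

Only rank 0 gets the result; the other ranks return None. If every rank is guaranteed at least one line, `comm.reduce(local_max, op=MPI.MAX, root=0)` could replace the gather step.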

Thanks. I need this just as an example, so we can assume the file is tiny. I do not want to broadcast it but to scatter it; any splitting of the file is fine, I just do not want duplicate lines on different nodes. — Andrea Zonca, Aug 07 '15 at 19:59
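For a tiny file, the scatter variant asked about here could look like the sketch below. Since scatter expects one element per process, rank 0 first groups the lines into exactly as many sub-lists as there are processes. The helper `split_into_chunks` and the function name `max_hotness_scatter` are hypothetical names introduced for illustration; `extract_hotness` is taken from the question, and the communicator is passed in as a parameter.

```python
def extract_hotness(line):
    # Same parsing as in the question: second whitespace-separated field.
    return float(line.split()[1])


def split_into_chunks(items, n):
    """Split items into exactly n contiguous lists whose sizes differ by at most one."""
    base, extra = divmod(len(items), n)
    chunks = []
    start = 0
    for i in range(n):
        # The first `extra` chunks get one additional item.
        stop = start + base + (1 if i < extra else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks


def max_hotness_scatter(comm, path):
    """Read on rank 0, scatter one chunk of lines per rank, combine maxima on rank 0."""
    rank = comm.Get_rank()
    size = comm.Get_size()

    chunks = None
    if rank == 0:
        with open(path) as f:
            lines = f.readlines()
        # A list of `size` lists: one element per process, no duplicated lines.
        chunks = split_into_chunks(lines, size)

    my_lines = comm.scatter(chunks, root=0)
    local_max = max(map(extract_hotness, my_lines), default=None)

    # Ranks that received an empty chunk contribute None.
    partial = comm.gather(local_max, root=0)
    if rank == 0:
        found = [m for m in partial if m is not None]
        return max(found) if found else None
    return None


# Usage sketch; launch with something like: mpiexec -n 4 python max_hotness.py
#     from mpi4py import MPI
#     result = max_hotness_scatter(MPI.COMM_WORLD, "songs.tsv")
#     if MPI.COMM_WORLD.Get_rank() == 0:
#         print(result)
```

The chunks are contiguous, so each line ends up on exactly one rank even when the line count is not a multiple of the number of processes.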
