I have a tab-separated input text file:

0   .4
1   .9
2   .2
3   .12
4   .55
5   .98

I analyze it in plain Python as:

with open("songs.tsv") as f:
    lines = f.readlines()

def extract_hotness(line):
    # The second whitespace-separated column holds the hotness score.
    return float(line.split()[1])

songs_hotness = map(extract_hotness, lines)
max_hotness = max(songs_hotness)

How do I perform the same operation in parallel using mpi4py? I started implementing this with scatter, but that won't work directly, because scatter needs the list it sends to contain exactly one element per process, and my file has a different number of lines.

Andrea Zonca

1 Answer

Processing a text file in parallel is difficult. Where do you split the file? Are you even reading from a parallel file system? You might consider MPI-IO if you have a large enough input file. If you go that route, these answers, provided in a C context, describe the challenges that still hold in mpi4py: https://stackoverflow.com/a/31726730/1024740 and https://stackoverflow.com/a/12942718/1024740

Another approach is not to scatter the data but to read it all in on rank 0 and broadcast it to everyone else. This approach requires enough memory to stage all the input data at once, or a master-worker scheme in which only part of the data is read at a time.
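If the file is small enough to read on rank 0 and scatter is still preferred, one option is to group the lines into exactly as many chunks as there are processes before scattering, since the lowercase comm.scatter expects one element per process. The following is only a minimal sketch under that assumption: split_into_chunks and parallel_max_hotness are hypothetical helper names, and comm is assumed to be an mpi4py communicator such as MPI.COMM_WORLD.

```python
def extract_hotness(line):
    """Return the second whitespace-separated column as a float."""
    return float(line.split()[1])


def split_into_chunks(items, n_chunks):
    """Split items into exactly n_chunks lists, with no item duplicated.

    Chunk i receives items i, i + n_chunks, i + 2 * n_chunks, ...
    Some chunks are empty when there are fewer items than chunks.
    """
    return [items[i::n_chunks] for i in range(n_chunks)]


def parallel_max_hotness(comm, path, root=0):
    """Scatter the lines of `path` over `comm`; return the global max on root.

    `comm` is assumed to be an mpi4py communicator (e.g. MPI.COMM_WORLD).
    Ranks other than `root` return None.
    """
    rank = comm.Get_rank()
    size = comm.Get_size()

    chunks = None
    if rank == root:
        # Only the root touches the file; blank lines are skipped.
        with open(path) as f:
            lines = [line for line in f if line.strip()]
        chunks = split_into_chunks(lines, size)

    # Each rank receives one chunk (a possibly empty list of lines).
    my_lines = comm.scatter(chunks, root=root)

    # A rank with no lines contributes None instead of a number.
    local_max = max(map(extract_hotness, my_lines), default=None)

    # Collect the per-rank maxima on the root and combine them there.
    all_maxes = comm.gather(local_max, root=root)
    if rank != root:
        return None
    return max((m for m in all_maxes if m is not None), default=None)
```

With mpi4py installed, this could be driven by importing MPI from mpi4py, calling parallel_max_hotness(MPI.COMM_WORLD, "songs.tsv"), and launching the script with something like mpiexec -n 4 python script.py. Gathering the local maxima and combining them on the root keeps the sketch free of MPI constants; comm.reduce with op=MPI.MAX is the more usual choice when every rank is guaranteed to hold at least one line.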

Rob Latham
  • Thanks. I need this just as an example, so we can assume the file is tiny. I do not want to broadcast it but to scatter it; any splitting of the file is fine, as long as no line is duplicated across nodes. – Andrea Zonca Aug 07 '15 at 19:59