
I have a large CSV file and I need to process every row to count some words. I need to use some MPI approach to distribute the processing among multiple processes; currently I'm using scatter/gather from the mpi4py library. The problem is that I need to build a list with length equal to the number of processes, but I get a memory error while filling that list for large row counts.

from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

lines = [[] for _ in range(size)]
with open('x.csv') as f:
    for i, line in enumerate(f):
        # this line raises MemoryError after about 250000 rows are appended
        lines[i % size].append(line)

Is there another way to transfer large amounts of data between these processes?

  • 1) Does this run on multiple nodes? 2) Is this a shared file system? Generally it seems **very unlikely** that you actually gain performance from your approach rather than serially counting your words, as this is likely limited by disk I/O bandwidth, not computation. – Zulan Apr 02 '16 at 18:44
  • @Zulan yes, I have multiple nodes and a shared file system. Currently performance is not that important to me; I just want to test and work with this MPI library on my file. – stardiv Apr 02 '16 at 19:19

1 Answer


You basically have the following options:

  1. Process the data in chunks, e.g. read 200k rows on the root, scatter them, collect the partial results, and repeat (see the first sketch after this list).
  2. Read the data locally, e.g. 1/size of the file on each rank. This can be tricky to do efficiently: you cannot seek to a specific line in a CSV file, so you have to split the file by byte size, seek to your split position, skip ahead to the next newline, and work from there until the first newline past the end of your part of the file (see the second sketch after this list).
  3. Combine both.
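
Option 1 might look like the following minimal sketch. It assumes whitespace-separated words and that only rank 0 opens x.csv; the chunk size of 200000 rows is just an illustrative number to tune against available memory, and count_words is a helper invented here, not part of mpi4py.

from itertools import islice
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

CHUNK = 200000          # rows the root reads per round; tune to available memory

def count_words(rows):
    counts = {}
    for row in rows:
        for word in row.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

total = {}
f = open('x.csv') if rank == 0 else None
while True:
    if rank == 0:
        block = list(islice(f, CHUNK))
        buckets = [block[i::size] for i in range(size)]   # one slice of the block per rank
    else:
        buckets = None
    local = count_words(comm.scatter(buckets, root=0))
    for part in comm.gather(local, root=0) or []:          # gather returns None on non-root ranks
        for word, n in part.items():
            total[word] = total.get(word, 0) + n
    # rank 0 decides whether the file is exhausted and tells everyone
    if comm.bcast(len(block) < CHUNK if rank == 0 else None, root=0):
        break

if rank == 0:
    f.close()
    print(sorted(total.items(), key=lambda kv: -kv[1])[:10])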
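
Option 2, reading 1/size of the file on each rank, could be sketched like this. It assumes x.csv is reachable from every rank (the shared file system mentioned in the comments) and ASCII/UTF-8 content; the file is opened in binary mode so that seek() and tell() are plain byte offsets, and the boundary handling follows the "skip the line straddling the split point" rule described in the list item.

import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

path = 'x.csv'                           # assumed visible from every rank (shared file system)
filesize = os.path.getsize(path)
start = rank * filesize // size          # first byte of this rank's share
end = (rank + 1) * filesize // size      # first byte of the next rank's share

counts = {}
with open(path, 'rb') as f:              # binary mode: seek()/tell() are plain byte offsets
    if rank != 0:
        f.seek(start - 1)
        f.readline()                     # finish the line straddling the boundary; the previous rank owns it
    while True:
        pos = f.tell()
        line = f.readline()
        if not line or pos >= end:       # stop once the next line starts in another rank's share
            break
        for word in line.decode().split():
            counts[word] = counts.get(word, 0) + 1

# merge the per-rank dictionaries on rank 0
merged = {}
for part in comm.gather(counts, root=0) or []:
    for word, n in part.items():
        merged[word] = merged.get(word, 0) + n
if rank == 0:
    print(len(merged), 'distinct words')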

But then again, you could just process the file serially, line by line, throwing each line away after you have counted its words.
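
For comparison, the serial version keeps only the running totals in memory. A minimal sketch with collections.Counter, again assuming whitespace-separated words:

from collections import Counter

counts = Counter()
with open('x.csv') as f:
    for line in f:
        counts.update(line.split())   # only the running totals stay in memory
print(counts.most_common(10))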

P.S. Consider the csv module.
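
The csv module yields one row at a time as a list of fields, so it fits the same streaming pattern; a small sketch (the default comma delimiter is an assumption):

import csv
from collections import Counter

counts = Counter()
with open('x.csv', newline='') as f:
    for row in csv.reader(f):             # row is a list of fields; quoting is handled for you
        for field in row:
            counts.update(field.split())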

Zulan