I have a large CSV file and I need to process every row to count some words. I want to use MPI to distribute the processing among multiple processes; currently I'm using scatter/gather from the mpi4py library. The problem is that I need to build a list of lists whose length equals the number of processes, but I get a MemoryError when populating it for large row counts.
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

# build one bucket of rows per process, round-robin
lines = [[] for _ in range(size)]
with open('x.csv') as f:
    for i, line in enumerate(f):
        # this line raises MemoryError after about 250000 rows are appended
        lines[i % size].append(line)
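For reference, the scatter/gather step that follows looks roughly like this (a simplified sketch; the word count here just stands in for my actual per-row processing):

rank = comm.Get_rank()

# each rank receives one of the sublists built above
chunk = comm.scatter(lines, root=0)

# placeholder processing: count whitespace-separated words per row
local_count = sum(len(row.split()) for row in chunk)

# collect the per-rank counts back on rank 0
counts = comm.gather(local_count, root=0)
if rank == 0:
    print('total words:', sum(counts))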
Is there another way to transfer large amounts of data among these processes without building the whole list in memory first?