I have an extremely large file that needs to be processed, and it seems like the perfect time to learn parallelism. My idea (which may be wrong) was to read the file in chunks: not one at a time, but several chunks at once. If I allocate 4 cores, then for a file with 1000 rows and a chunk size of 100 rows, the machine would process it roughly like this (see the sketch after the list):
First round: chunk1_1(0-100), chunk1_2(101-200), chunk1_3(201-300), chunk1_4(301-400)
Second round: chunk2_1(401-500), chunk2_2(501-600), chunk2_3(601-700), chunk2_4(701-800), and so on.
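To make the intended scheme concrete, here is a minimal sketch of what I have in mind (this is not my actual code, which follows below); process_chunk is just a placeholder for whatever per-chunk work gets done, and pool.map is one possible way to run one batch of 4 chunks per round:

import multiprocessing as mp
import pandas as pd
import sys

def process_chunk(chunk):
    # placeholder per-chunk work (here: partial sums for a mean)
    return len(chunk), chunk[0].sum(), chunk[1].sum()

if __name__ == '__main__':
    reader = pd.read_table(sys.argv[1], delim_whitespace=True,
                           chunksize=100, header=None)
    results = []
    with mp.Pool(4) as pool:
        batch = []
        for chunk in reader:
            batch.append(chunk)
            if len(batch) == 4:  # one "round": 4 chunks handed to the workers
                results.extend(pool.map(process_chunk, batch))
                batch = []
        if batch:  # last, possibly incomplete round
            results.extend(pool.map(process_chunk, batch))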
I wrote the multiprocessing code below, but it gives only a slight improvement in time (~10 seconds) over a single core for a file with 1M rows. The mean function here is just used as a playground. Any ideas where my bottleneck is? Is my understanding wrong as well?
import multiprocessing as mp
import pandas as pd
import sys

def pre_mean(dat):
    # partial sums for one chunk: row count, sum of column 0, sum of column 1
    sum_x = 0
    sum_y = 0
    n = len(dat[0])
    for index, row in dat.iterrows():
        sum_x += row[0]
        sum_y += row[1]
    return n, sum_x, sum_y

if __name__ == '__main__':
    # read the whitespace-delimited file in chunks of 10000 rows
    reader = pd.read_table(sys.argv[1], delim_whitespace=True, chunksize=10000, header=None)
    pool = mp.Pool(4)  # use 4 processes

    n_list = []
    x_list = []
    y_list = []
    for chunk in reader:
        # send the chunk to the pool and collect its partial sums
        n, sum_x, sum_y = pool.apply_async(pre_mean, [chunk]).get()
        n_list.append(n)
        x_list.append(sum_x)
        y_list.append(sum_y)

    # combine the per-chunk partial sums into overall means
    mean_x = sum(x_list) / sum(n_list)
    mean_y = sum(y_list) / sum(n_list)
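For comparison, this is roughly the single-core baseline I'm timing against (a sketch assuming the same imports and pre_mean as above, not the exact script):

# single-core reference: same chunked read, but pre_mean runs in the main process
n_list, x_list, y_list = [], [], []
for chunk in pd.read_table(sys.argv[1], delim_whitespace=True,
                           chunksize=10000, header=None):
    n, sum_x, sum_y = pre_mean(chunk)
    n_list.append(n)
    x_list.append(sum_x)
    y_list.append(sum_y)
mean_x = sum(x_list) / sum(n_list)
mean_y = sum(y_list) / sum(n_list)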