
I have an extremely large file that needs to be processed, which also seems like a perfect time to learn parallelism. My idea (possibly wrong) was to read the file in chunks, not one at a time but several chunks simultaneously. If I allocate 4 cores, then for a file with 1000 rows and a chunk size of 100 rows the machine would process it like this:

First round: chunk1_1(0-100), chunk1_2(101-200), chunk1_3(201-300), chunk1_4(301-400)

Second round: chunk2_1(401-500), chunk2_2(501-600), chunk2_3(601-700), chunk2_4(701-800) and so on..
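To make the idea concrete, here is roughly the pattern I have in mind. This is just a sketch: process_chunk is a placeholder for the real per-chunk work, and chunksize=100 matches the toy numbers above.

import multiprocessing as mp
from itertools import islice
import pandas as pd
import sys

def process_chunk(chunk):
    # placeholder for the real per-chunk work
    return len(chunk)

if __name__ == '__main__':
    reader = pd.read_table(sys.argv[1], delim_whitespace=True, chunksize=100, header=None)
    results = []
    with mp.Pool(4) as pool:
        while True:
            batch = list(islice(reader, 4))  # one "round" of up to 4 chunks
            if not batch:
                break
            # the chunks of this round are handed to the 4 workers at once
            results.extend(pool.map(process_chunk, batch))
    print(sum(results))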

I wrote the multiprocessing code below, but it gives only a slight improvement (~10 seconds) over a single core for a file with 1M rows. A mean function is used here just as a playground. Any ideas what my bottleneck is? Or is my understanding wrong as well?

import multiprocessing as mp
import pandas as pd
import sys

def pre_mean(dat):
    # return the row count and the per-column sums for one chunk
    sum_x = 0
    sum_y = 0
    n = len(dat[0])
    for index, row in dat.iterrows():
        sum_x += row[0]
        sum_y += row[1]
    return n, sum_x, sum_y

if __name__ == '__main__':
    
    # stream the file in chunks of 10,000 rows
    reader = pd.read_table(sys.argv[1], delim_whitespace=True, chunksize=10000, header=None)
    pool = mp.Pool(4)  # use 4 processes
    n_list = []
    x_list = []
    y_list = []
    for chunk in reader:
        # .get() blocks until the submitted chunk has been processed
        n, sum_x, sum_y = pool.apply_async(pre_mean, [chunk]).get()

        n_list.append(n)
        x_list.append(sum_x)
        y_list.append(sum_y)
    
    # combine the per-chunk partial sums into the overall means
    mean_x = sum(x_list) / sum(n_list)
    mean_y = sum(y_list) / sum(n_list)
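For comparison, this is the submission pattern I thought I was writing, where all chunks are handed to the pool first and the .get() calls happen only afterwards. Again just a sketch, not tested on the real file; whether it behaves the way I expect is part of my question.

import multiprocessing as mp
import pandas as pd
import sys

def pre_mean(dat):
    sum_x = 0
    sum_y = 0
    n = len(dat[0])
    for index, row in dat.iterrows():
        sum_x += row[0]
        sum_y += row[1]
    return n, sum_x, sum_y

if __name__ == '__main__':
    reader = pd.read_table(sys.argv[1], delim_whitespace=True, chunksize=10000, header=None)
    pool = mp.Pool(4)
    # submit every chunk first and keep the AsyncResult handles
    jobs = [pool.apply_async(pre_mean, [chunk]) for chunk in reader]
    pool.close()
    pool.join()
    # collect the partial results only after all chunks have been submitted
    n_list, x_list, y_list = zip(*(job.get() for job in jobs))
    mean_x = sum(x_list) / sum(n_list)
    mean_y = sum(y_list) / sum(n_list)
    print(mean_x, mean_y)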