I have three files, each containing close to 300k records. I wrote a Python script that processes these files with some business logic and creates the output file successfully. This run completes in about 5 minutes.
I am now using the same script to process a much higher volume of data (each of the three input files contains about 30 million records). The processing now takes hours and keeps running for a very long time.
So I am thinking of splitting each file into 100 smaller chunks based on the last two digits of the unique ID and processing the chunks in parallel. Are there any data pipeline packages I could use to do this?
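For reference, this is roughly the chunking scheme I had in mind, sketched with only the standard library. The record format and the business logic here are placeholders (`process_chunk` just uppercases each record, and I assume the ID is the first CSV field):

```python
from collections import defaultdict
from multiprocessing import Pool


def bucket_of(record_id: str) -> int:
    # Use the last two digits of the unique ID as the bucket key (00-99)
    return int(record_id[-2:])


def process_chunk(records):
    # Placeholder for the real business logic applied to one chunk
    return [r.upper() for r in records]


def run(all_records):
    # Partition records into up to 100 chunks keyed by the last two ID digits
    chunks = defaultdict(list)
    for rec in all_records:
        rec_id = rec.split(",")[0]  # assumption: the ID is the first CSV field
        chunks[bucket_of(rec_id)].append(rec)

    # Process the chunks in parallel worker processes
    with Pool() as pool:
        results = pool.map(process_chunk, list(chunks.values()))

    # Flatten the per-chunk outputs into one result list
    return [row for chunk in results for row in chunk]


if __name__ == "__main__":
    sample = ["id01,alice", "id02,bob", "id01,carol"]
    print(run(sample))
```

This works on a small sample, but I am not sure multiprocessing alone is the right approach at the 30-million-record scale, which is why I am asking about pipeline packages.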
BTW, I am running this process on my VDI machine.