I have three files, each containing close to 300k records. I wrote a Python script that processes these files with some business logic and creates the output file successfully. This run completes in about 5 minutes.
I am now using the same script to process a much higher volume of data (each of the three input files contains about 30 million records). The processing now takes hours and keeps running for a very long time.
So I am thinking of splitting each file into 100 smaller chunks based on the last two digits of the unique ID and processing the chunks in parallel. Are there any data pipeline packages I could use to do this?
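For reference, this is roughly the chunking scheme I had in mind, sketched with only the standard library. The record format and the business logic here are placeholders (`process_chunk` just uppercases each record, and I assume the ID is the first CSV field):

```python
from collections import defaultdict
from multiprocessing import Pool


def bucket_of(record_id: str) -> int:
    # Use the last two digits of the unique ID as the bucket key (00-99)
    return int(record_id[-2:])


def process_chunk(records):
    # Placeholder for the real business logic applied to one chunk
    return [r.upper() for r in records]


def run(all_records):
    # Partition records into up to 100 chunks keyed by the last two ID digits
    chunks = defaultdict(list)
    for rec in all_records:
        rec_id = rec.split(",")[0]  # assumption: the ID is the first CSV field
        chunks[bucket_of(rec_id)].append(rec)

    # Process the chunks in parallel worker processes
    with Pool() as pool:
        results = pool.map(process_chunk, list(chunks.values()))

    # Flatten the per-chunk outputs into one result list
    return [row for chunk in results for row in chunk]


if __name__ == "__main__":
    sample = ["id01,alice", "id02,bob", "id01,carol"]
    print(run(sample))
```

This works on a small sample, but I am not sure multiprocessing alone is the right approach at the 30-million-record scale, which is why I am asking about pipeline packages.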
BTW, I am running this process on my VDI machine.