So, I have some 30 files, each 1 GB in size, and I am reading them sequentially on a Mac with 16 GB RAM and 4 CPU cores. Processing each file is independent of the others, and the whole run currently takes almost 2 hours. Each file holds one day of time-series data (24 hours), so there are 30 days of data in total. After processing, I append the output to a single file day-wise (i.e. Day 1, Day 2, and so on).
Can I solve this problem using multiprocessing? Does it have any side effects (like thrashing)? It would also be great if someone could guide me on the pattern. I have read about multiprocessing, Pools and imap, but it is still not clear to me how to write to the output file sequentially (i.e. day-wise).
My approach would be one of the following:
- Use imap to get the ordered output I am looking for (see the sketch below), or
- Write an individual output file for each input file and then merge them into one, sorted day-wise.
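For option 1, something like the sketch below is what I had in mind, assuming a `process_file` function that does the real per-day work (the file names, the line-count placeholder and `combined_output.txt` are just stand-ins):

```python
import multiprocessing as mp

def process_file(path):
    # Stand-in for the real per-day processing; here it just counts lines
    # so that the sketch runs end to end.
    with open(path) as f:
        line_count = sum(1 for _ in f)
    return f"{path}: {line_count} lines\n"

if __name__ == "__main__":
    # Hypothetical input names, one file per day, already in day order.
    paths = [f"day_{i:02d}.txt" for i in range(1, 31)]

    with mp.Pool(processes=4) as pool, open("combined_output.txt", "w") as out:
        # imap yields results in the same order as `paths`, even if workers
        # finish out of order, so the writes stay day-wise sequential while
        # the heavy processing runs on all 4 cores.
        for result in pool.imap(process_file, paths, chunksize=1):
            out.write(result)
```

My understanding is that imap only keeps a limited number of results in flight (unlike map, which would collect all 30 results in memory at once), which is why I was leaning towards it given the 16 GB of RAM, but I am not sure that is correct.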
Is there a better pattern to solve this problem? Do I need to use a queue here? Confused!