I have a lot of Monte Carlo data that I need to process on a particular cluster. For a given data sample (around 70 GB on average), I run a statistics script in Python on the data and save the results to an HDF5 file, which reduces the overall size of that data by 90%.
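To give an idea of what the script does, it boils down to something like this (a heavily simplified sketch; assume plain h5py writes, one output file per sample, and placeholder statistics):

# myScript.py (simplified sketch) -- h5py and the placeholder statistics are assumptions
import sys
import numpy as np
import h5py

sample = sys.argv[1]                                 # sample path passed in by GNU parallel
data = np.memmap(sample, dtype="f8", mode="r")       # stand-in for loading the ~70 GB sample

stats = {"mean": float(data.mean()),                 # stand-in statistics
         "std": float(data.std())}

with h5py.File(sample + ".stats.h5", "w") as f:      # write the reduced results to HDF5
    for name, value in stats.items():
        f[name] = value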
There is not much I can do to speed up the script itself since the files are so large, so each sample takes a long time to finish.
To speed up the overall processing, I run the following command:
cat sampleList.txt | parallel -j 20 ipython myScript.py 2>&1 | tee logDir/myLog.txt
where 36 cores are available.
What ends up happening, though, is that over time a certain number of these 20 processes get killed automatically. I don't necessarily have a problem with that. However, when one of these processes gets killed, the HDF5 file it was writing becomes corrupted.
I was wondering whether it is possible to have something in my Python script that forces the data I have written to be flushed and the file to be closed before the process gets terminated. Or maybe there are better alternatives.
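To make that concrete, this is roughly what I have in mind: a signal handler that flushes and closes the file when the process is told to terminate (a minimal sketch, assuming h5py, and assuming the killed processes actually receive SIGTERM rather than SIGKILL):

import signal
import sys
import h5py

out = h5py.File("stats_output.h5", "w")   # hypothetical output file name

def close_on_term(signum, frame):
    # Flush buffered data and close the file so it stays readable, then exit.
    # This only helps if the process receives SIGTERM; SIGKILL cannot be caught.
    out.flush()
    out.close()
    sys.exit(1)

signal.signal(signal.SIGTERM, close_on_term)

# ... statistics computation writes into `out` here ...

out.close()

Would something along these lines work, or is there a more robust pattern (for example, writing to a temporary file and only renaming it to the final name after a clean finish)?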
What should I do? Thanks!