
I have a lot of Monte Carlo data that I need to process on a particular cluster. For a given data sample (roughly 70 GB on average), I run a statistics script in Python on that data and save the results to an HDF5 file, which reduces the overall size of the data by 90%.

There is not much I can do to speed up the script itself because the files are so large, so each sample takes a long time to finish.

To speed up the overall processing, I run the following command:

cat sampleList.txt | parallel -j 20 ipython myScript.py 2>&1 | tee logDir/myLog.txt

where 36 cores are available.

What ends up happening, though, is that over time a certain number of these 20 processes get killed automatically. I don't necessarily have a problem with that. However, when one of these processes gets killed, the HDF5 file being written by that process becomes corrupted.

I was wondering whether it is possible to have a flag in my Python script that would force the file I am writing to be flushed and closed before the process gets terminated. Or maybe you have better alternatives.

What should I do? And thanks!

firest

1 Answer


Would it make sense to close your file after each write:

while input
  compute
  if received TERM signal: exit
  open >>file.hd5
  write stuff
  close file.hd5
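
As a rough sketch of how that pattern might look inside the Python script, assuming h5py is used for the HDF5 output; compute() and the loop over chunks are placeholders standing in for the real statistics work:

import signal
import sys
import h5py
import numpy as np

got_term = False
def on_term(signum, frame):          # remember that TERM arrived instead of dying mid-write
    global got_term
    got_term = True
signal.signal(signal.SIGTERM, on_term)

def compute(i):                      # placeholder for the real statistics step
    return np.random.rand(1000)

for i in range(100):                 # placeholder loop over input chunks
    result = compute(i)
    if got_term:                     # exit cleanly between writes, never during one
        sys.exit(0)
    # open, write, close on every iteration so the file on disk is always consistent
    with h5py.File("file.hd5", "a") as f:
        f.create_dataset(f"chunk_{i}", data=result)

Note that this only helps if the processes are killed with SIGTERM; a SIGKILL (e.g. from the OOM killer) cannot be caught, in which case the open/write/close per chunk is what limits the damage to at most the last chunk.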
Ole Tange