I have a script that has to kick off 2 independent processes, and wait until one of them finishes before continuing.
Up to now, I've created one of the processes with the usual `if (fork() == 0) { exec ... } else { wait }` pattern, and launched the other one with `system` and a command line.
Now I'm preparing to roll this script out to run 400 iterations of such work-pair processes on Platform Load Sharing Facility (LSF), and I'm concerned about stability. I know the processes can crash, so I need a way to detect that a process has crashed, kill its pair process, and stop the main script.
Originally I wrote a watchdog with a 3-minute watch period: if 3 minutes pass with no activity, it kills the processes. However, this produced a lot of false positives, because when LSF suspends one of the two processes, the watchdog sees them as inactive.
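Here is a compressed sketch of that watchdog, with the timeouts shrunk from minutes to seconds; `$heartbeat` is a hypothetical file the worker touches while making progress. A process that LSF suspends (SIGSTOP) also stops touching the file, which is exactly the false positive described:

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";

my $heartbeat = "/tmp/heartbeat.$$";   # hypothetical sign-of-life file
my $window    = 2;                     # 2 s here; 3 * 60 in the real script

open my $fh, '>', $heartbeat or die "open: $!";
close $fh;

my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0) {
    for (1 .. 2) {                      # child makes "progress" for 2 s,
        utime undef, undef, $heartbeat; # touching the file each second,
        sleep 1;
    }
    sleep 60;                           # then simulates a hang
    exit 0;
}

my $killed = 0;
until (waitpid($pid, WNOHANG)) {        # poll until the child is reaped
    if (time() - (stat $heartbeat)[9] > $window) {
        kill 'KILL', $pid;              # inactivity => assume crash/hang
        $killed = 1;
    }
    sleep 1;
}
unlink $heartbeat;
```

A suspended child behaves identically to the hang simulated above, so mtime-based inactivity alone cannot tell the two apart.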
In LSF, when I submit the jobs, I have the option to kill them. However, when I kill a job, what exactly do I kill? Will the kill take down the two processes the Perl script has created, or leave them running as zombies?
To reiterate:

1. Will killing a job on the LSF queue also kill every process that job has created?
2. What's the best (safest?) way to spawn two independent processes from a Perl script and wait until one of them exits before continuing?
3. How can I write a watchdog that distinguishes between a process that has crashed and a process that has been suspended by the LSF admin?