
I have a script that has to kick off 2 independent processes, and wait until one of them finishes before continuing.

Up to now, I've created one process with fork (if the returned PID is 0, exec; otherwise wait). The other one is created using system and the command line.
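For later readers, the arrangement described above looks roughly like this. The two commands are stand-ins (perl one-liners that exit immediately), not the asker's actual programs:

```perl
use strict;
use warnings;

# Sketch of the described arrangement: one child started with fork/exec,
# the other with system(). Both commands are stand-ins for real programs.
my $pid = fork();
die("fork failed: $!\n") if !defined($pid);
if (!$pid) {
    exec($^X, '-e', 'exit 0');   # Child: replace itself with the first program.
    die("exec failed: $!\n");    # Only reached if exec itself fails.
}

system($^X, '-e', 'exit 0');     # Second process; system() blocks until it exits.

waitpid($pid, 0);                # Now wait for the fork/exec'd child.
my $status = $? >> 8;
print "first child exited with status $status\n";
```

Note that with this structure the script continues as soon as the system() command returns, regardless of whether the fork/exec'd child is still running.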

Now I'm preparing to roll this script out to run 400 iterations of such work-pair processes on Platform Load Sharing Facility (LSF), however I'm concerned with stability. I know that the processes can crash. In such a case, I need a method to know when a process has crashed, and kill its pair process and the main script.

Originally I had written a watchdog with a 3-minute watch period: if 3 minutes of inactivity pass, it kills the processes. However, this caught a lot of false positives, because when LSF suspends one of the two processes, the watchdog sees them as inactive.

In LSF, when I issue the jobs, I have the option to kill them. However, when I kill a job, what exactly do I kill? Will the kill take down the two processes the Perl script has created, or leave them running as zombies?

To reiterate,

  • Will killing a job on the LSF queue also kill every process that job has created?

  • What's the best (safest?) way to create two independent processes from a Perl script, and to wait until one of them exits before continuing?

  • How can I write a watchdog that can distinguish between a process that has crashed and a process that has been suspended by the LSF admin?
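Regarding the last point: on Linux, one way for a watchdog to tell a suspended process from a dead one is to read the state letter in `/proc/<pid>/stat`, where `T` means stopped (e.g. by `SIGSTOP`). A minimal, Linux-specific sketch, using a stand-in child rather than a real LSF job:

```perl
use strict;
use warnings;

# Return the process state letter from /proc/$pid/stat on Linux:
# 'R' running, 'S' sleeping, 'T' stopped (e.g. by SIGSTOP), etc.
# Returns undef if the process no longer exists.
sub proc_state {
    my ($pid) = @_;
    open(my $fh, '<', "/proc/$pid/stat") or return undef;
    my $line = <$fh>;
    # The state is the field after the parenthesised command name.
    my ($state) = $line =~ /\)\s+(\S)/;
    return $state;
}

my $pid = fork();
die("fork failed: $!\n") if !defined($pid);
if (!$pid) { sleep 60; exit 0; }       # Stand-in for a monitored process.

kill('STOP', $pid);
sleep 1;                               # Give the stop time to take effect.
my $stopped_state = proc_state($pid);  # 'T' while suspended.

kill('CONT', $pid);
kill('TERM', $pid);
waitpid($pid, 0);

print "state while stopped: $stopped_state\n";
```

A watchdog could then treat a `T` state as "suspended by the scheduler" and suppress its inactivity timeout, instead of counting the pause as a crash.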

John Nikolaou
  • How does LSF suspend/resume? I presume it's via `SIGSTOP` and `SIGCONT`? – Sobrique May 05 '15 at 12:38
  • "The default action is to send the following signals to the job: SIGTSTP for parallel or interactive jobs. SIGTSTP is caught by the master process and passed to all the slave processes running on other hosts. SIGSTOP for sequential jobs. SIGSTOP cannot be caught by user programs. The SIGSTOP signal can be configured with the LSB_SIGSTOP parameter in lsf.conf." From "Job Controls" for LSF. I wonder if the meaning of "slave processes" in this context refers to processes generated by the original job – John Nikolaou May 05 '15 at 12:55

1 Answer


The monitor is the one that should be creating the child processes. (It can launch the "main script" too.) `wait` will tell you when they exit or crash.

my %children;

my $pid1 = fork();
if (!defined($pid1)) { ... }  # fork failed; report and abort.
if (!$pid1) { ... }           # Child: exec the first program here.
++$children{$pid1};

my $pid2 = fork();
if (!defined($pid2)) { ... }  # fork failed; report and abort.
if (!$pid2) { ... }           # Child: exec the second program here.
++$children{$pid2};

while (keys(%children)) {
   my $pid = wait();
   next if !$children{$pid};  # Reaped something else (e.g. from system()).

   delete($children{$pid});

   if    ($? & 0x7F) { ... }  # Killed by a signal.
   elsif ($? >> 8)   { ... }  # Exited with a non-zero status.
}
ikegami
  • So your idea is to restructure the scripts? So that you have one script which calls the main script and also creates two child processes, which it uses `wait` on to catch the return value and see if it's an error or killed by a signal? – John Nikolaou May 05 '15 at 14:39
  • yup. The main script can inherit pipes to the children. – ikegami May 05 '15 at 14:57
  • Also, you might not be aware of `setsid(2)` (available as `POSIX::setsid()`). Creates a process group to which you can send a signal. – ikegami May 05 '15 at 14:59
  • Thanks a lot, this looks like the best way to move forward with this! – John Nikolaou May 05 '15 at 15:21
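The `setsid` suggestion from the comments can be sketched as follows: the monitor forks a group leader, the leader calls `POSIX::setsid()` so that it and everything it subsequently forks lives in a fresh process group, and the monitor can then signal the whole group with a negative PID argument to `kill`. The sleeping children are stand-ins for the real work pair:

```perl
use strict;
use warnings;
use POSIX qw(setsid);

# Fork a group leader; inside it, setsid() puts the leader (and anything
# it forks) into a new process group that can be signalled as a unit.
my $leader = fork();
die("fork failed: $!\n") if !defined($leader);
if (!$leader) {
    setsid() or die("setsid failed: $!\n");
    for (1 .. 2) {                 # Stand-ins for the work pair.
        my $pid = fork();
        die("fork failed: $!\n") if !defined($pid);
        if (!$pid) { sleep 60; exit 0; }
    }
    sleep 60;                      # Leader idles; real code would wait().
    exit 0;
}

sleep 1;                           # Give the group time to start.
my $killed = kill(-15, $leader);   # Negative signal: SIGTERM to the whole group.
waitpid($leader, 0);               # Reap the leader.
print "signalled $killed process group\n";
```

This gives the monitor a single handle for tearing down a crashed pair: one `kill` takes out the leader and both workers, with no orphans left behind.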