
I am working on a Java program that launches a child process, receives data through its stdout, performs some calculation, and then repeats the cycle. I run this program on a supercomputer that uses a Torque-related PBS with a special scheduling feature that suspends jobs periodically in such a way as to maximise system utilisation.

One problem I had during execution was an instance where my child process mysteriously hung (cause currently unknown), causing Java to wait for a response that was never going to arrive. What I would like to do is monitor this process and enforce an execution time cutoff, i.e., if the process runs for an unusually long time, terminate and throw some kind of error letting me know that this happened.

Normally, I would use an Apache Commons Exec watchdog to do this. But I am worried that any time this job spends suspended will count towards the cutoff (assuming the watchdog uses the difference between start and finish values of System.currentTimeMillis()). Would an Apache Commons Exec watchdog suffer from this? Is there any way to exclude suspend time from the elapsed time calculation?
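
To make the idea of "elapsed time without suspend time" concrete, here is a rough sketch of one possible approach, not a description of how Commons Exec works. It polls the child in short ticks and credits at most one tick per poll, so a long gap between polls (which would appear if the JVM itself had been suspended) is discarded. The names `SuspendAwareWatchdog`, `creditedNanos` and `waitFor` are made up for illustration, and the sketch assumes that the scheduler suspends the JVM together with the child and that `System.nanoTime()` keeps advancing while the JVM is stopped.

```java
// Hypothetical sketch: a cutoff that ignores time the JVM spent suspended.
final class SuspendAwareWatchdog {

    // Credit at most one tick per poll. A gap much longer than the tick
    // suggests the JVM was suspended, so the excess is thrown away.
    static long creditedNanos(long gapNanos, long tickNanos) {
        return Math.max(0L, Math.min(gapNanos, tickNanos));
    }

    // Returns true if the child exits within activeBudgetMillis of credited
    // (non-suspended) time; otherwise kills the child and returns false.
    static boolean waitFor(Process child, long activeBudgetMillis, long tickMillis)
            throws InterruptedException {
        final long tickNanos = tickMillis * 1_000_000L;
        final long budgetNanos = activeBudgetMillis * 1_000_000L;
        long activeNanos = 0L;
        long last = System.nanoTime();
        while (child.isAlive()) {
            if (activeNanos >= budgetNanos) {
                child.destroyForcibly(); // cutoff reached: kill the hung child
                return false;
            }
            Thread.sleep(tickMillis);
            long now = System.nanoTime();
            activeNanos += creditedNanos(now - last, tickNanos);
            last = now;
        }
        return true;
    }
}
```

A caller could throw its own exception when `waitFor` returns false. Note that the credited time slightly undercounts real running time (each sleep takes a little longer than one tick), and a heavily loaded node could also produce long gaps, so the cutoff behaves as a lower bound on active time rather than an exact measurement.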

jason_r
  • If Java is spawning a child process, then it would seem that Torque would suspend only the Java process and not the child process. This would mean that the child process runs continuously and you could use normal unix tools to query the child process's CPU time vs. Wall Clock time, no? A random thought, could the Java process be missing the signal of the child's exit because it is suspended by Torque at that time? – Sam Jun 08 '12 at 03:35
  • Sam, thanks for your reply! I suspect the job suspension in this case is a little more sophisticated, although I'm not sure on this. I think the PBS is not a stock standard implementation... In any case, whether or not I am right (and I am very naive when it comes to the ins and outs of torque, _et al._), perhaps I should investigate standard unix methods for CPU vs wall time instead. Thanks for the advice! – jason_r Jun 08 '12 at 04:30
  • Have a look at [this JavaWorld article](http://www.javaworld.com/jw-12-2000/jw-1229-traps.html) to make sure you're allowing for all of the Process execution gotchas that can cause child processes to hang. –  Jun 08 '12 at 04:54
  • Glad to help. I apologize, I neglected to mention that the "ps" command has a user_time time and system_time that can be printed, given a process ID. Might be helpful. Good luck! – Sam Jun 08 '12 at 04:55
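
The CPU time versus wall clock time comparison suggested in the comments can also be made from inside Java rather than through `ps`. The following is only a sketch under assumptions: it needs Java 9 or later for `ProcessHandle`, the class and method names (`ChildTimes`, `ratio`, `cpuFraction`) are invented for illustration, and the platform may not report CPU time at all, in which case the result is empty.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical helper: compares a process's CPU time with its wall-clock age.
final class ChildTimes {

    // CPU time divided by wall time; callers must pass a positive wall duration.
    static double ratio(Duration cpu, Duration wall) {
        return (double) cpu.toNanos() / wall.toNanos();
    }

    // Fraction of its lifetime the process spent on the CPU, or empty if the
    // operating system does not expose the CPU time or start instant.
    static Optional<Double> cpuFraction(ProcessHandle handle) {
        ProcessHandle.Info info = handle.info();
        Optional<Duration> cpu = info.totalCpuDuration();
        Optional<Instant> start = info.startInstant();
        if (!cpu.isPresent() || !start.isPresent()) {
            return Optional.empty();
        }
        Duration wall = Duration.between(start.get(), Instant.now());
        if (wall.isZero() || wall.isNegative()) {
            return Optional.empty();
        }
        return Optional.of(ratio(cpu.get(), wall));
    }
}
```

For a child started from Java, the handle would come from `process.toHandle()`. A very low fraction over a long lifetime hints that the child is either blocked or suspended, but this measurement alone cannot tell the two apart.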

0 Answers