0

While running cronjobs and using mk-job from check_mk to monitor its result, I've stumbled across this:

bash:

$ /usr/bin/time -f "time_exit: %x" timeout -s SIGKILL 2s sleep 10; echo "shell_exit: $?"
Command terminated by signal 9
time_exit: 0
shell_exit: 137

The exit code returned from /usr/bin/time differs from the exit code it writes to a formatted output:

time_exit != shell_exit

why?

But when using the default SIGHUP signal, exit codes match:

$ /usr/bin/time -f "time_exit: %x" timeout -s SIGHUP 2s sleep 10; echo "shell_exit: $?"
Command exited with non-zero status 124
time_exit: 124
shell_exit: 124

In the meanwhile I will use timeout -k 10s 2s ... which will first send SIGHUP and after 10s a SIGKILL, if the process was still running. In the hope that SIGHUP would properly stop it.

Background

check_mk provides mk-job to monitor job executions. mk-job uses time to record execution times AND exit code.

man time:

The time command returns when the program exits, stops, or is terminated by a signal. If the program exited normally, the return value of time is the return value of the program it executed and measured. Otherwise, the return value is 128 plus the number of the signal which caused the program to stop or terminate.

man timeout:

... It may be necessary to use the KILL (9) signal, since this signal cannot be caught, in which case the exit status is 128+9 rather than 124.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459

1 Answers1

1

GNU time's %x only makes sense when the process exits normally, not killed by signals.

[STEP 101] $ /usr/bin/time -f "%%x = %x" bash -c 'exit 2'; echo '$? = '$?
Command exited with non-zero status 2
%x = 2
$? = 2
[STEP 102] $ /usr/bin/time -f "%%x = %x" bash -c 'exit 137'; echo '$? = '$?
Command exited with non-zero status 137
%x = 137
$? = 137
[STEP 103] $ /usr/bin/time -f "%%x = %x" bash -c 'kill -KILL $$'; echo '$? = '$?
Command terminated by signal 9
%x = 0
$? = 137
[STEP 104] $

For time timeout -s SIGKILL 2s sleep 10, timeout exits normally with 137, it's not killed by SIGKILL, just like bash -c 'exit 137' in my example.


UPDATE:

Took a look at time's source code and found out %x is blindly calling WEXITSTATUS() no matter the process exits normally or not.

655             case 'x':           /* Exit status.  */
656               fprintf (fp, "%d", WEXITSTATUS (resp->waitstatus));
657               break;

In the Git master it added new %Tx:

549             case 'T':
550               switch (*++fmt)
551                 {
...
575                 case 'x': /* exit code IF terminated normally */
576                   if (WIFEXITED (resp->waitstatus))
577                     fprintf (fp, "%d", WEXITSTATUS (resp->waitstatus));
578                   break;

And from the Git master's time --help output:

  ...

  %Tt  exit type (normal/signalled)
  %Tx  numeric exit code IF exited normally
  %Tn  numeric signal code IF signalled
  %Ts  signal name IF signalled
  %To  'ok' IF exited normally with code zero

  ...
pynexj
  • 19,215
  • 5
  • 38
  • 56
  • Because time is running in the same process as timeout and timeout kills itself? Do you know about another way to capture the exit code in case the (child-)process is killed? – Florin Hillebrand Mar 08 '21 at 13:30
  • for the "timeout exiting with 137" case, `timeout` exits with 137, `timeout` itself is not killed by SIGKILL, just like `bash -c 'exit 137'` in my example. – pynexj Mar 08 '21 at 13:51
  • So it means `time` waits for `timeout` and gets the exit code 137, subtracts 128 to get the signal 9. So why would `time` itself return 137 but print 0? I would expect `time` to print the same exit code as it uses to exit, except `time` is killed. – Florin Hillebrand Mar 08 '21 at 15:27
  • as i said in the answer, `%x` only makes sense when the command exits **normally**. `%x` is always 0 when the command is killed by signals. that's how `time` is designed/implemented. – pynexj Mar 09 '21 at 01:12
  • its like [WEXITSTATUS(status)](https://man7.org/linux/man-pages/man3/wait.3p.html) only makes sense when *WIFEXITED(status)* is TRUE. – pynexj Mar 09 '21 at 01:22
  • _%x is always 0 when the command is killed by signals_ This is how you describe it, but I cannot find any part in the manual that would support this statement. There is _Not all resources are measured by all versions of Unix, so some of the values might be reported as zero._, but we are not talking about a resource that can be measured, since it should be returned by *wait3* and could be gathered with WIFSIGNALED and WTERMSIG. – Florin Hillebrand Mar 09 '21 at 07:42
  • I think I've found the relevant statement in [wait3 manual](https://man7.org/linux/man-pages/man3/wait.3p.html): _It is unspecified whether the status value returned by calls to wait() or waitpid() for processes created by posix_spawn() or posix_spawnp() can indicate a WIFSIGNALED(stat_val) if a signal is sent to the parent's process group after posix_spawn() or posix_spawnp() is called._. – Florin Hillebrand Mar 09 '21 at 07:42
  • What made it suspicious is that on a KILL signal it would print _Command terminated by signal 9_ but not return any value for %x. – Florin Hillebrand Mar 09 '21 at 07:45
  • i think that's because `%x` is defined as "Exit status of the command" in the manual. a killed process does not have a normal "exit status". – pynexj Mar 09 '21 at 07:48
  • 1
    looked at the source code and updated the answer. – pynexj Mar 09 '21 at 08:58
  • Thanks a lot @pynexj for the deep insights, I should have checked it myself, nevertheless I enjoyed arguing with you! – Florin Hillebrand Mar 09 '21 at 09:10