I have a Bash script running a bunch of background jobs in parallel. Under certain conditions, before a background job completes, it sends a USR1 signal to the spawning Bash process (say, to report that some process run as part of the job terminated with a nonzero exit code).
In a simplified form, the script is equivalent to the one shown below. Here, for simplicity, each background job always sends a USR1 signal before completion, unconditionally (via the signalparent() function).
signalparent() { kill -USR1 $$; }
handlesignal() { echo 'USR1 signal caught' >&2; }
trap handlesignal USR1
for i in {1..10}; do
    {
        sleep 1
        echo "job $i finished" >&2
        signalparent
    } &
done
wait
When I run the above script (at least with Bash 3.2.57 on macOS 11.1), I observe some behavior that I cannot explain, which makes me think that I am overlooking something in the interplay of Bash job management and signal trapping.
Specifically, I would like an explanation for the following behaviors.
1. Almost always, when I run the script, I see fewer “signal caught” lines in the output (from the handlesignal() function) than there are jobs started in the for loop. Most of the time, only one to four of those lines are printed for the ten jobs started. Why is it that, by the time the wait call completes, there are still background jobs whose signaling kill commands have not yet been executed?

2. At the same time, every so often, in some invocations of the script, I observe the kill command (from the signalparent() function) report an error saying that the originating process running the script (i.e., the one with the $$ PID) is no longer present; see the output below. How come there are jobs whose signaling kill commands are still running while the parent shell process has already terminated? It was my understanding that the parent process cannot terminate before all background jobs do, because of the wait call.
call.job 2 finished job 3 finished job 5 finished job 4 finished job 1 finished job 6 finished USR1 signal caught USR1 signal caught job 10 finished job 7 finished job 8 finished job 9 finished bash: line 3: kill: (19207) - No such process bash: line 3: kill: (19207) - No such process bash: line 3: kill: (19207) - No such process bash: line 3: kill: (19207) - No such process
Both of these behaviors suggest to me that a race condition of some kind is present, but I do not quite understand where it comes from. I would appreciate it if anyone could explain it, and perhaps even suggest how the script could be changed to avoid such race conditions.
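
One detail that may be relevant: the Bash manual says that when the shell is waiting in the wait builtin and receives a signal for which a trap has been set, wait returns immediately with an exit status greater than 128, and the trap then runs. If that is what happens here, a single wait call could return while jobs are still running. Below is a minimal sketch of a variant that keeps calling wait until no job is left running. It is only a sketch under that assumption, not a verified fix; the caught counter and the final summary line are hypothetical additions for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: keep calling wait until no background job is still running.

caught=0   # hypothetical counter, not part of the original script

signalparent() { kill -USR1 $$; }
handlesignal() {
    caught=$((caught + 1))
    echo 'USR1 signal caught' >&2
}
trap handlesignal USR1

for i in {1..10}; do
    {
        sleep 1
        echo "job $i finished" >&2
        signalparent
    } &
done

# A trapped signal makes wait return early (status > 128), so check
# again for running jobs and wait again until none are left.
while [ -n "$(jobs -pr)" ]; do
    wait
done

echo "all jobs done, handler ran $caught time(s)" >&2
```

Because the parent now stays alive until every job has exited, each kill should find its target process. The handler may still run fewer than ten times, since standard signals are not queued and several USR1 signals arriving close together can be delivered as one; the sketch does not try to address that.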