0

I'm facing a strange race condition in my bash program. I tried duplicating it via a simple enough demo program but, obviously, as true for all/most timing-related race demonstration attempts, I couldn't.

Here's an abstracted version of the program that DOES NOT duplicate the issue, but let me still explain:

# Abstracted version of the original program
# that is NOT able to demo the race.
#
function foo() {
    local instance=$1

    # [A lot of logic here -
    #  all foreground commands, nothing in the background.]

    echo "$instance: test" > /tmp/foo.$instance.log        
    echo "Instance $instance ended"
}

# Launch the process in background...
#
echo "Launching instance 1"
foo 1 &

# ... and wait for it to complete.
#
echo "Waiting..."
wait
echo "Waiting... done.  (wait exited with: $?)"

# This ls command ALWAYS fails in the real
# program in the 1st while-iteration, complaining about 
# missing files, but works in the 2nd iteration!
#
# It always works in the very 1st while-iteration of the
# abstracted version.
#
while ! ls -l /tmp/foo.*; do
    :
done

In my original program (and NOT in the above abstracted version), I do see Waiting... done. (wait exited with: 0) on stdout, just as I see in the above version. Yet, the ls -l always fails in the original, but always works in the above abstracted version in the very first while loop iteration.

Also, the ls command fails despite seeing the Instance 1 ended message on stdout. The output is:

$ ./myProgram
Launching instance 1
Waiting...
Waiting... done. (wait exited with: 0)
Instance 1 ended
ls: cannot access '/tmp/foo.*': No such file or directory
/tmp/foo.1
$

I noticed that the while loop can be safely done away with if I put a sleep 1 right before ls in my original program, like so:

# This too works in the original program:
sleep 1
ls -l /tmp/foo.*

Question: Why isn't wait working as expected in my original program? Any suggestions to at least help troubleshoot the problem?

I'm using bash 4.4.19 on Ubuntu 18.04.

EDIT: I just also verified that the call to wait in the original, failing program is exiting with a status code of 0.

EDIT 2: Shouldn't the Instance 1 ended message appear BEFORE Waiting... done. (wait exited with: 0)? Could this be a 'flushing problem' with OS' disk-buffer/cache when dealing with background processes in bash?

EDIT 3: If instead of the while loop or sleep 1 hacks, I issue a sync command, then, voila, it works! But why should I have to do a sync in one program but the other?

Harry
  • 3,684
  • 6
  • 39
  • 48
  • 1
    Can you place an `echo foo $instance ended` at the end of the functions, to see when they actually end? – AlexP Jul 12 '18 at 01:41
  • Please [adapt the example code until you can reproduce it](https://stackoverflow.com/help/mcve). We have no way of determining which part of the difference (if any) between the two is responsible for the issue. – l0b0 Jul 12 '18 at 01:50
  • @AlexP When I do that, I get one error from the failed `ls` command in my `while` loop version (`ls: cannot access '/tmp/foo.*': No such file or directory`), followed by `foo 1 ended` message, followed by the `ls` command successfully completing. I'm now launching only 1 `foo` process instead of two. – Harry Jul 12 '18 at 02:01
  • @l0b0 If I could adapt and duplicate it, I probably would have no need to come here and ask. I'm only seeking possible suggestions/ideas on what to try next to troubleshoot, and NOT an actual solution. – Harry Jul 12 '18 at 02:07
  • Is `/tmp` a regular file system? Can you reproduce the error if you use something else (like your home directory) instead of `/tmp`? – chepner Jul 12 '18 at 03:27
  • In my original program, the parent directory of the log file is somewhere under `/home/me/...` only. It is local to the system, and is not a network-share. The filesystem-type is the default, Ubuntu 18.04 `ext4fs`. – Harry Jul 12 '18 at 03:37
  • @AlexP Please see my answer below covering the 3 hacks that work. I'm not 100% sure why they work. – Harry Jul 12 '18 at 04:26
  • @chepner Please see my answer below covering the 3 hacks that work. I'm not 100% sure why they work. – Harry Jul 12 '18 at 04:26
  • 1
    The only thing I can suggest is to start with your original code, and remove bits until you can't reproduce the error. There's probably something in your code that you've left out of your abstract example because you believe it can't possibly be relevant, except that it *is* (or you could reproduce the problem). – chepner Jul 12 '18 at 12:08
  • 1
    Is there any chance the wait is not running in the original parent process? Could it be running in another `&` spawned sub-shell? – Jon Jul 12 '18 at 13:04
  • @chepner A million thanks for encouraging me to do the tedious but the obvious. Namely, removing bits from the original, failing program till it got working again, AND adding bits to my abstracted, working program to get it to fail. In the interest of honesty, let me go ahead and document my stupidity. – Harry Jul 12 '18 at 14:32
  • @Jon That was a very close guess, and got me thinking in the right direction, along with the age-old, bit-wise tweaking advice from @chepner. **The Real Problem:** I was starting `foo`, not directly/plainly as shown in my inaccurate abstracted version, but via another `threadStarter` function that, after doing some bookkeeping, would also say `foo 1 &` in its body. And the call to `threadStarter` was itself suffixed with an `&`! So, my `wait` was really waiting on `threadStarter` and not on `foo`. The `sleep`, `sync`, and `while` just bought more time for `foo` to complete, I think. – Harry Jul 12 '18 at 14:45
  • Let me add another **EDIT** section to my original question that duplicates the problem, at least on my system. This demo may still be sensitive to timing, scheduling btw, so I can get even the buggy version to work by inserting extra function calls here and there. – Harry Jul 12 '18 at 14:48

1 Answers1

0

I noticed that each the following three hacks work, but not quite sure why:

Hack 1

while ! ls -l /tmp/foo.*; do
    :
done

Hack 2

sleep 1
ls -l /tmp/foo.*

Hack 3

sync
ls -l /tmp/foo.*

Could this be a 'flushing problem' with OS' disk-buffer/cache, especially when dealing with background processes, especially in bash? In other words, the call to wait seems to returning BEFORE it flushes the diskcache (or, BEFORE the OS, on its own realizes and, is done flushing the diskcache).

EDIT Thanks to @Jon, his was a very close guess and got me thinking in the right direction, along with the age-old, bit-wise tweaking advice from @chepner.

The Real Problem: I was starting foo, not directly/plainly as shown in my inaccurate abstracted version in my original question, but via another launchThread function that, after doing some bookkeeping, would also say foo 1 & in its body. And the call to launchThread was itself suffixed with an &! So, my wait was really waiting on launchThread and not on foo! The sleep, sync, and while just were helping buy more time for foo to complete, which is why introducing them worked. The following is a more accurate demonstration of the problem, even though you may or may not be able to duplicate it on your own system (due to scheduling/timing variance across systems):

#!/bin/bash -u

function now() {
    date +'%Y-%m-%d %H:%M:%S'
}

function log() {
    echo "$(now) - $@" >> $logDir/log # Line 1
}

function foo() {
    local msg=$1
    log "$msg"
    echo "  foo ended"
}

function launchThread() {
    local f=$1
    shift
    "$f" "$@" &  # Line 2
}

logDir=/tmp/log

/bin/rm -rf "$logDir"
mkdir -p "$logDir"

echo "Launching foo..."
launchThread foo 'message abc' &  # Line 3

echo "Waiting for foo to finish..."
wait
echo "Waiting for foo to finish... done. (wait exited with: $?)"

ls "$logDir"/log*

Output of the above buggy program:

Launching foo...
Waiting for foo to finish...
Waiting for foo to finish... done. (wait exited with: 0)
  foo ended
ls: cannot access '/tmp/log/log*': No such file or directory

If I remove the & from EITHER Line 2 OR from Line 3, the program works correctly, with the following as output:

Launching foo...
Waiting for foo to finish...
  foo ended
Waiting for foo to finish... done. (wait exited with: 0)
/tmp/log/log

The program also works correctly if I remove the $(now) part from Line 1.

Harry
  • 3,684
  • 6
  • 39
  • 48
  • I'm marking my answer 'tentatively as final', unless someone has an accurate (or, even better) explanation of what's going on. For all we know, this could well be a bug in `bash`. Thanks to @alexp and @chepner for their time. – Harry Jul 12 '18 at 06:05