How to implement multithreaded access to file-based queue in bash script

Question

I have a bit of a problem with designing a multiprocess bash script that goes through websites, follows found links and does some processing on every new page (it actually gathers email addresses, but that's an unimportant detail to the problem).

The script is supposed to work like this:

1) Downloads a page
2) Parses out all links and adds them to queue
3) Does some unimportant processing
4) Pops an URL from queue and starts from 1)

That itself would be quite simple to program; the problem arises from two restrictions and a feature the script needs to have.

The script must not process one URL twice
The script must be able to process n (supplied as argument) pages at once
The script must be POSIX compliant (with the exception of curl) -> so no fancy locks

Now, I've managed to come up with an implementation that uses two files as queues: one stores all URLs that have already been processed, and the other stores URLs that have been found but not yet processed.

The main process simply spawns a bunch of child processes that all share the queue files. In a loop, until the URLs-to-be-processed queue is empty, each child pops the top URL from the URLs-to-be-processed queue, processes the page, and tries to add every newly found link to the URLs-already-processed queue; if that succeeds (the URL is not already there), it adds the link to the URLs-to-be-processed queue as well.

The issue lies in the fact that you can't (AFAIK) make the queue-file operations atomic, and therefore locking is necessary. And locking in a POSIX compliant way is ... terror... slow terror.

The way I do it is the following:

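#Pops first element from a file ($1) and prints it to stdout; if the file is empty, prints nothing and returns 1
fuPop(){
if [ -s "$1" ]; then
        sed -n '1p' "$1"
        # GNU sed reads the letters after -i as a backup suffix, so "-ir" would leave a "<file>r" copy behind
        sed -i '1d' "$1"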
        return 0
else
        return 1
fi
}

#Appends line ($1) to a file ($2) and returns 0 if it's not in it yet; if it is, just returns 1
fuAppend(){
if grep -Fxq "$1" < "$2"; then
        return 1
else
        echo "$1" >> "$2"
        return 0
fi
}

#There're multiple processes running this function.
prcsPages(){
while [ -s "$todoLinks" ]; do
        luAckLock "$linksLock"
        linkToProcess="$(fuPop "$todoLinks")"
        luUnlock "$linksLock"

        prcsPage "$linkToProcess"
        ...
done
...
}


#The prcsPage downloads it, does some magic and then calls prcsNewLinks and prcsNewEmails that both get list of new emails / new urls in $1
#$doneEmails, ..., contain file path, $mailLock, ..., contain dir path

prcsNewEmails(){
luAckLock "$mailsLock"
for newEmail in $1; do
        if fuAppend "$newEmail" "$doneEmails"; then
                echo "$newEmail"
        fi
done
luUnlock "$mailsLock"
}

prcsNewLinks(){
luAckLock "$linksLock"
for newLink in $1; do
        if fuAppend "$newLink" "$doneLinks"; then
                fuAppend "$newLink" "$todoLinks"
        fi
done
luUnlock "$linksLock"
}
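
The luAckLock and luUnlock helpers are not shown in the question. Since the lock variables hold directory paths, a plausible reading is a mkdir-based lock: mkdir either creates the directory or fails, so only one process can win. The following is a minimal sketch under that assumption; the names lock_acquire and lock_release, the one-second retry interval, and the use of mktemp for the demo directory are all hypothetical and not taken from the original script.

```shell
#!/bin/sh
# Hypothetical stand-ins for the question's luAckLock / luUnlock (not shown there).
# mkdir either creates the directory or fails, so only one process wins.
lock_acquire() {
    # $1 = lock directory path
    while ! mkdir "$1" 2>/dev/null; do
        sleep 1   # POSIX sleep only guarantees whole seconds
    done
}

lock_release() {
    # $1 = lock directory path
    rmdir "$1"
}

# Demo: two background writers append to the same file under the lock.
demo_dir=$(mktemp -d)
lock="$demo_dir/queue.lock"
out="$demo_dir/out.txt"
for id in A B; do
    (
        lock_acquire "$lock"
        echo "writer $id" >> "$out"
        lock_release "$lock"
    ) &
done
wait
```

The whole-second sleep is one reason such a lock feels slow: a process that loses the race waits a full second even if the lock is released immediately afterwards.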

The problem is that my implementation is slow (like, really slow), almost so slow that it doesn't make sense to use more than 10 child processes (decreasing the lock wait time helped a great deal). You can actually disable the locks (just comment out the luAckLock and luUnlock bits) and it works quite OK (and much faster), but there are race conditions every once in a while with the `sed -i` calls and it just doesn't feel right.

The worst part (as I see it) is the locking in prcsNewLinks, as it takes quite a lot of time (most of the run time, basically) and practically prevents other processes from starting to process a new page (as that requires popping a new URL from the currently locked $todoLinks queue).

Now my question is, how to do it better, faster, and nicer?

The whole script is here (it contains some signal magic, a lot of debug outputs, and not that good code generally).

BTW: Yes, you're right, doing this in bash - and what's more, in a POSIX compliant way - is insane! But it's a university assignment, so I kinda have to do it.

//Though I feel it's not really expected of me to resolve these problems (as the race conditions arise more frequently only when having 25+ threads which is probably not something a sane person would test).

Notes to the code:

Yes, the wait should have (and already has) a random time. The code shared here was just a proof of concept written during real analysis lecture.
Yes, the number of debug outputs and their formatting is terrible and there should be a standalone logging function. That's, however, not the point of my problem.

Your `for var in $1` is a pretty nasty idiom. It breaks when `"$1"` contains spaces, since you don't quote it. And there's no point to using a loop at all. Just do `var=$1`. (word splitting and glob expansion don't happen in assignments, so you don't need quotes. But they don't hurt, so it's not bad practice to ALWAYS quote everywhere, even in assignments and inside `[[ $var == foo ]]` where they aren't needed.) — Peter Cordes, Apr 17 '15 at 22:26
Yep, I know. I usually quote everything, must've forgotten that one bit :). And thanks for the tip with not needing loop, it's elegant :). — Petrroll, Apr 17 '15 at 22:37
Normally you'd use a loop over "$@", BTW. Forgot to mention that. Also for really short functions, I sometimes don't bother assigning the positional parameters to `local meaningful_name=$1`. In shell functions, you should use `local` variables unless you WANT to interact with the caller's variables of the same name. — Peter Cordes, Apr 18 '15 at 00:33

Peter Cordes · Accepted Answer · 2015-04-18T12:46:31.090

First of all, do you need to implement your own HTML/HTTP spidering? Why not let wget or curl recurse through a site for you?

You could abuse the filesystem as a database, and have your queues be directories of one-line files. (or empty files where the filename is the data). That would give you producer-consumer locking, where producers touch a file, and consumers move it from the incoming to the processing/done directory.

The beauty of this is that multiple threads touching the same file Just Works. The desired result is the url appearing once in the "incoming" list, and that's what you get when multiple threads create files with the same name. Since you want de-duplication, you don't need locking when writing to the incoming list.
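
To make the idea concrete, here is a small sketch of a directory used as a de-duplicating queue. The helper names enqueue and claim, the temporary directory, and the example host names are illustrative assumptions; items must not contain `/` for this to work.

```shell
#!/bin/sh
# Sketch of a directory-based queue; all names here are illustrative.
queue_dir=$(mktemp -d)
mkdir "$queue_dir/incoming" "$queue_dir/done"

# Producer: creating the same file twice still leaves a single entry.
enqueue() {
    # $1 = queue item (must not contain '/')
    [ -e "$queue_dir/done/$1" ] || : > "$queue_dir/incoming/$1"
}

# Consumer: rename the file to claim it. Returns non-zero if it is already gone.
claim() {
    mv "$queue_dir/incoming/$1" "$queue_dir/done/$1" 2>/dev/null
}

enqueue "example.com"
enqueue "example.com"      # duplicate, no effect
enqueue "example.org"

pending=$(ls "$queue_dir/incoming" | wc -l | tr -d ' ')
echo "pending: $pending"

claim "example.com" && echo "claimed example.com"
claim "example.com" || echo "example.com already taken"
enqueue "example.com"      # already in done/, so it is not queued again
```

The check in enqueue and the file creation are two separate steps, so a rare race can still re-queue an item that was claimed in between; the sketch only shows why duplicate producers are harmless.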

1) Downloads a page

2) Parses out all links and adds them to queue

For each link found,

grep -Fxq -- "$url" already_checked || : > "incoming/$url"

or

[[ -e "done/$url" ]] || : > "incoming/$url"

3) Does some unimportant processing

4) Pops an URL from queue and starts from 1)

# mostly untested, you might have to debug / tweak this
# (`local` only works inside a function; drop it if this runs at top level)
local inc=( incoming/* )
# edit: this can make threads exit sooner than desired.
# See the comments for some ideas on how to make threads wait for new work
while [[ $inc != "incoming/*" ]]; do
    # $inc is shorthand for "${inc[0]}", the first array entry
    # mv failing means another thread got that link: refresh the list so we don't retry the same name forever
    mv "$inc" "done/" || { rm -f "$inc"; inc=( incoming/* ); continue; }
    url=${inc#incoming/}
    download "$url";
    for newurl in $(link_scan "$url"); do
        [[ -e "done/$newurl" ]] || : > "incoming/$newurl"
    done
    process "$url"
    inc=( incoming/* )
done

edit: encoding URLs into strings that don't contain / is left as an exercise for the reader. Although probably urlencoding / to %2F would work well enough.
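
One possible take on that exercise, as a sketch only: escape `%` first and then `/`, so the mapping can be reversed without ambiguity. The function names url_to_name and name_to_url are made up for this example, and the sketch ignores file name length limits (commonly 255 bytes), which long URLs can exceed.

```shell
#!/bin/sh
# Hypothetical helpers: turn a URL into a single file name and back.
url_to_name() {
    # Escape '%' first so decoding is unambiguous, then '/'.
    printf '%s\n' "$1" | sed -e 's/%/%25/g' -e 's|/|%2F|g'
}

name_to_url() {
    # Reverse order: restore '/' first, then '%'.
    printf '%s\n' "$1" | sed -e 's|%2F|/|g' -e 's/%25/%/g'
}

name=$(url_to_name "http://example.com/a%20b/index.html")
echo "$name"
name_to_url "$name"
```

Escaping `%` as well matters because a URL that already contains `%2F` would otherwise decode to a different URL than the one that was queued.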

I was thinking of moving URLs to a "processing" list per thread, but actually if you don't need to be able to resume from interruption, your "done" list can instead be a "queued & done" list. I don't think it's actually ever useful to mv "$url" "threadqueue.$$/" or something.

The "done/" directory will get pretty big, and start to slow down with maybe 10k files, depending on what filesystem you use. It's probably more efficient to maintain the "done" list as a file of one url per line, or a database if there's a database CLI interface that's fast for single commands.

Maintaining the done list as a file isn't bad, because you never need to remove entries. You can probably get away without locking it, even with multiple processes appending to it. (I'm not sure what happens if thread B writes data at EOF between thread A opening the file and thread A doing a write. Thread A's file position might be the old EOF, in which case it would overwrite thread B's entry, or worse, overwrite only part of it. If you do need locking, maybe flock(1) would be useful. It gets a lock, then executes the commands you pass as args.)
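
If locking does turn out to be needed, a sketch with flock(1) might look like the following. Note that flock comes from util-linux and is not POSIX, so it would not satisfy the question's constraint; the file names and the append_done helper are assumptions for illustration. The shell's `>>` opens the file in append mode, where each write lands at the current end of file, so the lock here mainly protects the check-then-append sequence.

```shell
#!/bin/sh
# Sketch: serialize check-then-append with flock(1) from util-linux (not POSIX).
work_dir=$(mktemp -d)
done_list="$work_dir/done.txt"
lock_file="$work_dir/done.lock"

append_done() {
    # $1 = URL. Returns 0 if the URL was newly added, 1 if it was already listed.
    (
        flock 9   # exclusive lock on fd 9, released when the subshell exits
        if grep -Fxq -- "$1" "$done_list" 2>/dev/null; then
            exit 1
        fi
        printf '%s\n' "$1" >> "$done_list"
    ) 9>"$lock_file"
}

# Demo: four concurrent writers, each adding one shared and one unique URL.
for i in 1 2 3 4; do
    (
        append_done "http://example.com/a"
        append_done "http://example.com/$i"
    ) &
done
wait
```

The return value mirrors fuAppend from the question, so a caller can decide whether a link still needs to be queued.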

If files broken by the lack of write locking don't actually occur, then you might not need write locking. The occasional duplicate entry in the "done" list will be a tiny slowdown compared to having to lock for every check/append.

If you need strictly-correct avoidance of downloading the same URL multiple times, you need readers to wait for the writer to finish. If you can just sort -u the list of emails at the end, it's not a disaster for a reader to read an old copy while the list is being appended. Then writers only need to lock each other out, and readers can just read the file. If they hit EOF before a writer manages to write a new entry, then so be it.

I'm not sure whether it matters if a thread adds entries to the "done" list before or after it removes them from the incoming list, as long as it adds them to "done" before processing. I was thinking that one way or the other might make races more likely to cause duplicate done entries, and less likely to make duplicate downloads / processing, but I'm not sure.

I'm curious to hear how well this works out in practice, since I didn't really test any of it. — Peter Cordes, Apr 18 '15 at 00:33
Suppose that the `incoming` directory is empty and the other _n-1_ processes are just performing the `download "$url"` line. Our process performs `inc=( incoming/* )`, which expands to `incoming/*` and exits the loop, although the processing may be far from done. I suggest making the outer loop `while true` and turning `inc=( incoming/* )` into `while [ ( incoming/* ) = 'incoming/*' ]; do sleep 1; done`. — Witiko, Apr 18 '15 at 12:33
Good catch. I remember thinking, hmm, "what if the readers empty the queue?", and concluding it wasn't likely. But actually, if you start with just the base URL in your `incoming`, then one thread will grab it and the others will all exit. Derp. I thought about just leaving the loop condition as `while true` (which I had originally), and leaving exiting as an exercise for the reader. — Peter Cordes, Apr 18 '15 at 12:35
If it's desirable for the script to detect the end of the processing, the sleepers could create a `sleepers/$PID` file when they are waiting for the `incoming` directory to fill up. The parent would then `kill $(jobs -p)` the children when `ls sleepers | wc -l` becomes `n`. — Witiko, Apr 18 '15 at 12:36
hmm, interesting idea. That looks like a good way to actually detect that all threads are done. I was thinking a control process could touch a file that makes the threads exit if it exists and they run out of work, but your idea is better. (But I'd do the check with pure bash: `sleepers=( sleepers/* )`, and check `"${#sleepers[@]}"`.) — Peter Cordes, Apr 18 '15 at 12:38
The `[ ( incoming/* ) = 'incoming/*' ]` I mentioned is also pseudocode. One way to test for directory emptiness would be `[ -z "$(ls -A incoming)" ]`. A way to do this via pure bash would be `(inc=( incoming/* ); [ ${inc[0]} = 'incoming/*' ])`, but I'm reluctant to recommend this, since this breaks, when a file `incoming/*` actually exists (not that it could happen in this case). I'm curious, if there is a robust bash-only way to perform this test. — Witiko, Apr 18 '15 at 13:41
AFAIK, only with `shopt -s nullglob`, to avoid the file-with-glob-name ambiguity you point out. Or `[[ -e 'incoming/*' ]]` to catch that case specifically, if the expansion matches the pattern might do the trick. — Peter Cordes, Apr 18 '15 at 13:58
That's actually fairly concise: `(inc=("$DIR"/*); [[ ${inc[0]} = "$DIR"/\* && ! -e ${inc[0]} ]])`. — Witiko, Apr 18 '15 at 14:15
`$inc` is a valid shorthand for `${inc[0]}`, BTW. The bash man page guarantees that no subscript is the same as `0`. I would still write `[0]` if I'm also using other array elements. — Peter Cordes, Apr 18 '15 at 17:55
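
The waiting and termination ideas from the comments above can be combined into a rough sketch. Everything here is an assumption made for illustration: the directory layout, the worker ids used as marker names, the process_item placeholder standing in for download and link scanning, and collecting PIDs via `$!` instead of `jobs -p`. It is meant to show the shape of the scheme, not a proven-correct implementation.

```shell
#!/bin/sh
# Sketch of the "sleepers" idea from the comments; process_item is a placeholder.
work=$(mktemp -d)
mkdir "$work/incoming" "$work/done" "$work/sleepers"
n=3

dir_is_empty() {
    # True when directory $1 has no entries (dotfiles included).
    [ -z "$(ls -A "$1")" ]
}

process_item() {
    # Placeholder for download + link scan: item "a" discovers "b" and "c".
    if [ "$1" = "a" ]; then
        for new in b c; do
            [ -e "$work/done/$new" ] || : > "$work/incoming/$new"
        done
    fi
}

worker() {
    # $1 = worker id, used as the name of the idle marker
    while :; do
        if dir_is_empty "$work/incoming"; then
            : > "$work/sleepers/$1"     # announce that this worker is idle
            sleep 1
            continue
        fi
        rm -f "$work/sleepers/$1"       # busy again before claiming anything
        for f in "$work/incoming"/*; do
            item=${f##*/}
            # mv fails if another worker claimed the item first
            mv "$f" "$work/done/$item" 2>/dev/null || continue
            process_item "$item"
        done
    done
}

: > "$work/incoming/a"                  # seed the queue with one item
pids=""
i=1
while [ "$i" -le "$n" ]; do
    worker "$i" &
    pids="$pids $!"
    i=$((i + 1))
done

# Parent: once every worker is idle, the queue is drained.
until [ "$(ls "$work/sleepers" | wc -l | tr -d ' ')" -eq "$n" ]; do
    sleep 1
done
kill $pids
wait 2>/dev/null
ls "$work/done"
```

A worker removes its marker before it claims an item, and only creates it after seeing an empty `incoming`, which is what lets the parent treat "all n markers present" as "no work left".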
