I have a bit of a problem designing a multiprocess bash script that goes through websites, follows the links it finds, and does some processing on every new page (it actually gathers email addresses, but that's an unimportant detail for this problem).
The script is supposed to work like this:
- Downloads a page
- Parses out all links and adds them to the queue
- Does some unimportant processing
- Pops a URL from the queue and starts again from the first step
That in itself would be quite simple to program; the problem arises from two restrictions and a feature the script needs to have.
- The script must not process one URL twice
- The script must be able to process n (supplied as argument) pages at once
- The script must be POSIX compliant (with the exception of curl), so no fancy locks
Now, I've managed to come up with an implementation that uses two files as queues: one stores all URLs that have already been processed, and the other stores URLs that have been found but not yet processed.
The main process simply spawns a bunch of child processes that all share the queue files. In a loop, until the URLs-to-be-processed queue is empty, each child pops the top URL from the URLs-to-be-processed queue, processes the page, and tries to add every newly found link to the URLs-already-processed queue; if that succeeds (the URL is not already there), it adds the link to the URLs-to-be-processed queue as well.
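To make the "spawn n children" part concrete, here is a minimal sketch of what the main process could look like. The names `worker` and `spawnWorkers` are made up for illustration (they are not from the actual script), and `worker` is only a placeholder for the real per-process loop.

```shell
#!/bin/sh
# Hypothetical sketch: start n background workers and wait for all of them.
# worker() is a placeholder; the real script would run its page loop here.
worker() {
    # $1 is the worker index, used only for this demo output
    printf 'worker %s started\n' "$1"
}

spawnWorkers() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        worker "$i" &      # each worker runs as a separate child process
        i=$((i + 1))
    done
    wait                   # returns once every background child has exited
}

# n is taken from the first argument; 4 is an arbitrary default for the demo
spawnWorkers "${1:-4}"
```

The order of the output lines is not deterministic, since the children run concurrently.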
The issue is that (AFAIK) you can't make the queue-file operations atomic, so locking is necessary. And locking in a POSIX-compliant way is... terror. Slow terror.
The way I do it is the following:
# Pops the first line from a file ($1) and prints it to stdout; if the file is empty, prints nothing and returns 1
fuPop(){
    if [ -s "$1" ]; then
        sed -n '1p' "$1"
        # note: -i (in-place editing) is a GNU sed extension, not POSIX
        sed -i '1d' "$1"
        return 0
    else
        return 1
    fi
}
# Appends a line ($1) to a file ($2) and returns 0 if the line is not in it yet; if it is, just returns 1
fuAppend(){
    # -e keeps grep from treating a pattern that starts with "-" as an option
    if grep -Fxq -e "$1" < "$2"; then
        return 1
    else
        printf '%s\n' "$1" >> "$2"
        return 0
    fi
}
# Multiple processes run this function.
prcsPages(){
    while [ -s "$todoLinks" ]; do
        luAckLock "$linksLock"
        linkToProcess="$(fuPop "$todoLinks")"
        luUnlock "$linksLock"
        prcsPage "$linkToProcess"
        ...
    done
    ...
}
# prcsPage downloads the page, does some magic, and then calls prcsNewLinks and prcsNewEmails, which both get the list of new emails / new URLs in $1
# $doneEmails, ..., contain file paths; $mailsLock, ..., contain directory paths
prcsNewEmails(){
    luAckLock "$mailsLock"
    for newEmail in $1; do
        if fuAppend "$newEmail" "$doneEmails"; then
            echo "$newEmail"
        fi
    done
    luUnlock "$mailsLock"
}
prcsNewLinks(){
    luAckLock "$linksLock"
    for newLink in $1; do
        if fuAppend "$newLink" "$doneLinks"; then
            fuAppend "$newLink" "$todoLinks"
        fi
    done
    luUnlock "$linksLock"
}
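The `luAckLock` and `luUnlock` helpers are not shown above. Since the lock variables hold directory paths, a `mkdir`-based lock is a plausible shape for them; the sketch below is only an assumption about how such helpers might look, not the actual implementation. It relies on `mkdir` failing when the directory already exists, so only one process can create it. The whole-second `sleep` is there because POSIX `sleep` only guarantees integer arguments, which is one reason this kind of lock tends to be slow.

```shell
#!/bin/sh
# Hypothetical stand-ins for luAckLock / luUnlock (the real ones are not shown).
# mkdir either creates the directory or fails, so only one process can win.
luAckLock() {
    until mkdir "$1" 2>/dev/null; do
        sleep 1            # POSIX sleep only guarantees whole seconds
    done
}

luUnlock() {
    rmdir "$1"
}

# Demo: two background jobs append to a shared file under the lock.
# lockDir and outFile are made-up paths for this example only.
lockDir="${TMPDIR:-/tmp}/demo.lock.$$"
outFile="${TMPDIR:-/tmp}/demo.out.$$"
: > "$outFile"
for id in a b; do
    (
        luAckLock "$lockDir"
        printf '%s\n' "$id" >> "$outFile"
        luUnlock "$lockDir"
    ) &
done
wait
sort "$outFile"
rm -f "$outFile"
```

A process that dies while holding the lock leaves the directory behind, so a real script would also need some cleanup (for example in a signal trap).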
The problem is that my implementation is slow (like, really slow), almost so slow that it doesn't make sense to use more than 10 child processes (it used to be 2; decreasing the lock waiting time helped a great deal). You can actually disable the locks (just comment out the luAckLock and luUnlock bits) and it works quite OK (and much faster), but there are race conditions every once in a while with the sed -i calls, and it just doesn't feel right.
The worst part (as I see it) is the locking in prcsNewLinks, as it takes quite a lot of time (most of the run time, basically) and practically prevents other processes from starting on a new page (since that requires popping a new URL from the currently locked $todoLinks queue).
Now my question is: how can I do it better, faster, and nicer?
The whole script is here (it contains some signal magic, a lot of debug outputs, and not that good code generally).
BTW: Yes, you're right, doing this in bash (and, what's more, in a POSIX-compliant way) is insane! But it's a university assignment, so I kind of have to do it.
//Though I feel it's not really expected of me to resolve these problems (as the race conditions arise more frequently only with 25+ threads, which is probably not something a sane person would test).
Notes to the code:
- Yes, the wait should have (and already has) a random duration. The code shared here was just a proof of concept written during a real analysis lecture.
- Yes, the number of debug outputs and their formatting is terrible, and there should be a standalone logging function. That, however, is not the point of my problem.