
I have written a small bash script for crawling an XML sitemap of URLs. It retrieves 5 URLs in parallel using xargs.

Now I want an email to be sent when all URLs have been crawled, so the script has to wait until all sub-processes of xargs have finished and then send the mail.

I have tried with a pipe after the xargs:

#!/bin/bash

wget --quiet --no-cache -O- http://some.url/test.xml | egrep -o "http://some.url[^<]+" | xargs -P 5 -r -n 1 wget --spider | mail...

and with wait:

#!/bin/bash

wget --quiet --no-cache -O- http://some.url/test.xml | egrep -o "http://some.url[^<]+" | xargs -P 5 -r -n 1 wget --spider

wait

mail ... 

Neither of these works; the email is sent immediately after the script is executed. How can I achieve this? Unfortunately I don't have the parallel program on my server (managed hosting).

Alex
  • See [this question](http://stackoverflow.com/questions/356100) on Stack Overflow. You collect the PID(s) of your wget(s) with `$!` and then wait for each of them (if some of them have already finished, wait returns immediately). –  Dec 06 '16 at 15:08
  • How would I do that? Since xargs starts the child processes, I have no way to get their PIDs. And xargs automatically starts new processes as soon as one finishes. – Alex Dec 06 '16 at 15:59
  • Good point. I'd avoid the xargs and instead do some loops. That got too long to add to a comment; hence, an answer below. –  Dec 07 '16 at 16:15
  • I found out that my second version with the `wait` is actually working now, no idea why it wasn't working at the time I tested it for the first time. – Alex Dec 12 '16 at 11:29
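
For reference: the second version should indeed work, because the pipeline only completes once xargs has exited, and xargs in turn waits for all of its child wget processes, so the explicit wait is effectively redundant. A minimal sketch of that approach, using the same placeholder sitemap URL and a placeholder recipient address:

#!/bin/bash

# Crawl all URLs from the sitemap, 5 at a time. xargs does not exit until
# every wget it spawned has finished, so the next command only runs after
# the whole crawl is done.
wget --quiet --no-cache -O- http://some.url/test.xml \
  | egrep -o "http://some.url[^<]+" \
  | xargs -P 5 -r -n 1 wget --spider

# All URLs have been crawled at this point; send the notification.
echo "Sitemap crawl finished." | mail -s "Crawl done" some.user@some.domain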

1 Answer


Instead of using xargs, spawn each wget individually in the background and collect the PIDs of the background processes in a list. Additionally, make sure the output of each background process is written to its own file.

Once all background processes have been spawned, go through the PIDs on the list and wait on each one; the ones that have already exited will not block at wait. Having, hopefully, waited on all background processes successfully, all that is left to do is to concatenate the outputs of the individual processes into a single file and mail that file to wherever the output is needed.

Something along the lines of this (the echoes are, of course, redundant and for demonstration purposes only):

#!/bin/bash

mail=$(tempfile)
pids=()
outputs=()

# Clean up the temporary files on exit. Single quotes so the array is
# expanded when the trap fires, not when it is set.
trap 'rm -f "$mail" "${outputs[@]}"' EXIT

for url in $(wget --quiet --no-cache -O- http://some.url/test.xml |\
             egrep -o "http://some.url[^<]+") ; do
  output=$(tempfile)
  wget --spider "$url" > "$output" 2>&1 &
  pids+=($!)
  outputs+=("$output")
  echo "Spawned wget and got PID ${pids[-1]}."
done

for pid in "${pids[@]}" ; do
  echo "Waiting for PID $pid."
  wait "$pid"
done

# Concatenate outputs from individual processes into a single file.
for output in "${outputs[@]}" ; do cat "$output" >> "$mail" ; done

# Mail that file.
< "$mail" mail -s "All outputs" some.user@some.domain

# end of file.
  • Hi Sami, thanks for your example. While that would possibly work, spawning all wget processes simultaneously is not an option here, because I have about 10000 URLs in that XML sitemap. That would mean 10000 simultaneous wget processes, which is likely to be too much for the server. – Alex Dec 08 '16 at 15:44
  • Obviously you could fiddle with the for-loop a bit to make it spawn a dozen, then wait for those, and so on (see the sketch after these comments). I'm sure you know that Server Fault is not a free programming service, so you're kind of expected to improve on the examples given in answers. –  Dec 08 '16 at 15:58
  • Yes, I know that, thanks. I found out that my second version with the `wait` is actually working now, no idea why it wasn't working at the time I tested it for the first time. – Alex Dec 12 '16 at 11:30
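
For completeness, a minimal sketch of the batching idea from the comment above: spawn a fixed-size batch of background wget processes, wait for the whole batch, then move on to the next batch. The sitemap URL, recipient address and batch size are placeholders; the batch size of 5 mirrors the -P 5 from the question.

#!/bin/bash

mail=$(tempfile)
trap 'rm -f "$mail"' EXIT

batch=5   # number of wget processes to run at a time
pids=()

for url in $(wget --quiet --no-cache -O- http://some.url/test.xml |
             egrep -o "http://some.url[^<]+") ; do
  wget --spider "$url" >> "$mail" 2>&1 &
  pids+=($!)
  # Once a full batch has been spawned, wait for all of it before continuing.
  if [ "${#pids[@]}" -ge "$batch" ] ; then
    wait "${pids[@]}"
    pids=()
  fi
done

# Wait for the last, possibly partial, batch.
wait "${pids[@]}"

< "$mail" mail -s "All outputs" some.user@some.domain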