
I just recently started programming in bash and came across GNU Parallel, which is exactly what I need for my project. I have a basic loop script, which is meant to loop through a list of IPs and ping each one once. The list of IPs is constantly updated with new ones by another script.

For multithreading, I would like to use GNU Parallel.

My idea was to run 10 parallel instances; each would capture one IP from the list, insert it into the curl command, and remove it from the list so that the other instances won't pick it up.

#!/bin/bash
while true; do

  while read -r ip; do
    curl "$ip" >> result.txt   # fetch the current IP
    sed -i '1d' ipslist        # drop the line that was just processed
  done < ipslist
done

I'm not sure what the right way to run the bash script is in this case; every solution I could find doesn't work properly and things get totally messy. I have a feeling this can all be done in a single line, but for my own reasons I'd prefer to run it as a bash script. I would be grateful for any help!

– Nikolas K
  • Are you using this to check which hosts are up and serving? Maybe it's easier to use `nmap` for that instead of writing your own. I'm not certain `nmap` can actually perform HTTP requests, but just checking that 80/TCP is open might be sufficient; a sketch follows these comments. – Thomas Mar 19 '18 at 10:11
  • Thanks for the suggestion. I'm aware of the strengths of nmap and other tools for many tasks, but for this project I need to use curl. – Nikolas K Mar 19 '18 at 11:51
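
A minimal sketch of the nmap approach suggested in the first comment, checking only whether 80/TCP is open (the awk filter is illustrative; assumes ipslist holds one host per line):

nmap -p 80 --open -iL ipslist -oG - | awk '/80\/open/{print $2}'   # print hosts with port 80 open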

4 Answers


Thomas' solution looks correct for this particular situation. If, however, you need to do more than simply curl, then I recommend making a function:

#! /bin/bash

doit() {
  ip="$1"
  curl "$ip"
  echo do other stuff here   # placeholder for whatever else each IP needs
}
export -f doit   # make the function visible to the shells GNU Parallel spawns

while true; do
  parallel -j10 doit < ipslist >> result.txt
done

If you want ipslist to be a queue, so you can later add entries to it and each entry is only curled once:

tail -n+0 -f ipslist | parallel doit >> result.txt

Now you can later simply add entries to ipslist and GNU Parallel will curl those, too.
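
For example, while the pipeline above is running, another shell (or the script that generates the list) can append an address and it will be picked up automatically; the address below is only a placeholder:

echo "203.0.113.7" >> ipslist   # the running tail -f feeds the new line to a free job slot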

(There is a small issue when using GNU Parallel as a queue system/batch manager: you have to submit JobSlot number of jobs before they will start, and after that you can submit one at a time; a job will start immediately if a free slot is available. Output from the running or completed jobs is held back and will only be printed when JobSlot more jobs have been started (unless you use --ungroup or --line-buffer, in which case the output from the jobs is printed immediately). E.g. if you have 10 job slots, the output from the first completed job will only be printed when job 11 has started, and the output of the second completed job will only be printed when job 12 has started.)
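
If the held-back output is a problem for the queue variant above, a minimal adjustment is the --line-buffer flag mentioned in the parenthesis:

tail -n+0 -f ipslist | parallel --line-buffer doit >> result.txt   # print each job's output line by line as it is produced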

– Ole Tange

This works for me:

#!/bin/bash

while true; do
  parallel -j10 curl '{}' < ipslist >> result.txt
done

If that's not what you intended, please update your question to clarify.

– Thomas
  • While it looks elegant and simple, I'd rather not use the parallel command from within the script, but run the script with parallel from the terminal. As the script will grow later, it would be better to keep it as pure bash. Also, I don't see how your solution avoids using the same line over and over again, but I might be wrong. I need something like a `parallel -n0 script.sh ::: {1..10}` solution, but for some reason it just doesn't work for me. – Nikolas K Mar 19 '18 at 11:54
  • Please edit your question to clarify what you need exactly, and why that solution doesn't work. – Thomas Mar 19 '18 at 14:08
  • @NikolasK The reason why it does not work is most likely that you both read and edit the same file in parallel from multiple processes. That is pretty hard to get right without a race condition. Also, @Thomas' solution is likely faster, as he does not edit `ipslist`. – Ole Tange Mar 21 '18 at 07:51
  • @NikolasK Are you aware that you can embed GNU Parallel in a bash script using `--embed`? Available from version 20180222. – Ole Tange Mar 21 '18 at 07:52
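
A minimal sketch of the --embed workflow from the last comment (the script name is illustrative; requires GNU Parallel version 20180222 or later):

parallel --embed > myscript.sh   # emit a shell script with GNU Parallel's source embedded
# append your own bash code (e.g. the doit function and the parallel call) to myscript.sh
chmod +x myscript.sh
./myscript.sh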

I would just use xargs. Not many people seem to know this, but there is much more to it than the standard usage of squeezing every line of the input onto a single command line. That is, this:

echo -e "A\nB\nC\nD\nE" | xargs do_something

would essentially mean the same as this:

do_something A B C D E

However, you can specify how many lines are processed in one chunk using the -L option:

echo -e "A\nB\nC\nD\nE" | xargs -L2 do_something

would translate to:

do_something A B
do_something C D

Additionally, you can specify how many of these chunks run in parallel with the -P option. So to process the lines one by one with a parallelism of, say, 3, you would say:

echo -e "A\nB\nC\nD\nE" | xargs -L1 -P3 do_something

Et voilà, you have proper parallel execution with basic Unix tools.

The only catch is that you have to make sure you separate the outputs. I am not sure whether this has been thought of before, but a solution for the curl case is something like this:

cat url_list.txt | xargs -L1 -P10 curl -o parallel_#0.html

where #0 will be replaced by cURL with the URL being fetched. See the manuals for further details.
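
An alternative sketch that separates the outputs without relying on curl's #-substitution, deriving a filename from each URL (the sh -c wrapper and the naming scheme are illustrative, not part of the answer above):

< url_list.txt xargs -L1 -P10 sh -c 'curl -s "$1" -o "$(printf "%s" "$1" | tr -c "[:alnum:]" "_").html"' sh   # one output file per URL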

– Dan

You can do this and it will work:

#!/bin/bash
while true; do

   while read -r ip; do
      curl "$ip" >> result.txt &   # fetch in the background
      sed -i '1d' ipslist          # drop the processed line
   done < ipslist
   wait   # let all background curls finish before the next pass
done
– Matias Barrios