
I have this bash script that I made, which makes an API request for every user account in a very big list (almost 10,000 of them):

#!/bin/bash

#Variables
Names_list="/home/debian/names.list"
auth_key="***"

#For loop
Users=$(cat $Names_list)
for n in ${Users[@]}
do
        curl --silent --request GET \
        --url https://api.example.com/$n \
        --header 'authorization: Bearer '$auth_key'' \
        --data '{}' >> /home/debian/results.list
done

echo "Done."

My pain point is that, the way this currently works, my bearer token expires before the calls can complete. It only has a 30 minute lifetime, and the calls start returning an unauthorized error at around seven to eight thousand.

I understand that I can just split up the big list file with something like "split" and then set the script to background the task with &, but I cannot wrap my head around that part.

Since the API I am using is private and has no rate limiting, I was thinking of bursting the ~10,000 calls in batches of 1 or 2 thousand.

Like this:

#!/bin/bash

cat_split(){
   cat $file;
}

Split_results="/home/debian/split.d/"

for file in ${Split_results[@]}
do
        cat_split &
done

Yes, that does work as a PoC, but I don't know what the best way to approach this is now. Should I place my API call in another function, or have one function that does the cat and then the API call? What would you consider a proper way of going about this?

Thanks for any advice in advance.

  • Why not just use **GNU Parallel**? Type `[gnu-parallel]` in the Search box above. – Mark Setchell Jan 15 '22 at 16:52
  • @MarkSetchell Yes, I did see that when I was searching over stack, but I ultimately decided that I wouldn't use it since I don't really want to install anything. My user account has no sudo rights anyway. – ewofjo02jf0 Jan 15 '22 at 17:12
  • You don't need sudo rights and it's just a Perl script such as you might write yourself. I don't have any axe to grind, but you may like to read this: http://oletange.blogspot.com/2013/04/why-not-install-gnu-parallel.html – Mark Setchell Jan 15 '22 at 17:14
  • @MarkSetchell Since I lack a lot of knowledge on the topic, I mainly want to break away from mangling inefficient scripts together and actually start learning from people who have a good sense of logic with bash scripting. – ewofjo02jf0 Jan 15 '22 at 17:15
  • Well, IMHO, if you are trying to do lots of things, especially high-latency requests like `curl`, you're better off paying all those latencies in parallel rather than sequentially, one after another, but I know **GNU Parallel** is not for everyone. – Mark Setchell Jan 15 '22 at 17:19
  • Split your file $Names_list in two parts and run a for loop over each part in parallel. – Cyrus Jan 15 '22 at 17:23
  • @Cyrus So I can consider that the general "best course of action" would be to dish out everything in bulk? – ewofjo02jf0 Jan 15 '22 at 17:29
  • @MarkSetchell I do see myself finding a use case for it in the future, but I feel like using it in this case would be taking a side approach instead of trying to make do with what is built in on every system. – ewofjo02jf0 Jan 15 '22 at 17:31
  • Or, run your `for` loop as you have it but put an ampersand (`&`) after each `curl` to parallelise it and do a `wait` after every 32 requests so you never spawn more than 32 parallel requests. – Mark Setchell Jan 15 '22 at 17:48
  • Which curl version (`curl --version`) do you use? curl has a `--parallel` option since version 7.66.0. The list of URLs can be given to curl using `-K`, though I am not sure if 10,000 would work (see the sketch after these comments). – rowboat Jan 15 '22 at 17:49
  • I have curl version 7.74.0 on my computer, and thanks for that tip. I had no clue that curl had such an option. @rowboat – ewofjo02jf0 Jan 15 '22 at 17:55
  • Just to add, maybe putting all the URLs into a temp file (`mktemp`) and just feeding that file to the curl command with `--parallel`, instead of setting up a config file for `-K`, might be easier. Also, I have used `xargs` before along with curl for parallel downloads, but if you want to see the progress output it would be less than ideal. – Devansh Sharma Jan 15 '22 at 18:21
  • run a google search on `bash wait parallel`; you'll find a list of stackoverflow and stackexchange posts that will give you a few ideas; general idea is to place N number of jobs in the background (`&`; aka parallel) and then `wait`, with variations looking at running a batch of jobs in parallel and then .... a) `wait`ing for all to complete before doing the next batch ... b) `wait -n` for one to complete and then submit a new one ... c) periodically running `jobs` to get a count of the number of 'still running' jobs – markp-fuso Jan 15 '22 at 20:15
  • perhaps let [curl read list of URLs from a file](https://stackoverflow.com/a/66998627) ? – markp-fuso Jan 15 '22 at 20:19
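
To make the `--parallel` suggestion from the comments concrete, here is a rough, untested sketch. Assumptions: curl >= 7.66.0, and that curl copes with a ~10,000-line config file, which the comment above was itself unsure about. It writes one "url = ..." line per account into a temporary curl config file and hands the whole thing to a single curl invocation:

#!/bin/bash
# Sketch only: requires curl >= 7.66.0 for --parallel / --parallel-max.
auth_key="***"
cfg=$(mktemp)   # temporary curl config file, one "url = ..." line per account

while IFS= read -r n; do
    printf 'url = "https://api.example.com/%s"\n' "$n"
done < /home/debian/names.list > "$cfg"

# A single curl process handles all transfers, up to 100 at a time.
# Note: with a shared stdout, responses from parallel transfers may interleave.
curl --silent --parallel --parallel-max 100 \
     --request GET \
     --header "authorization: Bearer $auth_key" \
     --data '{}' \
     -K "$cfg" >> /home/debian/results.list

rm -f "$cfg"

Because everything runs inside one curl process there is no per-request startup cost; the main caveat is the possible interleaving of responses noted in the comments of the sketch.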

1 Answer


I understand that I can just split up the big list file with something like "split" and then set the script to background the task with &, but I cannot wrap my head around that part.

Put the stuff to execute in a function. Then use GNU parallel or xargs.

doit() {
   # build the command in an array so comments can sit between the options
   cmd=(
        curl --silent --request GET
        --url "https://api.example.com/$1"
        # check your scripts with shellcheck
        --header "authorization: Bearer $auth_key"
        --data '{}'
   )
   # execute it and capture the output in a variable
   tmp=$("${cmd[@]}")
   # output in a single call, so that hopefully buffering does not bite us that much
   # if it does, use a separate file for each call
   # or use GNU parallel
   printf "%s\n" "$tmp"
}
export auth_key   # export needed variables
export -f doit    # export needed functions

# I replaced /home/debian by ~
# run xargs that runs bash that runs the command with passed argument
# see man xargs , man bash
xargs -d '\n' -P 1000 -n 1 bash -c 'doit "$@"' _ < ~/names.list > ~/results.list
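
If GNU parallel is available, the same exported function can be driven by it instead of xargs. A minimal sketch, assuming the export lines above have already run and relying on the documented export -f pattern for calling bash functions from parallel:

# GNU parallel reads one name per line from the file (::::) and runs the exported
# function for each, at most 100 jobs at a time. By default parallel groups the
# output of each job, so responses do not interleave on stdout.
parallel -j 100 doit :::: ~/names.list > ~/results.list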

of bursting the ~10,000 calls in batches of 1 or 2 thousand.

You would have to write a manual loop for that:

trap 'kill $(jobs -p)' EXIT  # kill any still-running jobs when the script exits (e.g. on ctrl+c)
n=0
max=1000
# see https://mywiki.wooledge.org/BashFAQ/001
while IFS= read -r line; do
   if ((++n > max)); then
       wait  # wait for all the currently running processes
       n=0
   fi
   doit "$line" &
done < ~/names.list > ~/results.list
wait
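
One of the comments also suggests a rolling window with wait -n (bash 4.3+): instead of draining a whole batch before continuing, start a new job as soon as any running one finishes. A sketch of that variant, reusing the same doit function:

max=1000
while IFS= read -r line; do
   # if the window is full, wait for any one job to finish before starting another
   while (( $(jobs -rp | wc -l) >= max )); do
       wait -n
   done
   doit "$line" &
done < ~/names.list > ~/results.list
wait  # wait for the jobs still running at the end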

Problems with your script (a corrected sketch of the original loop follows this list):

  • Use shellcheck to check your scripts
  • do not use `for i in $(cat ...)`, and do not use it in the form `tmp=$(cat ...); for i in $tmp` either. See https://mywiki.wooledge.org/BashFAQ/001
  • Users is not an array
  • $n and $auth_key are not quoted. In particular, because $auth_key is expanded unquoted, a value like "***" would be glob-expanded to the names of all the files in your current directory.
  • your second script does not `wait` for the background jobs to finish.
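
For reference, a minimal rewrite of the original loop with those fixes applied (while read instead of for over cat, every expansion quoted). It is still sequential, so by itself it does not solve the token-expiry problem:

#!/bin/bash
names_list="/home/debian/names.list"
auth_key="***"

# read one account name per line and quote every expansion
while IFS= read -r n; do
    curl --silent --request GET \
         --url "https://api.example.com/$n" \
         --header "authorization: Bearer $auth_key" \
         --data '{}'
done < "$names_list" >> /home/debian/results.list

echo "Done."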
– KamilCuk