
Hi everyone!

I'm trying to make parallel curl requests from an array of URLs to speed up a bash script. After some research I found that there are several approaches: GNU parallel, xargs, curl's built-in --parallel option (available since 7.68.0), and backgrounding with an ampersand. The best option would be to avoid GNU parallel, since it requires installing it and I'm not allowed to do that.

Here is my initial script:

#!/bin/bash

external_links=(https://www.designernews.co/error https://www.awwwards.com/error1/ https://dribbble.com/error1 https://www.designernews.co https://www.awwwards.com https://dribbble.com)
invalid_links=()

for external_link in "${external_links[@]}"
  do
      curlResult=$(curl -sSfL --max-time 60 --connect-timeout 30 --retry 3 -4 "$external_link" 2>&1 > /dev/null) && status=$? || status=$?
      if [ $status -ne 0 ] ; then
          if [[ $curlResult =~ (error: )([0-9]{3}) ]]; then
              error_code=${BASH_REMATCH[0]}
              invalid_links+=("${error_code} ${external_link}")
              echo "${external_link}"
          fi
      fi
      i=$((i+1))
  done

echo "Found ${#invalid_links[@]} invalid links: "
printf '%s\n' "${invalid_links[@]}"

I tried changing the curl options and adding xargs or an ampersand, but didn't succeed. All the examples I found either used GNU parallel or read the data from a file; none of them worked with a variable containing an array of URLs (Running programs in parallel using xargs, cURL with variables and multiprocessing in shell). Could you please help me with this issue?

– bullet03

2 Answers


The main problem is that you can't modify your array directly from a sub-process. A possible work-around is to use a FIFO file for transmitting the results of the sub-processes to the main program.
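
To illustrate the first point, here is a minimal sketch (the variable names and URLs are made up): a += executed in a backgrounded subshell modifies that subshell's copy of the array, so the parent's array stays empty.

#!/bin/bash

# minimal sketch of the pitfall: the += runs in a background subshell,
# so it updates a copy of the array; the parent never sees the change
results=()
for url in https://example.com/a https://example.com/b
do
    ( results+=("$url") ) &
done
wait
echo "${#results[@]}"    # prints 0, not 2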
remark: as long as each message is shorter than PIPE_BUF bytes (see getconf PIPE_BUF /), writes to the FIFO file are guaranteed to be atomic.

#!/bin/bash

# create a FIFO in a temporary directory and open it read-write on fd 3
tempdir=$(mktemp -d) &&
mkfifo "$tempdir/fifo" &&
exec 3<> "$tempdir/fifo" || exit 1

external_links=(
    https://www.designernews.co/error
    https://www.awwwards.com/error1/
    https://dribbble.com/error1
    https://www.designernews.co
    https://www.awwwards.com
    https://dribbble.com
)

for url in "${external_links[@]}"
do
    {
        # capture the status code first, then emit "code/url" in a single
        # write so that output from concurrent jobs cannot interleave
        http_code=$(curl -4sLI -o /dev/null -w '%{http_code}' --max-time 60 --connect-timeout 30 --retry 2 "$url")
        printf '%s/%s\n' "$http_code" "$url"
    } >&3 &
done

invalid_links=()

# collect exactly one result per background job
for (( i = ${#external_links[@]}; i > 0; i-- ))
do
    IFS='/' read -u 3 -r http_code url
    (( 200 <= http_code && http_code <= 299 )) || invalid_links+=( "$http_code $url" )
done

echo "Found ${#invalid_links[@]} invalid links:"
(( ${#invalid_links[@]} > 0 )) && printf '%s\n' "${invalid_links[@]}"

remarks:

  • Here I consider a link valid when curl's output is in the 200-299 range.
  • When the web server doesn't exist or doesn't reply, curl's output is 000 (see the quick check below).
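
A quick way to see the 000 case (the hostname below is deliberately non-existent):

# no HTTP response is ever received, so %{http_code} expands to 000
curl -4sLI -o /dev/null -w '%{http_code}\n' --connect-timeout 5 https://no-such-host.invalid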

UPDATE

Limiting the number of curl parallel requests to 10.

It's possible to add the logic for that in the shell itself (a sketch of that is shown after the xargs variants below), but if you have BSD or GNU xargs then you can replace the whole for url in "${external_links[@]}"; do ...; done loop with:

printf '%s\0' "${external_links[@]}" |

xargs -0 -P 10 -n 1 sh -c '
    # capture the status code, then emit "code/url" in a single write
    http_code=$(
        curl -4sLI \
             -o /dev/null \
             -w "%{http_code}" \
             --max-time 60 \
             --connect-timeout 30 \
             --retry 2 "$0"
    )
    printf "%s/%s\n" "$http_code" "$0"
' 1>&3

Furthermore, if your curl is at least 7.75.0 then you should be able to replace the sh -c '...' with a single curl command (untested):

printf '%s\0' "${external_links[@]}" |

xargs -0 -P 10 -n 1 \
    curl -4sLI \
         -o /dev/null \
         -w "%{http_code}/%{url}\n" \
         --max-time 60 \
         --connect-timeout 30 \
         --retry 2 \
1>&3
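
And for completeness, here is roughly what the pure-shell throttling mentioned above could look like. This is only a sketch: it assumes bash 4.3+ (for wait -n), reuses the FIFO opened on fd 3 by the main script, and max_jobs is just an illustrative variable.

max_jobs=10

for url in "${external_links[@]}"
do
    # when the limit is reached, block until one background job finishes
    while (( $(jobs -rp | wc -l) >= max_jobs ))
    do
        wait -n
    done
    {
        http_code=$(curl -4sLI -o /dev/null -w '%{http_code}' --max-time 60 --connect-timeout 30 --retry 2 "$url")
        printf '%s/%s\n' "$http_code" "$url"
    } >&3 &
done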
– Fravadona
  • I know that it wasn't in the initial question, but is there a way to optimize this code so that it makes no more than 10 parallel curl requests? I found that xargs and curl --parallel have such an option. – bullet03 Aug 15 '22 at 13:09
  • @bullet03 I added a BSD/GNU `xargs` solution – Fravadona Aug 15 '22 at 14:01

I appreciate that you can't use the GNU parallel option at present, but just for reference, here is a strategy that uses curl's built-in --parallel option and stores the HTTP response data in files. Consider this solution.

declare -a external_links=(
    https://www.designernews.co/error 
    https://www.awwwards.com/error1/ 
    https://dribbble.com/error1 
    https://www.designernews.co 
    https://www.awwwards.com 
    https://dribbble.com
)

# change to any writeable directory
cd

declare -a invalid_links=()

# clear the file in which http responses will be stored
echo "" > response_data.txt

# output each completing parallel http response data to a single file
curl -sSfL "${external_links[@]}" \
--parallel \
--max-time 60 \
--connect-timeout 30 \
--retry 3 \
--write-out "\nEND OF RESPONSE FOR URL^%{url_effective}^%{remote_ip}^%{http_code}\n" \
> response_data.txt \
2> errorfile.txt

# clear a file to store http responses temporarily
echo " " > outfile.txt

cat response_data.txt | \
while IFS= read -r line
do
    if [[ ! "$line" =~ 'END OF RESPONSE' ]]
    then
        echo "$line" >> outfile.txt
    else
        # derive a per-response filename from the remote IP and status code
        filename=${line#'END OF RESPONSE FOR URL^'*'^'}
        filename=${filename//^/_}
        # extract the effective URL (second ^-delimited field)
        effective_url=${line#'END OF RESPONSE FOR URL^'}
        effective_url=${effective_url%%'^'*}
        # extract the http status code (last ^-delimited field)
        response_code=${line##*'^'}
        echo "effective_url: $effective_url"
        echo "response_code: $response_code" && echo
        if [ "$response_code" -ne 200 ]
        then
            invalid_links+=( "${response_code} ${effective_url}" )
            echo "Found ${#invalid_links[@]} invalid links: " >result_file.txt
            printf '%s\n' "${invalid_links[@]}" >> result_file.txt
        fi
        cp outfile.txt "$filename".txt
        echo " " > outfile.txt
    fi
done

cat result_file.txt

Output:

curl: (22) The requested URL returned error: 404 Not Found
curl: (22) The requested URL returned error: 404 
curl: (22) The requested URL returned error: 404 
effective_url: https://www.designernews.co/error
response_code: 404

effective_url: https://www.awwwards.com/error1/
response_code: 404

effective_url: https://dribbble.com/
response_code: 200

effective_url: https://dribbble.com/error1
response_code: 404

effective_url: https://www.awwwards.com/
response_code: 200

effective_url: https://www.designernews.co/
response_code: 200

Found 3 invalid links: 
404 https://www.designernews.co/error
404 https://www.awwwards.com/error1/
404 https://dribbble.com/error1

Perhaps some adjustment is still needed, but all required information is available.

– adebayo10k