
Set Up

I currently have the script below working to download files with curl, using a ref file with multiple variables. When I created the script it suited my needs, but as the ref file has grown and the data I am requesting via curl takes longer to generate, the script now takes too much time to complete.

Objective

I want to update this script so that curl requests and downloads multiple files as they are ready, rather than waiting for each file to be requested and downloaded sequentially.

I've had a look around and seen that I could use either xargs or parallel to achieve this. However, based on the past questions, YouTube videos and other forum posts I've seen, I haven't been able to find an example that explains whether this is possible using more than one variable.
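
To make the question concrete, this is the shape of invocation I have in mind, though I haven't been able to verify it (a hypothetical sketch; the worker name and the exact flags are guesses on my part):

# hypothetical: hand each line's five columns to one worker, four workers at a time
# (assumes request_and_download is defined and exported, and each line has exactly five fields)
xargs -a account-list.tsv -n5 -P4 bash -c 'request_and_download "$@"' _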

Can someone confirm whether this is possible, and which tool is better suited to achieve it? Is my current script in the right configuration, or do I need to amend a lot of it to shoehorn these commands in?

I suspect this may be a question that's been asked previously and I may just not have found the right one.

account-list.tsv

client1 account1    123 platform1   50
client2 account1    234 platform1   66
client3 account1    344 platform1   78
client3 account2    321 platform1   209
client3 account2    321 platform2   342
client4 account1    505 platform1   69

download.sh

#!/bin/bash
set -eu

user="user"
pwd="pwd"
D1=$(date "+%Y-%m-%d" -d "1 days ago")
D2=$(date "+%Y-%m-%d" -d "1 days ago")
curr=$D2
cheese=$(pwd)

curl -o /dev/null -s -S -L -f -c cookiejar 'https://url/auth/' -d name="$user" -d passwd="$pwd"

while true; do

        while IFS=$'\t' read -r client account accountid platform platformid
        do
                curl -o /dev/null -s -S -f -b cookiejar -c cookiejar 'https://url/auth/' -d account="$accountid"
                curl -sSfL -o "$client€$account@$platform£$curr.xlsx" -J -b cookiejar -c cookiejar "https://url/platform=$platformid&date=$curr"
        done < account-list.tsv

        [ "$curr" \< "$D1" ] || break
        curr=$( date +%Y-%m-%d --date "$curr +1 day" ) ## used in instances where I need to grab data for past date ranges.

done

exit
– El_Birdo

2 Answers


One (relatively) easy way to run several processes in parallel is to wrap the guts of the call in a function and then call that function inside the while loop, making sure to put the call in the background. Each backgrounded call runs in a subshell that inherits the current values of the variables set by read, so the function body can use them directly, e.g.:

# function definition

docurl () {
    # authenticate for this account, then download the report;
    # note: all concurrent calls share one cookiejar, so the -c writes may race
    curl -o /dev/null -s -S -f -b cookiejar -c cookiejar 'https://url/auth/' -d account="$accountid"
    curl -sSfL -o "$client€$account@$platform£$curr.xlsx" -J -b cookiejar -c cookiejar "https://url/platform=$platformid&date=$curr"
}

# call the function within OP's inner while loop

while true; do

    while IFS=$'\t' read -r client account accountid platform platformid
    do
        docurl &            # put the function call in the background so we can continue loop processing while the function call is running

    done < account-list.tsv

    wait                    # wait for all background calls to complete 

    [ "$curr" \< "$D1" ] || break

    curr=$( date +%Y-%m-%d --date "$curr +1 day" ) ## used in instances where I need to grab data for past date ranges.
done

One issue with this approach is that a large volume of curl calls may bog down the underlying system and/or cause the remote system to reject 'too many' concurrent connections. In that case it will be necessary to limit the number of concurrent curl calls.

One idea would be to keep a counter of the number of currently running (backgrounded) curl calls and, when we hit a limit, wait for a background process to complete before spawning a new one, e.g.:

max=5                       # limit of 5 concurrent/backgrounded calls
ctr=0

while true; do

    while IFS=$'\t' read -r client account accountid platform platformid
    do
        docurl &

        ctr=$((ctr+1))

        if [[ "${ctr}" -ge "${max}" ]]
        then
            wait -n         # wait for any one background process to complete (requires bash 4.3+)
            ctr=$((ctr-1))
        fi

    done < account-list.tsv

    wait                    # wait for last ${ctr} background calls to complete

    [ "$curr" \< "$D1" ] || break

    curr=$( date +%Y-%m-%d --date "$curr +1 day" ) ## used in instances where I need to grab data for past date ranges.
done
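
Since the question also mentions xargs, the same throttling can be delegated to GNU xargs via its -P option. A rough, untested sketch, assuming GNU xargs, field values with no embedded spaces or quote characters, and a worker that takes the five fields as arguments:

# worker takes the five fields as positional parameters
doone () {
    client=$1 account=$2 accountid=$3 platform=$4 platformid=$5
    curl -o /dev/null -s -S -f -b cookiejar -c cookiejar 'https://url/auth/' -d account="$accountid"
    curl -sSfL -o "$client€$account@$platform£$curr.xlsx" -J -b cookiejar -c cookiejar "https://url/platform=$platformid&date=$curr"
}
export -f doone
export curr         # the worker runs in a child shell, so curr must be exported

# -n5: five tokens (one line) per invocation; -P5: at most five workers at once
xargs -a account-list.tsv -n5 -P5 bash -c 'doone "$@"' _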
– markp-fuso
  • Sorry for the slow reply. Didn't see the notification that there was a response. I tested out your approach and was able to download a couple of files before I got `curl: (22) The requested URL returned error: 500 Internal Server Error`. I tested it with `max=` being set as low as 1 but it still spat out the error. I'm not too familiar with curl errors but I'm guessing this relates to your point about remote systems rejecting my requests? So perhaps where I am requesting files from will limit me to how many requests I can send. – El_Birdo Nov 16 '20 at 14:19
  • would need more info on the actual error message to understand what's going on – markp-fuso Nov 16 '20 at 14:20
  • Sorry, I'm not sure what else I could supply to work this out. I tried running curl in verbose mode to see if there was additional information I could offer, but there isn't anything past the error I showed. – El_Birdo Nov 16 '20 at 20:47

Using GNU Parallel it looks something like this to fetch 100 entries in parallel:

#!/bin/bash
set -eu

user="user"
pwd="pwd"
D1=$(date "+%Y-%m-%d" -d "1 days ago")
D2=$(date "+%Y-%m-%d" -d "1 days ago")
curr=$D2
cheese=$(pwd)

curl -o /dev/null -s -S -L -f -c cookiejar 'https://url/auth/' -d name="$user" -d passwd="$pwd"

fetch_one() {
    client="$1"
    account="$2"
    accountid="$3"
    platform="$4"
    platformid="$5"

    curl -o /dev/null -s -S -f -b cookiejar -c cookiejar 'https://url/auth/' -d account="$accountid"
    curl -sSfL -o "$client€$account@$platform£$curr.xlsx" -J -b cookiejar -c cookiejar "https://url/platform=$platformid&date=$curr"
}
export -f fetch_one
export curr         # fetch_one runs in a child shell spawned by parallel, so curr must be exported

while true; do
    cat account-list.tsv | parallel -j100 --colsep '\t' fetch_one
    [ "$curr" \< "$D1" ] || break
    curr=$( date +%Y-%m-%d --date "$curr +1 day" ) ## used in instances where I need to grab data for past date ranges.
done

exit
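
As a side note, the five columns can also be referenced inline with parallel's {1}..{5} replacement strings instead of exporting a function; a sketch under the same assumptions as the script above:

parallel -j100 --colsep '\t' \
    "curl -o /dev/null -sSf -b cookiejar -c cookiejar 'https://url/auth/' -d account={3} && \
     curl -sSfL -o '{1}€{2}@{4}£$curr.xlsx' -J -b cookiejar -c cookiejar 'https://url/platform={5}&date=$curr'" \
    :::: account-list.tsv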
– Ole Tange
  • Sorry for the slow reply. Didn't see the notification that there was a response. I tested your suggestion but got `parrallel-test.sh: 35: export: Illegal option -f`. I copied and pasted as well as typed the command, but I'm not sure why this error comes through given that the function before it is correct. – El_Birdo Nov 16 '20 at 14:21
  • @El_Birdo are you sure you are including #!/bin/bash in the program? – Ole Tange Nov 17 '20 at 06:59
  • Ah, didn't realise this was a bash-specific thing; I was using `sh -x` when I ran the script. Can confirm there wasn't the error when run as bash. Like with the other answer, I get `curl: (22) The requested URL returned error: 500 Internal Server Error` for every file request, but can't really provide much more detail around why it's doing that. I guess this is a user-specific issue I will need to dig into, but in both respects I would say your answers work otherwise. – El_Birdo Nov 17 '20 at 08:22
  • Managed to find some output from curl that made some sense, I believe. The issue is with the location I am requesting from rather than the script; it seems that location doesn't allow parallel requests. Accepting this answer as I feel more comfortable with the approach and how I can apply it to other things I've written. – El_Birdo Nov 17 '20 at 08:51