34

I want to download some pages from a website, and I did it successfully using curl, but I was wondering whether curl can download multiple pages at a time, the way most download managers do; that would speed things up a bit. Is it possible to do this with the curl command-line utility?

The current command I am using is

curl 'http://www...../?page=[1-10]' 2>&1 > 1.html

Here I am downloading pages from 1 to 10 and storing them in a file named 1.html.

Also, is it possible for curl to write the output of each URL to a separate file, say URL.html, where URL is the actual URL of the page being processed?

Ravi Gupta
  • Pre-request to find out the content-length, use `--range` to split the single download into multiple downloads, run multi-process curl, maintain the order of the chunks and join them as soon as you've got an orderly sequence; it is what most developers are doing (for example: [htcat project](https://github.com/eladkarako/htcat)) – Dec 02 '15 at 02:20
  • How do you know how many pages to download? Are you just arbitrarily selecting 1 to 10? – ghoti Nov 20 '22 at 17:33
  • Related question: https://stackoverflow.com/q/9865866/1072112 ... though it is geared towards file downloads, the explanation of curl usage in the selected answer may be useful. – ghoti Nov 20 '22 at 17:33

11 Answers

56

My answer is a bit late, but I believe all of the existing answers fall just a little short. The way I do things like this is with xargs, which is capable of running a specified number of commands in subprocesses.

The one-liner I would use is, simply:

$ seq 1 10 | xargs -n1 -P2 bash -c 'i=$0; url="http://example.com/?page${i}.html"; curl -O -s $url'

This warrants some explanation. The use of -n 1 instructs xargs to process a single input argument at a time. In this example, the numbers 1 ... 10 are each processed separately. And -P 2 tells xargs to keep 2 subprocesses running all the time, each one handling a single argument, until all of the input arguments have been processed.

You can think of this as MapReduce in the shell, or perhaps just the Map phase. Regardless, it's an effective way to get a lot of work done while ensuring that you don't fork-bomb your machine. It's possible to do something similar with a for loop in a shell, but you end up doing the process management yourself, which starts to seem pretty pointless once you realize how insanely great this use of xargs is.
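
If you also want each page stored in its own file, as asked in the question, a small variation of the same one-liner works. This is just a sketch, assuming the same hypothetical http://example.com URL pattern:

$ seq 1 10 | xargs -n1 -P2 bash -c 'curl -s -o "page$0.html" "http://example.com/?page$0.html"'

Inside the bash -c script, $0 is the page number passed in by xargs, so page 1 lands in page1.html, page 2 in page2.html, and so on.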

Update: I suspect that my example with xargs could be improved (at least on Mac OS X and BSD with the -J flag). With GNU Parallel, the command is a bit less unwieldy as well:

parallel --jobs 2 curl -O -s http://example.com/?page{}.html ::: {1..10}
ndronen
  • Also note that if you have a full-featured version of xargs, you can simply do the following: `seq 1 10 | xargs -I{} -P2 -- curl -O -s 'http://example.com/?page{}.html'` – Six Jul 27 '15 at 06:48
  • plus one because the use of xargs is **brilliant**. – Zibri Aug 07 '19 at 11:48
31

Well, curl is just a simple UNIX process. You can have as many of these curl processes running in parallel and sending their outputs to different files.

curl can use the filename part of the URL to generate the local file. Just use the -O option (man curl for details).

You could use something like the following

urls="http://example.com/?page1.html http://example.com/?page2.html" # add more URLs here

for url in $urls; do
   # run the curl job in the background so we can start another job
   # and disable the progress bar (-s)
   echo "fetching $url"
   curl $url -O -s &
done
wait #wait for all background jobs to terminate
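
If you are worried about starting too many curl processes at once (see the comments below), a crude but portable refinement is to launch them in batches of a fixed size; this is only a sketch, reusing the $urls variable from above:

max=4
count=0
for url in $urls; do
   curl "$url" -O -s &
   count=$((count + 1))
   # after starting $max background jobs, wait for the whole batch to finish
   if [ "$count" -ge "$max" ]; then
       wait
       count=0
   fi
done
wait # wait for the final, possibly smaller, batch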
Kenster
nimrodm
  • Let's say I have to download 100 pages... your script will start 100 curl instances simultaneously (it might choke the network)... can we do something like: at any given point in time, only X instances of `curl` are running, and as soon as one of them finishes its job, the script launches another instance... some sort of `Job Scheduling`?? – Ravi Gupta Dec 27 '11 at 10:14
  • Ravi.. this gets more difficult. You need a job queue served by multiple processes. One simple solution would be to send all jobs to the UNIX `batch` command (try `man batch`). It executes jobs when the system load is below a certain threshold. So most jobs would be queued and only a few will be running at a time. – nimrodm Dec 27 '11 at 19:02
  • @EladKarako: note the `&` at the end of the `curl` command. This runs the `curl` job in the background without blocking the main process. So yes, this is definitely parallel (although I still prefer the `make` trick or even parallel `xargs`) – nimrodm Dec 02 '15 at 03:51
  • GNU [`parallel`](https://manpages.debian.org/jessie/moreutils/parallel.1.en.html) can limit the number of "jobslots". ([examples](https://gist.github.com/CMCDragonkai/5914e02df62137e47f32)) – joeytwiddle May 21 '17 at 06:43
24

As of 7.66.0, the curl utility finally has built-in support for parallel downloads of multiple URLs within a single non-blocking process, which in most cases should be much faster and more resource-efficient than xargs and background spawning:

curl -Z 'http://httpbin.org/anything/[1-9].{txt,html}' -o '#1.#2'

This will download 18 links in parallel and write them out to 18 different files, also in parallel. The official announcement of this feature from Daniel Stenberg is here: https://daniel.haxx.se/blog/2019/07/22/curl-goez-parallel/
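
To cap the number of simultaneous transfers, add --parallel-max; for example, to keep at most 4 transfers in flight at a time (a sketch using the same httpbin.org URL pattern):

curl -Z --parallel-max 4 'http://httpbin.org/anything/[1-9].{txt,html}' -o '#1.#2'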

  • To limit the number of concurrent downloads, one can use the `--parallel-max [num]` flag. – toaruScar Jul 31 '20 at 03:30
  • How do you provide a list of URLs from a file and write only the HTTP status code to a list of files (or one file)? – Andrew May 30 '23 at 14:56
8

curl and wget cannot download a single file in parallel chunks, but there are alternatives:

  • aria2 (written in C++, available in Deb and Cygwin repos)

    aria2c -x 5 <url>
    
  • axel (written in C, available in Deb repo)

    axel -n 5 <url>
    
  • wget2 (written in C, available in Deb repo)

    wget2 --max-threads=5 <url>
    
  • lftp (written in C++, available in Deb repo)

    lftp -e 'pget -n 5 <url>; exit'
    
  • hget (written in Go)

    hget -n 5 <url>
    
  • pget (written in Go)

    pget -p 5 <url>
    
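If, as in the original question, you have many separate pages rather than one large file, aria2 can also fetch a whole list of URLs concurrently; a sketch assuming a urls.txt file with one URL per line:

aria2c -i urls.txt -j 5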
rustyx
7

Starting from 7.68.0, curl can fetch several URLs in parallel. This example fetches the URLs from the urls.txt file using 3 parallel connections:

curl --parallel --parallel-immediate --parallel-max 3 --config urls.txt

urls.txt:

url = "example1.com"
output = "example1.html"
url = "example2.com"
output = "example2.html"
url = "example3.com"
output = "example3.html"
url = "example4.com"
output = "example4.html"
url = "example5.com"
output = "example5.html"
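
For a numbered page range like the one in the question, the config file itself can be generated with a short loop; this is just a sketch, with example.com standing in for the real site:

for i in $(seq 1 10); do
    printf 'url = "http://example.com/?page=%s"\noutput = "%s.html"\n' "$i" "$i"
done > urls.txt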
Sergey Geron
6

Curl can also accelerate a download of a file by splitting it into parts:

$ man curl |grep -A2 '\--range'
       -r/--range <range>
              (HTTP/FTP/SFTP/FILE)  Retrieve a byte range (i.e a partial docu-
              ment) from a HTTP/1.1, FTP or  SFTP  server  or  a  local  FILE.

Here is a script that will automatically launch curl with the desired number of concurrent processes: https://github.com/axelabs/splitcurl
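
For the curious, the core idea behind such a script looks roughly like the following hand-rolled sketch; the URL, the part count, and the output name are placeholders, and the server must support range requests:

url='http://example.com/big.file'
parts=4
# ask the server for the total size of the file
size=$(curl -sI "$url" | awk 'tolower($1)=="content-length:" {print $2+0}')
chunk=$(( (size + parts - 1) / parts ))
for i in $(seq 0 $((parts - 1))); do
    start=$(( i * chunk ))
    end=$(( start + chunk - 1 ))
    [ "$end" -ge "$size" ] && end=$(( size - 1 ))
    curl -s -r "$start-$end" -o "part.$i" "$url" &   # one byte range per process
done
wait
# join the chunks in order, then clean up
for i in $(seq 0 $((parts - 1))); do cat "part.$i"; done > big.file && rm part.*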

Andrey
AXE Labs
6

For launching parallel commands, why not use the venerable make command-line utility? It supports parallel execution, dependency tracking, and whatnot.

How? In the directory where you are downloading the files, create a new file called Makefile with the following contents:

# which page numbers to fetch
numbers := $(shell seq 1 10)

# default target which depends on files 1.html .. 10.html
# (patsubst replaces % with %.html for each number)
all: $(patsubst %,%.html,$(numbers))

# the rule which tells how to generate a %.html dependency
# $@ is the target filename e.g. 1.html
%.html:
        curl -C - 'http://www...../?page='$(patsubst %.html,%,$@) -o $@.tmp
        mv $@.tmp $@

NOTE The last two lines should start with a TAB character (instead of 8 spaces) or make will not accept the file.

Now you just run:

make -k -j 5

The curl command I used will store the output in 1.html.tmp and only if the curl command succeeds then it will be renamed to 1.html (by the mv command on the next line). Thus if some download should fail, you can just re-run the same make command and it will resume/retry downloading the files that failed to download during the first time. Once all files have been successfully downloaded, make will report that there is nothing more to be done, so there is no harm in running it one extra time to be "safe".

(The -k switch tells make to keep downloading the rest of the files even if one single download should fail.)

Jonas Berlin
  • "-j 5" tells make to run at most 5 curl commands in parallel. – Jonas Berlin Oct 06 '13 at 20:40
  • Really the best solution since it allows resuming failed downloads and uses 'make' which is both robust and available on any unix system. – nimrodm May 06 '15 at 18:20
  • This is a great answer. Thoroughly explained and shows some nice features of make in the process – Matt Greer Feb 16 '19 at 02:58
  • The only issue with using this method is that I cannot really remember the `$(patsubst %,%.html,$(numbers))` part. This is way harder than tar. – Mayli Dec 11 '20 at 21:22
2

Running a limited number of processes is easy if your system has commands like pidof or pgrep which, given a process name, return the PIDs (counting the PIDs tells you how many are running).

Something like this:

#!/bin/sh
max=4
running_curl() {
    set -- $(pidof curl)
    echo $#
}
while [ $# -gt 0 ]; do
    while [ $(running_curl) -ge $max ] ; do
        sleep 1
    done
    curl "$1" --create-dirs -o "${1##*://}" &
    shift
done

Call it like this:

script.sh $(for i in `seq 1 10`; do printf "http://example/%s.html " "$i"; done)

The curl line of the script is untested.
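
With a reasonably recent bash (4.3 or newer) the polling can be avoided entirely by using wait -n, which blocks until any background job finishes; a sketch of the same idea:

#!/bin/bash
max=4
for url in "$@"; do
    # if $max curl jobs are already running, wait for one of them to finish
    while [ "$(jobs -rp | wc -l)" -ge "$max" ]; do
        wait -n
    done
    curl "$url" --create-dirs -o "${url##*://}" &
done
wait # wait for the remaining jobs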

Alex
2

I came up with a solution based on fmt and xargs. The idea is to specify multiple URLs inside braces, http://example.com/page{1,2,3}.html, and run them in parallel with xargs. The following would start downloading in 3 processes:

seq 1 50 | fmt -w40 | tr ' ' ',' \
| awk -v url="http://example.com/" '{print url "page{" $1 "}.html"}' \
| xargs -P3 -n1 curl -O

Four curl command lines like these are generated and sent to xargs:

curl -O http://example.com/page{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}.html
curl -O http://example.com/page{17,18,19,20,21,22,23,24,25,26,27,28,29}.html
curl -O http://example.com/page{30,31,32,33,34,35,36,37,38,39,40,41,42}.html
curl -O http://example.com/page{43,44,45,46,47,48,49,50}.html
Slava Ignatyev
0

Bash 3 or above lets you populate an array with multiple values as it expands sequence expressions:

$ urls=( "" http://example.com?page={1..4} )
$ unset urls[0]

Note the empty value at index [0], which was provided as a placeholder so that the indices line up with the page numbers, since bash arrays start numbering at zero. This strategy obviously might not always work; in this example you can simply unset it, as shown above.

Now you have an array, and you can verify the contents with declare -p:

$ declare -p urls
declare -a urls=([1]="http://example.com?page=1" [2]="http://example.com?page=2" [3]="http://example.com?page=3" [4]="http://example.com?page=4")

Now that you have a list of URLs in an array, expand the array into a curl command line:

$ curl $(for i in ${!urls[@]}; do echo "-o $i.html ${urls[$i]}"; done)

The curl command can take multiple URLs and fetch all of them, recycling the existing connection (HTTP/1.1) to a common server, but it needs the -o option before each one in order to download and save each target. Note that characters within some URLs may need to be escaped to avoid interacting with your shell.
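
If the URLs do contain characters that are special to the shell, a safer variant is to build the argument list in a bash array instead of relying on word splitting of a command substitution; a sketch reusing the urls array from above:

args=()
for i in "${!urls[@]}"; do
    # one -o FILE before each URL, exactly as curl expects
    args+=( -o "$i.html" "${urls[$i]}" )
done
curl "${args[@]}"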

ghoti
-5

I am not sure about curl, but you can do that using wget.

wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains website.org \
     --no-parent \
         www.website.org/tutorials/html/
zengr