
I need to download thousands of files with curl. I know how to parallelize with xargs -Pn (or GNU parallel), but I've just discovered that curl itself can parallelize downloads with the -Z|--parallel option introduced in curl 7.66 (see curl-goez-parallel), which might be cleaner or easier to share. I need to use the -o|--output option and --create-dirs. URLs need to be percent-encoded, and the URL path becomes the folder path, which also needs to be escaped since paths can contain single quotes, spaces, and the usual suspects (hence the -O option is not safe and the -OJ option doesn't help). If I understand correctly, the curl command should be built like so:

curl -Z -o path/to/file1 http://site/path/to/file1 -o path/to/file2 http://site/path/to/file2 [-o path/to/file3 http://site/path/to/file3, etc.]

This does work, but what's the best way to deal with thousands of URLs? Can a config file used with -K|--config be useful? What if each -o path/to/file_x http://site/path/to/file_x pair is the output of another program? I haven't found any way to record commands in a file, one command per line, say.


1 Answer


I ended up asking on curl user mailing list.

The solution that worked for me was to programmatically build a plain-text config file like so:

config.txt:

url=http://site/path/to/file1
output="./path/to/file1"
url=http://site/path/to/file2
output="./path/to/file2"
...
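How the file gets generated depends on where the URL list comes from. As a minimal sketch (not part of the original answer), assuming a hypothetical urls.txt with one already percent-encoded URL per line and no double quotes in the paths (a literal " would need to be written as \" inside the config file's quoting), a shell loop along these lines would do:

while IFS= read -r url; do
  path=${url#*://*/}                              # drop scheme and host, keep the URL path
  printf 'url=%s\noutput="./%s"\n' "$url" "$path"
done < urls.txt > config.txt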

Then use this curl command:

$ curl --parallel --parallel-immediate --parallel-max 60 --config config.txt

The command I actually use is a bit richer than this and requires a recent curl version (7.81+):

$ curl --parallel --parallel-immediate --parallel-max 60 --config config.txt \
  --fail-with-body --retry 5 --create-dirs \
  --write-out "code %{response_code} url %{url} type %{content_type}\n"

For options like -K/--config, see https://curl.se/docs/manpage.html

HTH

    Also refer to [run-multiple-curl-commands-in-parallel](https://stackoverflow.com/questions/46362284/run-multiple-curl-commands-in-parallel#answer-69644245) – jgran May 05 '22 at 10:02
  • for me it writes the full response body into each output file... any ideas how to make it write only response_code (using --write-out option) into each url result file? – Andrew May 30 '23 at 12:42
  • Any way to add (different) request headers to each url in config file? – Andrew Jul 20 '23 at 10:19
  • @Andrew I had not seen your comments. For your 1st comment, have you solved it with write-out variables [https://everything.curl.dev/usingcurl/verbose/writeout](https://everything.curl.dev/usingcurl/verbose/writeout)? E.g. `curl -w "Type: %{content_type}\nCode: %{response_code}\n" http://example.com`. Available variables are listed here: [text-available-write-out-variables](https://everything.curl.dev/usingcurl/verbose/writeout#available-write-out-variables) – jgran Jul 21 '23 at 09:56
  • @Andrew, re your comment "Any way to add (different) request headers to each url in config file?": I asked on the curl-users mailing list ([response](https://curl.se/mail/archive-2023-07/0020.html)). Use `next` between URLs, e.g. `header = "Accept: to the first URL"\n url = http://first.example\n next\n header = "Accept: to the second URL"\n url = http://second.example` (laid out as a config file below) – jgran Jul 24 '23 at 13:49
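
Laid out as an actual config file, the example from that last comment reads (example hostnames only):

header = "Accept: to the first URL"
url = http://first.example
next
header = "Accept: to the second URL"
url = http://second.example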