It's possible to do what you're trying to do with wget; however, this particular site's robots.txt has a rule disallowing crawling of all files (https://gz.blockchair.com/robots.txt):
User-agent: *
Disallow: /
That means the site's admins don't want you to do this. wget respects robots.txt by default, but it's possible to turn that off with -e robots=off.
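To check this for yourself on any site, you can fetch and read its robots.txt before crawling; a minimal sketch (the example.com URL is just a placeholder):
# print a site's robots.txt to stdout, quietly
wget -qO- https://www.example.com/robots.txt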
For this reason, I won't post a specific, copy/pasteable solution.
Here is a generic example of selecting (and downloading) files using a glob pattern from a typical HTML index page:
url=https://www.example.com/path/to/index
wget \
--wait 2 --random-wait --limit-rate=20k \
--recursive --no-parent --level 1 \
--no-directories \
-A "file[0-9][0-9]" \
"$url"
This would download all files named file with a two-digit suffix (file52, etc.) that are linked on the page at $url, and whose parent path is also $url (--no-parent).
This is a recursive download, recursing one level of links (--level 1). wget allows us to use patterns to accept or reject filenames when recursing (-A and -R for globs; also --accept-regex and --reject-regex).
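If a glob isn't expressive enough, the same selection can be made with a regex matched against the full URL; a sketch of the equivalent command, reusing the hypothetical $url and filename pattern from above:
wget \
--wait 2 --random-wait --limit-rate=20k \
--recursive --no-parent --level 1 \
--no-directories \
--accept-regex 'file[0-9][0-9]$' \
"$url"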
Certain sites may block the default wget user agent string; it can be spoofed with --user-agent.
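For example (the user agent string below is just an illustrative browser-like value):
wget --user-agent="Mozilla/5.0 (X11; Linux x86_64)" "$url"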
Note that certain sites may ban your IP (and/or add it to a blacklist) for scraping, especially if you do it repeatedly or don't respect robots.txt.