
Complete newbie here. I know you probably can't use the variables like that in there, but I have 20 minutes to deliver this, so HELP:

read -r -p "Month?: " month
read -r -p "Year?: " year

URL= "https://gz.blockchair.com/ethereum/blocks/"

wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_$year$month*.tsv.gz"
exit
– nyetrying

3 Answers


There are two issues with your code.

First, you should remove the whitespace that follows the equals sign when you declare your URL variable; with the space there, the shell treats URL= as an empty temporary assignment and then tries to execute the URL itself as a command. So the line becomes

URL="https://gz.blockchair.com/ethereum/blocks/"

Then, you are building your URL using a wildcard, which doesn't work here: wget only performs wildcard matching for FTP URLs, so in an HTTPS URL the * is sent to the server literally and matches nothing. So you cannot do something like month*.tsv.gz as you are doing right now. If you need to perform requests to several URLs, you need to run wget for each one of them, for example with a loop as sketched below.
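A minimal sketch of such a loop, reusing the question's variables and assuming the archives are named by date (blockchair_ethereum_blocks_YYYYMMDD.tsv.gz, as the question's pattern suggests):

URL="https://gz.blockchair.com/ethereum/blocks/"

# Bash 4+ zero-pads brace expansions, so {01..31} yields 01, 02, ..., 31.
# Days that don't exist in the given month will simply get a 404.
for day in {01..31}; do
    wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_${year}${month}${day}.tsv.gz"
done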

– canta2899
  • Thank you for the very timely answer. I have fixed the space issue and attempted to add a third variable for the *day*, which worked and downloads the correct file. My question is: would it be possible to have multiple wget requests (30 for the whole month) whilst using something akin to a two-character wildcard (a stand-in for the XX day format)? – nyetrying Jun 24 '22 at 09:54
  • If you are using GNU date you can take a look at [this](https://unix.stackexchange.com/questions/445355/i-want-while-loop-for-date-2018-03-28-to-2018-04-02-in-unix) thread, in which a user asks how to get all the dates within a certain interval (which, in your case, starts with the first day of the month and ends with the last). Then, for each date obtained, you can compose your URL and run the wget command; see the sketch after these comments. Hope that helps. – canta2899 Jun 24 '22 at 10:02
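A minimal sketch of the approach described in that comment, assuming GNU date is available and that month was entered as two digits (e.g. 06):

read -r -p "Month?: " month
read -r -p "Year?: " year

URL="https://gz.blockchair.com/ethereum/blocks/"

# Start at the first day of the month; stop at the first day of the next one.
# GNU date does the calendar math, so short months are handled correctly.
current="$year-$month-01"
stop=$(date -d "$year-$month-01 +1 month" +%Y-%m-%d)

while [ "$current" != "$stop" ]; do
    wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_$(date -d "$current" +%Y%m%d).tsv.gz"
    current=$(date -d "$current +1 day" +%Y-%m-%d)
done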

It's possible to do what you're trying to do with wget. However, this particular site's robots.txt has a rule disallowing crawling of all files (https://gz.blockchair.com/robots.txt):

User-agent: *
Disallow: /

That means the site's admins don't want you to do this. wget respects robots.txt by default, but it's possible to turn that off with -e robots=off.

For this reason, I won't post a specific, copy/pasteable solution.

Here is a generic example of selecting (and downloading) files matching a glob pattern from a typical HTML index page:

url=https://www.example.com/path/to/index

wget \
--wait 2 --random-wait --limit-rate=20k \
--recursive --no-parent --level 1 \
--no-directories \
-A "file[0-9][0-9]" \
"$url"
  • This would download all files named file followed by a two-digit suffix (file52, etc.) that are linked on the page at $url and whose parent path is also $url (--no-parent).

  • This is a recursive download, recursing one level of links (--level 1). wget allows us to use patterns to accept or reject filenames when recursing (-A and -R for globs, also --accept-regex, --reject-regex).

  • Certain sites may block the default wget user agent string; it can be spoofed with --user-agent, as in the sketch after this list.

  • Note that certain sites may ban your IP (and/or add it to a blacklist) for scraping, especially if you do it repeatedly or don't respect robots.txt.
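For instance, the filename glob above could be swapped for a POSIX regex matched against the full URL, combined with a spoofed user agent (still a generic sketch; the user agent string here is only a placeholder):

wget \
--wait 2 --random-wait --limit-rate=20k \
--recursive --no-parent --level 1 \
--no-directories \
--user-agent "Mozilla/5.0" \
--accept-regex ".*/file[0-9][0-9]$" \
"$url"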

– dan

To download the blocks for every day in a month, you can just change the * symbol in your original script to a variable, say day, and first assign a list of days to a variable days.

Then iterate with for day in $days… and do your wget call inside the loop; a sketch follows.
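A minimal sketch of that loop, reusing the variables from the question (days that don't exist in a shorter month will simply get a 404):

read -r -p "Month?: " month
read -r -p "Year?: " year

URL="https://gz.blockchair.com/ethereum/blocks/"

# seq -w pads with zeros to equal width: 01 02 ... 31
days=$(seq -w 1 31)

# $days is left unquoted on purpose so it splits into one word per day.
for day in $days; do
    wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_$year$month$day.tsv.gz"
done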