
wget is used to mirror sites, but I wanted to know how the utility discovers all the URLs for a domain.

wget -r www.xyz.com

How does wget find all the URLs of the domain xyz.com? Does it visit the index page, parse it, and extract links like a crawler?

Tushar Poddar

2 Answers


Short answer: usually, yes, Wget will crawl all URLs, with some exceptions:

  • URLs blocked by robots.txt
  • URLs that sit deeper than the default crawl depth
  • in older versions of Wget, files referenced only from CSS in certain situations
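Assuming a reasonably current GNU Wget, the first two exceptions can be lifted with extra flags (the domain is the question's placeholder):

```shell
# Mirror the site with no depth limit and without honoring robots.txt.
# Note: -e robots=off overrides the Robot Exclusion Standard, so use it
# only on sites you are allowed to crawl exhaustively.
wget -r -l inf -e robots=off www.xyz.com
```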

As for the starting point, Wget simply starts from whatever URL you give it, in this case www.xyz.com. Since most web server software returns the index page when no specific page is requested, Wget receives the index page to start with.
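One way to watch this handoff (a sketch, not part of the original answer): --spider makes Wget perform the HTTP exchange and report the response without saving anything, so you can see the bare domain URL resolve to the server's index document:

```shell
# Request the bare domain; the server answers with its index page.
# --spider discards the body instead of writing it to disk.
wget --spider www.xyz.com
```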

Details

The man page for GNU Wget 1.17.1 says:

Wget can follow links in HTML, XHTML, and CSS pages ... This is sometimes referred to as "recursive downloading."

But it adds:

While doing that, Wget respects the Robot Exclusion Standard (/robots.txt).

So if /robots.txt specifies that /some/secret/page.htm should not be indexed, it will of course be excluded by default, just as with any other crawler that respects robots.txt.

There is also a default depth limit:

-r

--recursive

Turn on recursive retrieving. The default maximum depth is 5.

So if for some reason there happen to be links deeper than five levels, then to meet your original wish to capture all URLs you may want to raise the limit with the -l option, e.g. -l 6 to go six levels deep:

-l depth

--level=depth

Specify recursion maximum depth level depth.

Also, note that earlier versions of Wget had trouble with assets referenced from CSS via @import url, as reported in: wget downloads CSS @import, but ignores files referenced within them. The report does not say which version was used, and I have not tested the latest version yet. My workaround at the time was to figure out manually which assets were missing and to write a separate Wget command specifically for those missing assets.
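For example, such a separate command can name a missed asset directly; the path below is a hypothetical illustration, not one taken from the linked report:

```shell
# Fetch one asset that the recursive pass missed, recreating the same
# directory layout on disk (-x / --force-directories).
wget -x www.xyz.com/css/fonts.css
```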

clarity123

Yes. I found out that what wget does is parse the page at the given URL and then recursively download everything the embedded links point to.
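That crawl step can be sketched crudely in shell; wget's real HTML parser is far more robust than this grep, and page.html and the regex are illustrative assumptions:

```shell
# Fetch the start page, then pull out href/src targets -- the set of
# links a crawler would enqueue next.
wget -q -O page.html www.xyz.com
grep -oE '(href|src)="[^"]+"' page.html | cut -d'"' -f2
```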

Tushar Poddar