Short answer: usually, yes, Wget will crawl all URLs, with some exceptions:
- URLs blocked by `robots.txt`
- URLs on the site deeper than the default crawl depth
- Using an older version of Wget, which may not retrieve all files in certain CSS situations
As for the starting point, Wget simply starts from whatever URL you gave it, in this case `www.xyz.com`. Since most web server software returns the index page when no specific page is requested, Wget receives the index page to start with.
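For example, a minimal starting command might look like this (using `www.xyz.com` purely as a stand-in for your actual site, and assuming plain HTTP):

```sh
# Recursive crawl starting from the URL given; with no path specified,
# most servers return their index page, and Wget follows links from there
# (up to the default depth of 5).
wget -r http://www.xyz.com/
```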
Details
The man page for GNU Wget 1.17.1 says:

> Wget can follow links in HTML, XHTML, and CSS pages ... This is sometimes referred to as "recursive downloading."
But adds:

> While doing that, Wget respects the Robot Exclusion Standard (/robots.txt).
So if `/robots.txt` specifies not to index `/some/secret/page.htm`, this would of course be excluded by default, the same as with any other crawler that respects `robots.txt`.
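If the site is your own (or you otherwise have permission to ignore its exclusion rules), Wget's `robots` setting can turn that check off; this is just a sketch of the knob, not a suggestion to bypass `robots.txt` on sites you don't control:

```sh
# -e passes a .wgetrc-style command; "robots = off" makes Wget
# ignore robots.txt during the recursive crawl.
wget -e robots=off -r http://www.xyz.com/
```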
Also, there is a default depth limit:

> -r
> --recursive
>     Turn on recursive retrieving. The default maximum depth is 5.
So if for some reason there happen to be links deeper than 5 levels, then to meet your original wish to capture all URLs you might want to use the `-l` option, such as `-l 6` to go six levels deep:

> -l depth
> --level=depth
>     Specify recursion maximum depth level depth.
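For instance (again with `www.xyz.com` as the placeholder), raising or removing the limit looks like this; per the manual, `inf` means unlimited depth:

```sh
# Go six levels deep instead of the default five:
wget -r -l 6 http://www.xyz.com/

# Or lift the depth limit entirely:
wget -r -l inf http://www.xyz.com/
```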
Also, note that earlier versions of Wget had trouble with assets found in CSS that were in turn linked via `@import url`, as reported in: wget downloads CSS @import, but ignores files referenced within them. But that report didn't say which version was used, and I haven't tested the latest version yet. My workaround at the time was to manually figure out which assets were missing and write a separate Wget command specifically for those missing assets.
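Roughly, that workaround looked like the commands below; the asset path is made up for illustration, `-p` (`--page-requisites`) and `-k` (`--convert-links`) are the usual options for pulling in and re-linking page assets, and `-x` forces the remote directory structure so the separately fetched file lands where the saved pages expect it:

```sh
# Main recursive crawl, also grabbing page requisites (CSS, images, ...)
# and converting links for local viewing:
wget -r -p -k http://www.xyz.com/

# Separate fetch for an asset the crawl missed (hypothetical path),
# recreating the remote directory layout locally:
wget -x http://www.xyz.com/css/fonts/missing.woff
```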