
I have the following command to copy a website. When it tried to reach sun.com, the connection timed out.

I would like wget to exclude sun.com so that it can proceed to the rest of the crawl.

Existing issue

$ wget --recursive --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows http://pt.jikos.cz/garfield/
.
.
2021-08-09 03:28:28 (19.1 MB/s) - ‘packages.debian.org/robots.txt’ saved [24/24]

2021-08-09 03:28:30 (19.1 MB/s) - ‘packages.debian.org/robots.txt’ saved [24/24]
.


Location: https://packages.debian.org/robots.txt [following]
--2021-08-09 03:28:33--  https://packages.debian.org/robots.txt
Connecting to packages.debian.org (packages.debian.org)|128.0.10.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24 [text/plain]
Saving to: ‘packages.debian.org/robots.txt’

packages.debian.org 100%[===================>]      24  --.-KB/s    in 0s

2021-08-09 03:28:34 (19.1 MB/s) - ‘packages.debian.org/robots.txt’ saved [24/24]

Loading robots.txt; please ignore errors.
--2021-08-09 03:28:34--  http://wwws.sun.com/robots.txt
Resolving wwws.sun.com (wwws.sun.com)... 137.254.16.75
Connecting to wwws.sun.com (wwws.sun.com)|137.254.16.75|:80... failed: Connection timed out.
Retrying.

--2021-08-09 03:28:56--  (try: 2)  http://wwws.sun.com/robots.txt
Connecting to wwws.sun.com (wwws.sun.com)|137.254.16.75|:80... failed: Connection timed out.
Retrying.

--2021-08-09 03:29:19--  (try: 3)  http://wwws.sun.com/robots.txt
Connecting to wwws.sun.com (wwws.sun.com)|137.254.16.75|:80... failed: Connection timed out.
Retrying.

--2021-08-09 03:29:43--  (try: 4)  http://wwws.sun.com/robots.txt
Connecting to wwws.sun.com (wwws.sun.com)|137.254.16.75|:80... failed: Connection timed out.
Retrying.

--2021-08-09 03:30:08--  (try: 5)  http://wwws.sun.com/robots.txt
Connecting to wwws.sun.com (wwws.sun.com)|137.254.16.75|:80... failed: Connection timed out.
Retrying.

--2021-08-09 03:30:34--  (try: 6)  http://wwws.sun.com/robots.txt
Connecting to wwws.sun.com (wwws.sun.com)|137.254.16.75|:80... failed: Connection timed out.
Retrying.

--2021-08-09 03:31:01--  (try: 7)  http://wwws.sun.com/robots.txt
Connecting to wwws.sun.com (wwws.sun.com)|137.254.16.75|:80... failed: Connection timed out.
Retrying.

I expected wget to save the whole website; if a connection times out, wget should skip that host and move on to the next one.

1 Answer


Please read the fine manual about the "risks" of using the --span-hosts (-H) option and how to limit them by adding restrictions:
https://www.gnu.org/software/wget/manual/wget.html#Spanning-Hosts

The --span-hosts or -H option turns on host spanning, thus allowing Wget’s recursive run to visit any host referenced by a link. Unless sufficient recursion-limiting criteria are applied, these foreign hosts will typically link to yet more hosts, and so on until Wget ends up sucking up much more data than you have intended.

...

Limit spanning to certain domains -D
The -D option allows you to specify the domains that will be followed, thus limiting the recursion only to the hosts that belong to these domains.

...

Keep download off certain domains --exclude-domains
If there are domains you want to exclude specifically, you can do it with --exclude-domains, which accepts the same type of arguments of -D, but will exclude all the listed domains.
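For example, a minimal sketch based on the command in the question (untested): keep the crawl as it is, but add --exclude-domains so wget never contacts sun.com. The exclusion list is matched as a domain suffix, so sun.com should also cover hosts such as wwws.sun.com.

$ wget --recursive --page-requisites --adjust-extension --span-hosts \
      --exclude-domains sun.com \
      --convert-links --restrict-file-names=windows \
      http://pt.jikos.cz/garfield/

Alternatively, whitelist only the hosts you actually want with -D, e.g. --domains jikos.cz (assuming the Garfield mirror and its requisites live only under that domain); anything outside the list is then skipped instead of retried until timeout.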
