
Situation:
I want to mirror an old website hosted at https://example.com/website/. The site uses absolute links to http://www.example.com/website/.

Problem:
For whatever reason, wget cannot reach https://www.example.com (the www. subdomain); the connection just times out. No idea why - it works fine in the browser, and curl cannot reach it either.

Possible solutions:

  • Have wget rewrite the links before following them while it's still crawling.
  • Make wget work with the www. subdomain (a workaround sketch follows this list).
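For the second option, a hedged workaround sketch (assumptions: www.example.com serves the same content as example.com, and the timeout is a DNS/routing issue rather than the server itself) is to pin the www. hostname to the address that does answer, then let wget span both hosts:

    # Find the address that works (the IP below is a placeholder):
    dig +short example.com

    # Pin www.example.com to that address:
    echo "93.184.216.34 www.example.com" | sudo tee -a /etc/hosts

    # Mirror, letting wget span both hostnames:
    wget --mirror --convert-links --page-requisites \
         --span-hosts --domains=example.com,www.example.com \
         https://example.com/website/

If the server itself drops connections from this machine, pinning the address will not help, but it at least rules DNS out.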

To maybe make www. work, I already tried setting the user-agent to Firefox: `--header="Accept: text/html" --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0"`, but that did not work.

So I somehow need to rewrite the links on that website while crawling.
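Pure wget cannot rewrite links before following them (see the first comment below), but a crawl-rewrite-refetch loop approximates it. A rough sketch, assuming the www. pages are also reachable under example.com; repeat the last two steps until nothing new is downloaded:

    # Pass 1: mirror everything reachable on the bare domain.
    wget --mirror --page-requisites -P mirror https://example.com/website/

    # Rewrite the unreachable absolute www. links to the host that works.
    find mirror -name '*.html' -exec \
        sed -i 's#http://www\.example\.com/#https://example.com/#g' {} +

    # Collect the rewritten URLs and fetch anything not yet on disk.
    grep -rhoE 'https://example\.com/website/[^"<> ]*' mirror \
        | sort -u > urls.txt
    wget --input-file=urls.txt --no-clobber -P mirror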

user136036
  • Not possible with pure wget. Find out why it times out. – Gerald Schneider Dec 19 '21 at 16:17
  • So are the links to `https:` or `http:` URLs? You are talking about both. – vaizki Dec 19 '21 at 21:43
  • I have no idea how I could find out why www. does not work; wget/curl debug output gives no hint. The links are to `http:`, but that does not really matter since HSTS enforces `https:`. The server works fine with https, also on the www. subdomain. If I run the same wget command from my home PC it downloads everything as expected (in my question I run wget from my server, but it's also not an IP block, because the non-www. stuff works; I usually crawl `https://example.com/site/` without issues). – user136036 Dec 21 '21 at 05:10
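For the unexplained timeout discussed above, a minimal diagnostic sketch (assuming a network-level cause; a broken IPv6 route is a common reason a host works in a browser but times out for wget/curl on a server):

    # Does the failure depend on the address family?
    curl -4 -v --connect-timeout 10 https://www.example.com/
    curl -6 -v --connect-timeout 10 https://www.example.com/

    # Compare what the two hostnames resolve to:
    dig +short A www.example.com; dig +short AAAA www.example.com
    dig +short A example.com;     dig +short AAAA example.com

    # Where do packets stop, and does the TLS handshake complete?
    traceroute www.example.com
    openssl s_client -connect www.example.com:443 \
        -servername www.example.com </dev/null

If `curl -4` succeeds where the default fails, the /etc/hosts workaround sketched earlier (pinned to the IPv4 address) should unblock wget too.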

0 Answers