
I need to mirror a website and deploy the copy under a different domain name. The mirroring procedure should be all automatic, so that I can update the copy on a regular basis with cron.
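
To schedule it I plan to use a plain crontab entry, roughly like the following (the script path and the schedule are just placeholders):

0 3 * * * /usr/local/bin/mirror-site.sh >/var/log/mirror-site.log 2>&1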

The mirror MUST NOT be a real mirror, but it MUST be a static copy, i.e. a snapshot of the site at a specific point in time, so I think wget might fit.

As of now, I've come up with the following script to get a copy of the original site:

#!/bin/bash

DOMAIN="example.com"

# download into a throwaway directory so the live copy stays untouched meanwhile
cd /srv/mirrors
TMPDIR=$(mktemp -p . -d)
cd "${TMPDIR}"

# mirror the site, fetch page requisites and convert links for local browsing
wget -m -p -E --tries=10 --convert-links --retry-connrefused "${DOMAIN}"

# swap the fresh copy into place, keeping the previous one around as oldcopy
cd ..
rm -rf oldcopy
mv "${DOMAIN}" oldcopy
mv "${TMPDIR}/${DOMAIN}" "${DOMAIN}"
rmdir "${TMPDIR}"

The resulting copy is then served by Nginx under the new domain name, with a simple configuration for a local static site, and it seems to work.
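
For reference, the Nginx side is roughly something like this (server name and paths are placeholders; the $uri.html fallback is there because wget -E saves pages with an .html extension):

server {
    listen 80;
    server_name mirror.example.org;

    root /srv/mirrors/example.com;
    index index.html;

    location / {
        # wget -E stores pages as .html files, so fall back to that extension
        try_files $uri $uri/ $uri.html =404;
    }
}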

The problem is that the origin server produces web pages with absolute links in them, even when the links point to internal resources. For example, a page at https://example.com/page1 contains

<link rel="stylesheet" href="https://example.com/style.css">
<script src="https://example.com/ui.js"/>

and so on (it's WordPress). There is no way I can change that behavior. wget then does not convert those links for local browsing, because they are absolute (or, at least, I think that's the cause).
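
For instance, after a mirror run the saved pages still contain those absolute URLs verbatim; a quick check along these lines (the file name is just an example) shows them untouched:

grep -o 'https://example\.com[^"]*' example.com/page1.html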

EDIT: the real domain name is assodigitale.it, though I need a script that works regardless of the particular domain, because I will need it for a few other domains too.

Can I make wget convert those links to the new domain name?

Lucio Crusca
  • `wget -k` should convert the links to any pages that you have downloaded into relative links. Why does it not work? Can you provide an example? – darnir Mar 23 '18 at 07:37
  • @darnir as you can see I'm already using `-k`, which is the same as `--convert-links`. The problem is that it is not converting absolute links, I suppose because they are absolute. – Lucio Crusca Mar 23 '18 at 11:20

2 Answers


There is another solution to your problem.

Instead of making wget convert those links to the new domain name, you can make your webserver rewrite links on the fly.

With Apache, you can use mod_sed to rewrite the links, e.g.:

AddOutputFilter Sed html
OutputSed "s/example.com/newdomain.com/g"

https://httpd.apache.org/docs/trunk/mod/mod_sed.html
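
Nginx has a comparable on-the-fly substitution in ngx_http_sub_module (only available when Nginx is built with that module); a minimal sketch, with the domains as placeholders:

location / {
    # rewrite every occurrence of the old domain in the HTML responses
    sub_filter 'https://example.com' 'https://newdomain.com';
    sub_filter_once off;
}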

bgtvfr
  • Thanks but that solution would completely defeat the purpose: I need snapshots to serve the pages as static files at maximum speed. Besides, I need to serve the pages with Nginx and I can't switch to Apache. – Lucio Crusca Mar 29 '18 at 08:20

Could this be a mixed content issue or otherwise related to using both HTTP & HTTPS protocols?

It might be that you are doing the mirroring over plain HTTP

DOMAIN="example.com"
wget -m -p -E --tries=10 --convert-links --retry-connrefused "${DOMAIN}"

while the mentioned URLs to be converted are absolute HTTPS URLs:

<link rel="stylesheet" href="https://example.com/style.css">
<script src="https://example.com/ui.js"/>

The link conversion is the last phase of your command and it should show you lines giving detailed information on the conversion process. This is just an example from mirroring one page using your command:

Downloaded: 177 files, 12M in 0.2s (51.0 MB/s)
Converting links in example.com/index.html... 45-2
...
Converted links in 15 files in 0.008 seconds.

Only at the end does wget know what has been downloaded, and it converts all the links it knows about (from this download history) into relative paths pointing to the existing files. It's possible that, while wget is able to retrieve content using HTTP, it fails with HTTPS.

Try this:

DOMAIN="example.com"
wget -m -p -E --tries=10 --convert-links --retry-connrefused https://"${DOMAIN}"

It might either work or give you an error that helps you solve the actual problem.

Esa Jokinen