
I have a website that contains a bunch of absolute links, and I need to move its whole contents up one directory level, so all of the absolute links need to be converted to relative ones.

I know about wget with --convert-links, but it doesn't work in my case. The website is in fact mirrored with wget, but when I re-scrape it with the timestamping option (-N) to pick up updates, --convert-links doesn't work properly. Is there another way to go about it?
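
(A hypothetical sketch of that workflow, with a placeholder hostname: the mirror would be created with something like wget --mirror --convert-links http://www.example.com/ and refreshed by re-running it. --mirror already implies -N, and --convert-links only rewrites files wget actually downloaded in a given run, which would explain why pages skipped as up-to-date keep their absolute links.)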

Also, the website is massive, so redownloading it with another mirroring tool is highly undesirable.

The website is hosted with Apache 2.0, but I do not have access to the server configuration.

BubbleFish

1 Answer


You can do a search and replace across every document in your site to get what you need, perhaps using regex patterns. On Linux/Unix, sed (among other command-line tools) could help.
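
A minimal sketch of that approach, assuming the mirror lives in /path/to/mirror and every absolute link starts with http://www.example.com (both placeholders), using a Perl one-liner in place of sed:

find /path/to/mirror -name '*.htm*' -exec perl -pi.bak -e 's{\Qhttp://www.example.com\E}{}g' {} +

This strips the scheme-and-host prefix, turning http://www.example.com/foo/bar.html into the root-relative /foo/bar.html, and keeps a .bak copy of each file. It is a blind textual replace, though: it will also touch matches outside link attributes, which is why an HTML-aware tool like the one below is safer.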

I'm not sure why you're talking about wget and mirroring tools. Don't you have access to the files?

This tool says it does exactly what you want:

http://www.perlmonks.org/?node_id=56338

Change Absolute to Relative links in HTML files

This utility will recurse through a specified directory, parse all the .htm and .html files, and replace any absolute URLs with relative URLs to a base you define.

You can also specify what types of links to parse: img, src, action, or any others. Please see HTML::Tagset's %linkElements hash, in the module's source, for a precise breakdown of supported tag-types.

This program was good practice for trying out Getopt::Declare, an excellent command-line parser. Please note the parameter specification below the DATA tag.

Disclaimer: Always use the -b switch to force backups, just in case you have non-standard HTML and the HTML::TreeBuilder parser mangles it.

Comments and suggestions for improvement are always welcome and very much appreciated.
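
(The %linkElements hash mentioned above can also be inspected without reading the source; for example, perl -MHTML::Tagset -MData::Dumper -e 'print Dumper($HTML::Tagset::linkElements{img})' prints the attributes HTML::Tagset treats as link-carrying for img tags.)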

In case the link stops working, here is the code:

#!/usr/bin/perl -w

use strict;
use Getopt::Declare;
use Cwd;
use File::Find;
use HTML::TreeBuilder;
use HTML::Tagset; #for %HTML::Tagset::linkElements, consulted below
use URI::URL qw(url);
use File::Copy;
use IO::File;
use Fcntl; #for the O_* flags passed to IO::File->new below

use vars qw($VERSION $BASE_URL $BACKUP $DIRECTORY @WANTED_TYPES);

$VERSION = (qw$Revision: 1.1 $)[1];

#Load the definition and grab the command-line parameters
my $opts = Getopt::Declare->new( do{ local $/; <DATA> } );

#Cycle through this directory, and all those underneath it.
#no_chdir keeps $File::Find::name usable as a path inside wanted().
find({ wanted => \&wanted, no_chdir => 1 }, $DIRECTORY || getcwd);

#Parse each HTML file and make a backup
#of it using File::Copy::copy.
sub wanted {
  return unless $File::Find::name =~ /\.html?$/;

  #Extract Links from the file
  my $h = HTML::TreeBuilder->new;
  $h->parse_file($File::Find::name);

  my $link_elements = $h->extract_links(@WANTED_TYPES);
  return unless @$link_elements;

  #Grab each img src and re-write them so they are relative URL's
  foreach my $link_element (@$link_elements) {
    my $link    = shift @$link_element; #URL value
    my $element = shift @$link_element; #HTML::Element Object

    my $url = url($link)->canonical;
    next unless $url->can('host_port') and
      $BASE_URL->host_port eq $url->host_port;

    #Sigh.. The following is because extract_links() doesn't
    #tell you which attribute $link belongs to, except to say
    #it is the value of an attribute in $element.. somewhere.

    #Given the $link, find out which attribute it was for
    my ($attr) = grep {
      defined $element->attr($_) and $link eq $element->attr($_)
    } @{ $HTML::Tagset::linkElements{$element->tag} };

    #Re-write the attribute in the HTML::Element tree.
    #URI accessors return the *old* value when handed a new one, so
    #$url->path("$BASE_URL") yields the link's original path and the
    #attribute becomes a root-relative URL. $BASE_URL must be
    #stringified (quoted) here.
    $element->attr($attr, $url->path("$BASE_URL"));
  }

  #Make a backup of the file before over-writing it
  copy $File::Find::name => $File::Find::name.'.bak'
    if defined $BACKUP;

  #Save the updated file. O_TRUNC empties it first so no stale
  #bytes remain when the rewritten HTML is shorter than the original.
  my $fh = IO::File->new($File::Find::name, O_WRONLY|O_TRUNC)
    or die "Could not open $File::Find::name: $!";
  $fh->print($h->as_HTML);
  $fh->close;
  $h->delete; #free the parse tree
}

__DATA__
#If there is an error here, you need to have one tab
#between the <$var> and the option description.
 -u <base_url>         Base URL (http://www.yoursite.com) [required]
                      { $BASE_URL = url($base_url)->canonical }
 -b                    Backup changed files
                      { $BACKUP = 1  }
 -d <directory>        Starting Directory to recurse from
                      { $DIRECTORY = $directory }
 -l <links>...         Links to process: img, href, etc [required]
                      { @WANTED_TYPES = @links }
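
For reference, a hypothetical invocation (the script name abs2rel.pl is invented here, and extract_links() expects tag names such as a and img for the -l option):

perl abs2rel.pl -u http://www.example.com -b -d /path/to/mirror -l a img

This recurses through /path/to/mirror, rewrites a and img links pointing at http://www.example.com into root-relative paths, and, thanks to -b, keeps a .bak backup of each file it touches.
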
Ryan Babchishin