I'm currently working on a Perl script that uses the CPAN module WWW::Mechanize to fetch HTML pages from websites. However, I would also like to be able to work on offline HTML files (most likely ones I have saved myself beforehand) so that I don't need an internet connection each time I try a new script. So basically my question is: how can I transform this:

$mech->get( 'http://www.websiteadress.html' );

into this:

$mech->get( 'C:\User\myfile.html' );

I've seen that file:// could be useful but I obviously don't know how to use it as I get errors every time.

Azaghal
  • Are you sure you want to use `WWW::Mechanize` on a local file? There is little point in using the `LWP` suite when you can just `open` the file, and much of the purpose of the module is handling clicks on links, form filling and submission, and emulating the back and forward buttons on your browser. None of these are possible with a static file, so you are left with just analysis of the page, for which you need just [HTML::TreeBuilder](https://metacpan.org/pod/HTML::TreeBuilder) which `WWW::Mechanize` subclasses. – Borodin Aug 04 '16 at 22:00
  • As I said the purpose of my script is to work on online pages, local files are just an alternative, mainly for testing coding errors (And I really wanted to know why it didn't work ! ). Thanks for pointing out another way to do it though. – Azaghal Aug 05 '16 at 07:34

1 Answer

The get() method from WWW::Mechanize takes a URL as its argument. So you just need to work out what the correct URL is for your local file. You're on the right lines with the "file://" scheme.

I think you will need:

$mech->get( 'file:///C:/User/myfile.html' );

Note two important things that people often get wrong.

  1. URLs only understand forward slashes (/), so you need to convert Windows' warped backslash (\) monstrosities. Update: As Borodin points out in a comment, this isn't true - you can use backslashes in URLs. However, backslashes often have special meanings in Perl strings, so I'd advise using forward slashes whenever possible.
  2. The scheme is file, which is followed by :// (a colon and two slashes), then the hostname (which is an empty string, meaning the local machine), then a slash (/), and then your local path (C:/). That means there are three slashes after file:. That seems wrong, so people often omit one of them. Update: description made more accurate following advice from Borodin in a comment.
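
The two points above can be sketched in a few lines of core Perl. This is only an illustration, not part of WWW::Mechanize: the helper name `local_path_to_file_url` is hypothetical, and it assumes an absolute path containing no characters that need percent-encoding (spaces, for example). For anything beyond a quick test, the `URI::file` module from the URI distribution is probably the more robust way to build such URLs.

```perl
use strict;
use warnings;

# Point 1: backslashes are only safe in single-quoted Perl strings.
my $single = 'C:\temp\file.html';    # backslashes are kept literally
my $double = "C:\temp\file.html";    # \t and \f become a tab and a form feed
print "Quoting changes the string\n" if $single ne $double;

# Point 2: scheme "file", then "://", an empty host, "/" and the path.
# Hypothetical helper (an assumption for this sketch, not a library call).
sub local_path_to_file_url {
    my ($path) = @_;
    (my $url_path = $path) =~ tr{\\}{/};    # backslashes to forward slashes
    $url_path =~ s{^/+}{};                  # avoid doubling a leading slash
    return "file:///$url_path";
}

my $url = local_path_to_file_url('C:\User\myfile.html');
print "$url\n";    # prints file:///C:/User/myfile.html

# The result can then be handed to Mechanize as usual:
#   $mech->get($url);
```

Keeping the conversion in one small function means the rest of the script can call `$mech->get` the same way whether the page is online or saved locally.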

Wikipedia (as always) has a lot more information: see its article on the file URI scheme.

Dave Cross
  • *"URLs only understand forward slashes"* The contents of a `file:` URI are platform-defined. `file:///C:\Temp\t.txt` works just fine. *"Windows' warped backslash (\\) monstrosities"* This isn't the place for tribalism. Please just answer the question. – Borodin Aug 04 '16 at 22:14
  • *"The scheme is file://"* Not quite. The scheme is `file`. Within a URI it must be followed by a colon and two slashes, then the *host* (in this case it's an empty string, indicating the local machine), another slash and the path. – Borodin Aug 04 '16 at 22:15