I am trying to retrieve a page that uses JavaScript and a database to load. Loading takes about 2 to 3 minutes. I am able to get the page that shows "Please wait 2 to 3 mins for the page to be loaded.", but I am not able to retrieve the page after it has finished loading.

I have already tried the following:

1.) Using the mirror method in Mechanize. But the response content is not decoded, so the saved file is gibberish. (I also tried writing a method similar to mirror that decodes the response content, but that doesn't work either: the new content is not loaded.)

2.) Adding an 'If-Modified-Since' request header. But the time stays the same and the new content is still not fetched.

Any pointers or suggestions would really be helpful.

TIA :)

b0b
  • You do realize that [WWW::Mechanize doesn't support JavaScript](https://metacpan.org/pod/WWW::Mechanize::FAQ#JavaScript), right? You can use [WWW::Mechanize::Firefox](https://metacpan.org/pod/WWW::Mechanize::Firefox) instead. – ThisSuitIsBlackNot Aug 04 '14 at 23:45
  • Yes. I do know. I only want to retrieve the html in the page. (After it is entirely loaded) – b0b Aug 04 '14 at 23:49
  • Is the JavaScript not modifying the DOM? – ThisSuitIsBlackNot Aug 05 '14 at 00:00
  • It is modifying the DOM, and the changes can be seen in the browser, but when retrieving through Mechanize, the page that says "Pls wait for 1 to 2 mins" is what gets loaded. – b0b Aug 05 '14 at 00:23
  • `WWW::Mechanize` is not a browser. None of the DOM changes made by JavaScript code loaded in your web browser will be visible to Mech. If the page works by making AJAX calls to a server and then creating DOM elements on-the-fly for displaying the results, you won't be able to see them with Mech. – ThisSuitIsBlackNot Aug 05 '14 at 00:32

1 Answer

It won't work with Mechanize itself. You first need to check what the JavaScript is doing to the page and where the data is coming from. Then there are 2 possibilities:

  • You mimic the JavaScript in Perl: fetch the initial page, find out where the JavaScript downloads the new data from, and request that data yourself. Check whether the data is encoded in some way, and decode it with Perl.
  • You use WWW::Mechanize::Firefox. Then you do not need to care about the JavaScript, as it will be handled by Firefox. You can hide the application if you do not want to see it.

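Example for the second option:

```perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;
use HTML::TreeBuilder::LibXML;

# Drives a real Firefox (via the MozRepl add-on), so the page's JavaScript runs.
my $mech = WWW::Mechanize::Firefox->new;
$mech->get('http://example.com/ajax.html');
sleep 180;  # get() returns on initial load; wait for the scripts (adjust as needed)

# content() returns the current DOM, including changes made by JavaScript.
my $tree = HTML::TreeBuilder::LibXML->new;
$tree->parse($mech->content);
$tree->eof;
my $something = $tree->findvalue('/html/body/div[10]/table');
```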

The code above is not tested, but should work.
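For the first option, a minimal sketch of the polling part, assuming the JavaScript gets its data from a URL that you can request directly and that you can recognize a "not ready yet" response. The helper name `poll_until_ready` and its parameters are made up for illustration, and the fetch is simulated here so the snippet runs without a network; the real URL has to be found by watching what the page requests in the browser.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Mechanize): call $fetch repeatedly
# until $is_ready accepts the body or the attempts run out.
# Returns the body, or undef if it never became ready.
sub poll_until_ready {
    my (%args) = @_;
    my $fetch    = $args{fetch};
    my $is_ready = $args{is_ready};
    my $attempts = $args{attempts} // 20;   # assumed default
    my $delay    = $args{delay}    // 10;   # seconds between tries, assumed default

    for my $try (1 .. $attempts) {
        my $body = $fetch->();
        return $body if defined $body && $is_ready->($body);
        sleep $delay if $try < $attempts && $delay > 0;
    }
    return;
}

# Simulated server: two "please wait" responses, then the real content.
my @responses = ('Please wait', 'Please wait', '<table>data</table>');
my $html = poll_until_ready(
    fetch    => sub { shift @responses },
    is_ready => sub { $_[0] !~ /Please wait/ },
    delay    => 0,
);
print "$html\n";   # prints <table>data</table>
```

With a real site, the `fetch` callback would be something like `sub { $mech->get($data_url); $mech->content }` on a plain WWW::Mechanize object, where `$data_url` is whatever address the page's JavaScript actually calls.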

Enjoy.

user2360915