
Problem: I have a list of 2,500 websites and need to grab a thumbnail screenshot of each of them. How do I do that? I could try to parse the sites with Perl; Mechanize would be a good tool. Note: I only need the results as thumbnails that are a maximum of 240 pixels in the long dimension. At the moment I have a solution which is slow and does not give back thumbnails. How can I make the script run faster, with less overhead, while spitting out the thumbnails?

Prerequisites: the mozrepl add-on for Firefox; the module WWW::Mechanize::Firefox; the module Imager.

First Approach: Here is a first Perl solution:

 use WWW::Mechanize::Firefox;
 my $mech = WWW::Mechanize::Firefox->new();
 $mech->get('http://google.com');
 my $png = $mech->content_as_png();  # raw PNG data of the rendered page

Outline: This returns the given tab or the current page rendered as a PNG image. All parameters are optional. $tab defaults to the current tab. If coordinates are given, that rectangle will be cut out. The coordinates should be a hash with the four usual entries: left, top, width, height. This is specific to WWW::Mechanize::Firefox.
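For illustration, a sketch of the cut-out option; the rectangle values are made up, and passing undef to fall back to the current tab is an assumption based on the parameters being optional:

# Cut out an 800x600 rectangle from the top-left corner of the page
# (left/top/width/height are the four usual entries).
my $png = $mech->content_as_png(undef, {
    left   => 0,
    top    => 0,
    width  => 800,
    height => 600,
});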

As I understand the perldoc, that coordinate option does not resize the whole page; it just cuts a rectangle out of it. WWW::Mechanize::Firefox takes care of how to capture screenshots, but I forgot to mention that I only need the images as small thumbnails, so we do not want very large files. I did a lookup on CPAN for a module that scales down the $png and found Imager.

The Mechanize module does not concern itself with resizing images. For that there are the various image modules on CPAN, such as Imager. From its documentation: "Imager - Perl extension for Generating 24 bit Images: Imager is a module for creating and altering images. It can read and write various image formats, draw primitive shapes like lines and polygons, blend multiple images together in various ways, scale, crop, render text and more." I installed the module, but I have not yet extended my basic approach with it.
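A minimal sketch of the scaling step, assuming Imager was built with PNG support; type => 'min' keeps the aspect ratio and caps the longer dimension at 240 pixels:

use Imager;

# $png holds the raw PNG data from content_as_png()
my $img = Imager->new;
$img->read(data => $png, type => 'png')
    or die $img->errstr;

# Fit the image into a 240x240 box; 'min' keeps the aspect ratio
# and makes the longer side at most 240 pixels.
my $thumb = $img->scale(
    xpixels => 240,
    ypixels => 240,
    type    => 'min',
) or die $img->errstr;

my $thumb_data;
$thumb->write(data => \$thumb_data, type => 'png')
    or die $thumb->errstr;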

What I have tried already; here it is:

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

open(my $input, '<', 'urls.txt') or die "Cannot open urls.txt: $!";

while (my $url = <$input>) {
    chomp $url;
    print "$url\n";

    $mech->get($url);
    my $png = $mech->content_as_png();

    # Derive the file name from the URL, without the leading "www."
    (my $name = $url) =~ s/^www\.//;
    $name .= ".png";

    open(my $output, '>', $name) or die "Cannot write $name: $!";
    binmode $output;    # PNG data is binary
    print {$output} $png;
    close $output;

    sleep(5);
}

close $input;

Well, this does not take care of the size. See the command-line output:

linux-vi17:/home/martin/perl # perl mecha_test_1.pl
www.google.com
www.cnn.com
www.msnbc.com
command timed-out at /usr/lib/perl5/site_perl/5.12.3/MozRepl/Client.pm line 186
linux-vi17:/home/martin/perl #

This is my source. See a snippet (example) of the sites I have in the URL list.

urls.txt (the list of sources):

www.google.com
www.cnn.com
www.msnbc.com
news.bbc.co.uk
www.bing.com
www.yahoo.com

Question: how do I extend the solution to make sure that it does not stop on a timeout, and so that it only stores little thumbnails? Note again: I only need the results as thumbnails that are a maximum of 240 pixels in the long dimension. As a prerequisite, I have already installed the module Imager.

How can I make the script run faster, with less overhead, while spitting out the thumbnails?

I'd love to hear from you! Greetings, zero

Update: in addition to Schwern's ideas, which are very interesting, I found an interesting PerlMonks thread which talks about the same timeouts:

Is there a way to specify the Net::Telnet timeout with WWW::Mechanize::Firefox? At the moment my internet connection is very slow and sometimes I get an error with

 $mech->get(): command timed-out at /usr/local/share/perl/5.10.1/MozRepl/Client.pm line 186

Perhaps I have to look at the mozrepl timeout configuration!? But after all: this is weird and I don't know where that timeout comes from. Maybe it really is Firefox timing out as it is busy synchronously fetching some result. As you see in the trace, WWW::Mechanize::Firefox polls every second (or so) to see whether Firefox has fetched a page.

If it really is Net::Telnet, then you'll have to dive down:

$mech->repl->repl->client->{telnet}->timeout($new_timeout);
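In context, that one-liner could sit right after the constructor; a sketch, assuming those undocumented internals stay as they are (300 seconds is just an example value):

use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

# Dive through the undocumented layers to the underlying Net::Telnet
# object and give it a more generous timeout, in seconds.
$mech->repl->repl->client->{telnet}->timeout(300);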

**Update:** so the question is: can I make use of **Net::Telnet**, which is supposedly in the Perl core?

@Alexandr Ciornii: thanks for the hint! Subsequently I would do it like this: use Net::Telnet; but if it is not in the core then I cannot go like this. @daxim: $ corelist Net::Telnet␤␤Net::Telnet was not in CORE, which means I cannot go like above.

By the way: as Øyvind Skaar mentioned, with that many URLs we have to expect that some will fail, and handle that. For example, we put the failed ones in an array or hash and retry them X times.
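A minimal sketch of that retry idea; @urls and the grab_thumbnail() helper are hypothetical placeholders standing in for the loop body above:

# Try every URL; collect failures and retry them for up to
# $max_tries rounds before giving up for good.
my $max_tries = 3;
my @todo      = @urls;

for my $try (1 .. $max_tries) {
    my @retry;
    for my $url (@todo) {
        # grab_thumbnail() is a hypothetical helper wrapping
        # $mech->get(), content_as_png() and the Imager scaling.
        eval { grab_thumbnail($url); 1 } or do {
            warn "try $try failed for $url: $@";
            push @retry, $url;
        };
    }
    @todo = @retry;
    last unless @todo;
}

warn "gave up on: @todo" if @todo;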

zero
  • I'd use fork() for multiprocessing to speed things up a bit... – Alex Ackerman Feb 20 '12 at 00:00
  • Hello Alex - what about grabbing very small images (thumbnails), does this speed up the process!? Note: I do not need big images; very small images fit very well. What do you think!? Note - I need some ideas on how to apply Imager – zero Feb 20 '12 at 00:11
  • 1
    With that many url's you have to expect that some will fail and handle that. For example, put the failed ones in an array or hash and retry them X times. – Øyvind Skaar Feb 20 '12 at 09:05
  • hi Øyvind Skaar, thanks for the comment - I guess I can fix the timeout with $mech->repl->repl->client->{telnet}->timeout($new_timeout); - your idea sounds convincing! How would you do this with the array or hash? Can you explain? [just edit above] Thanks for the help in advance – zero Feb 20 '12 at 09:20
  • hi buddies, thanks for the replies - you are just great! BTW, what about Image::Magick::Thumbnail, which produces thumbnail images with ImageMagick? I guess that this does not run into timeouts - but I am not very sure!? What do you think? I'll do some investigating... I'll come back and report all my findings. Greetings and many thanks for all you did! – zero Feb 21 '12 at 04:57

1 Answer


Look into Parallel::ForkManager, which is one of the easier and more reliable ways to do parallel processing in Perl. Most of your work will be network and I/O bound: your CPU will be waiting around for the remote web server to return, so you're likely to get some big wins.
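As a sketch of the idea (hedged: each child would need its own Firefox connection, since one WWW::Mechanize::Firefox object cannot safely be shared across forks; fetch_and_save() is a hypothetical per-URL helper):

use Parallel::ForkManager;

# Run at most 5 children at a time; each child handles one URL.
my $pm = Parallel::ForkManager->new(5);

for my $url (@urls) {
    $pm->start and next;     # parent: spawn child, move to next URL

    # Child process: do the per-URL work, then exit.
    fetch_and_save($url);    # hypothetical helper

    $pm->finish;
}
$pm->wait_all_children;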

As for the timeout, that's somewhere inside MozRepl and defaults to 10 seconds. You'd either have to create a MozRepl::Client object with a different timeout and somehow get WWW::Mechanize::Firefox to use it, or you can do some undocumented things. This perlmonks thread shows how to change the timeout. There's also an undocumented MOZREPL_TIMEOUT environment variable which you can set.
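For example, a sketch of the environment variable route; that it must be set before the object connects is an assumption, and 60 is just an example value in seconds:

# Assumed: must be set before WWW::Mechanize::Firefox connects to mozrepl.
$ENV{MOZREPL_TIMEOUT} = 60;

my $mech = WWW::Mechanize::Firefox->new();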

Schwern
  • hello Schwern, many many thanks for the hints! I'll do as advised! greetings – zero Feb 20 '12 at 04:53
  • hello again Schwern: btw, there is a PerlMonks thread http://www.perlmonks.org/?node_id=901572 which talks about the same timeouts: "Is there a way to specify the Net::Telnet timeout with WWW::Mechanize::Firefox? At the moment my internet connection is very slow and sometimes I get an error with $mech->get(): command timed-out at /usr/local/share/perl/5.10.1/MozRepl/Client.pm line 186". Well, I think this is very weird and I don't know where that timeout comes from. Maybe it really is Firefox timing out as it is busy synchronously fetching some result. – zero Feb 20 '12 at 07:50
  • well, what do you think? The monks say: if it really is Net::Telnet, then you'll have to dive down: $mech->repl->repl->client->{telnet}->timeout($new_timeout); So the question is: do I need Net::Telnet (http://search.cpan.org/~jrogers/Net-Telnet-3.03/lib/Net/Telnet.pm)? That's the question - do I need it or not!? Can I run the code with or without Net::Telnet? – zero Feb 20 '12 at 08:38
  • 1
    zero: Net::Telnet is bundled with Perl – Alexandr Ciornii Feb 20 '12 at 10:27
  • 1
    `$ corelist Net::Telnet␤␤Net::Telnet was not in CORE (or so I think)` – daxim Feb 20 '12 at 10:40
  • @Alexandr Ciornii: thanks for the hint! Subsequently I would do it like this: use Net::Telnet; but if it is not in the core then I cannot go like this. @daxim: $ corelist Net::Telnet␤␤Net::Telnet was not in CORE - that means I cannot go like above. **btw**: like Øyvind Skaar mentioned, with that many URLs we have to expect that some will fail and handle that. For example, we put the failed ones in an array or hash and retry them X times. – zero Feb 20 '12 at 11:57