-4

Usually I write scrapers in Ruby, but this time I decided to use Perl. When I run my script, a number of URLs open very, very slowly. I thought it might be a redirect problem, or that the pages rely on JavaScript. So I decided to use a module that can open JavaScript-driven sites. I looked at the CPAN documentation, took the example code and tried to run it. Nothing: no content. What am I doing wrong? Please correct me, or give me some advice. I also tried Selenium, but had problems installing it; I get an error when I try to run Selenium from the Linux console.

use WWW::Scripter;

  $w = new WWW::Scripter;  
  $w->use_plugin('JavaScript');

   open(FH, "<links.csv");
   while (<FH>) {
    $url =  $_;

    if ( $url !~ /http(s)/ ) {
        $url = "http://".$url;
    }

    $w->get(url);
    $html = $w->content;

    print "=======\n";
    print Dupmper $w->content;
    print "=======\n";
}
  • Besides that, there is WWW::Mechanize::Firefox and ::Chrome. They remote-control a browser window on your machine, so you need X when you're on Linux. ::Chrome was released about a week ago and is still not very feature-complete. It should support a headless mode where you do not need a window manager. You can use it to pull out the final source code after the JS has run and work on that. – simbabque Jun 29 '17 at 10:31
  • When I try to use WWW::Mechanize::Firefox I get an error: root@antonov:/var/www/html/work8# perl work8.pl Failed to connect to , problem connecting to "localhost", port 4242: Connection refused at /usr/local/share/perl/5.22.1/MozRepl/Client.pm line 144. Do I need to start some daemon before using this module? – rogersnest Jun 29 '17 at 11:21
  • You need to install a Firefox add-on and start it so the Perl module can talk to Firefox. There is a troubleshooting or FAQ section in the pod that explains how to do that. I think it is at the end. I don't have a computer right now so I can't link it. The add-on is called mozrepl. – simbabque Jun 29 '17 at 11:24

2 Answers

2
$w->get(url);

It's not url, it's $url. Use strict and warnings.
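
To see why strict matters here, the following is a minimal sketch (the variable names are hypothetical and not taken from the question) that compiles a snippet containing the bareword url under strict and prints the resulting error. Without strict, Perl silently treats the bareword as the string "url", so the request goes to the wrong address and no error is reported.

```perl
use strict;
use warnings;

# Compile a small snippet inside a string eval so the compile-time
# error is captured in $@ instead of aborting this script.
my $ok = eval q{
    use strict;
    my $target = url;   # bareword, same mistake as get(url)
    1;
};
my $err = $@;

# Under "strict subs" the snippet does not compile.
print "strict rejected the bareword: $err" unless $ok;
```

With use strict at the top of the original script, the same error would be reported at compile time, before any request is made.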

choroba
1

Firstly, you should always use strict and use warnings in your Perl programs. They would have picked up your typo.

Secondly, you should have checked the return code from get() as that would have shown you there was something wrong.

Thirdly, there are a few outdated Perl programming practices in your code.

# Always use these
use strict;
use warnings;

use WWW::Scripter;
# Added this
use Data::Dumper;

# Don't use indirect object notation.
# Declare variable
my $w = WWW::Scripter->new;
$w->use_plugin('JavaScript');

# Three-arg version of open()
# Lexical filehandle
# Check result of open() and die on failure
open(my $in_fh, '<', 'links.csv') or die $!;

while (my $url = <$in_fh>) {
  # Remove the trailing newline
  chomp $url;

  # Fixed regex.
  # 1/ Anchored at start of string
  # 2/ Made 's' optional (and non-captured)
  if ( $url !~ /^https?/ ) {
    # Use string interpolation
    $url = "http://$url";
  }

  # Capture HTTP response
  # Use '$url', not 'url'
  my $resp = $w->get($url);

  # Check request is successful
  unless ($resp->is_success) {
    # If not, warn and move on to the next URL
    warn $resp->status_line;
    next;
  }

  print "=======\n";
  print Dumper $w->content;
  print "=======\n";
}
Dave Cross
  • Well, it's strange. It works. But when it tries to parse HTML another time, I get a very big error. With LWP I don't get this error; everything works pretty well, except for very long pauses when fetching HTML from some sites. Here is the error text: Use of uninitialized value $ms in division (/) at /usr/local/share/perl/5.22.1/WWW/Scripter.pm line 825. – rogersnest Jun 29 '17 at 12:15
  • I can't publish the full text of the error here, but I can give you a link to it: https://justpaste.it/18e33 – rogersnest Jun 29 '17 at 12:17
  • If you have extra useful information to share, then please edit your question to add it. – Dave Cross Jun 29 '17 at 12:18
  • Any idea why I get very long pauses when fetching HTML from a URL? Up to an hour, maybe more. Why? – rogersnest Jun 29 '17 at 12:23
  • Maybe a problem with redirects? – rogersnest Jun 29 '17 at 12:24
  • @DenisAntonov Why do you keep suggesting completely unrelated edits to my answer? That's incredibly rude. – Dave Cross Jun 30 '17 at 10:26
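
On the long pauses raised in the comments: one thing that may help narrow this down is capping how long each request can stall and how many redirects are followed. The sketch below uses plain LWP::UserAgent, which WWW::Mechanize (and therefore WWW::Scripter) inherits from, so the same timeout and max_redirect settings should be available on the $w object. The limits and the URL are hypothetical values chosen for illustration. Note that the LWP timeout applies to inactivity on the connection, not to the total duration of a request, so it is not a hard upper bound.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical limits; tune them for the sites being scraped.
my $ua = LWP::UserAgent->new(
    timeout      => 15,  # give up after 15 seconds of inactivity
    max_redirect => 5,   # stop following a redirect chain after 5 hops
);

# Hypothetical URL, for illustration only.
my $url = 'http://example.com/';

my $start = time;
my $resp  = $ua->get($url);
my $took  = time - $start;

if ($resp->is_success) {
    # In scalar context, redirects() returns the number of redirects followed.
    printf "%s OK in %ds, %d redirect(s)\n",
        $url, $took, scalar $resp->redirects;
}
else {
    warn sprintf "%s failed after %ds: %s\n",
        $url, $took, $resp->status_line;
}
```

Logging the elapsed time and redirect count per URL should show whether the slow sites are stalling on the connection or bouncing through redirects.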