
This script outputs only the first few lines of the page.

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://www.eurogamer.net/articles/df-hardware-wii-u-graphics-power-finally-revealed');
print $response->decoded_content;
  • I got the complete HTML page running exactly what you typed above. What was your output? Did it include an error message of any sort? – jmcneirney Feb 06 '13 at 22:49
  • Here is the output I get: http://pastebin.com/wVcNBJeg –  Feb 07 '13 at 10:17

2 Answers


I ran the following modification:

use feature 'say';    # say() is not enabled by default

my $response = $ua->get( 'http://www.eurogamer.net/articles/df-hardware-wii-u-graphics-power-finally-revealed' );
say $response->headers->as_string;

And saw this:

Cache-Control: max-age=60s
Connection: close
Date: Wed, 06 Feb 2013 23:51:15 GMT
Via: 1.1 varnish
Age: 0
Server: Apache
Vary: Accept-Encoding
Content-Length: 50519
Content-Type: text/html; charset=ISO-8859-1
Client-Aborted: die
Client-Date: Wed, 06 Feb 2013 23:50:50 GMT
Client-Peer: 94.198.83.18:80
Client-Response-Num: 1
X-Died: Illegal field name 'X-Meta-Twitter:card' at .../HTML/HeadParser.pm line 207.
X-Varnish: 630361704

It doesn't seem to like the <meta name="twitter:card" content="summary" /> tag on line 27 of the page. The Client-Aborted and X-Died headers say that the parser died there.
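A minimal sketch of how a script could notice this condition itself, instead of silently printing a truncated body. The helper name `truncation_reason` is hypothetical, and the response object is a hand-built stand-in carrying the two headers from the dump above, so that the example runs without network access:

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: returns a description if the response carries an
# X-Died header (an exception was recorded while handling it), else undef.
sub truncation_reason {
    my ($response) = @_;
    my $died = $response->header('X-Died');
    return unless defined $died;
    my $aborted = $response->header('Client-Aborted') // 'unknown';
    return "response aborted ($aborted): $died";
}

# Stand-in for the real response, with the two relevant headers copied in.
my $response = HTTP::Response->new(200, 'OK', [
    'Client-Aborted' => 'die',
    'X-Died'         => "Illegal field name 'X-Meta-Twitter:card'",
]);

my $reason = truncation_reason($response);
warn "Body may be incomplete: $reason\n" if defined $reason;
```

In a real script the same check would be applied to the object returned by `$ua->get(...)` before trusting `decoded_content`.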

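HTML::HeadParser seems to translate any meta tag with a name attribute into an "X-Meta-\u$attr->{name}" "header", and then tries to store the value of the content attribute as that header's value. For twitter:card this produces the key X-Meta-Twitter:card, which is rejected as an illegal field name, presumably because of the colon. The relevant code (starting at line 194):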

if ($tag eq 'meta') {
    my $key = $attr->{'http-equiv'};
    if (!defined($key) || !length($key)) {
        if ($attr->{name}) {
            $key = "X-Meta-\u$attr->{name}"; # <-- Here's the little trick
        } elsif ($attr->{charset}) { # HTML 5 <meta charset="...">
            $key = "X-Meta-Charset";
            $self->{header}->push_header($key => $attr->{charset});
            return;
        } else {
            return;
        }
    }
    $self->{'header'}->push_header($key => $attr->{content});
}

I put a modified copy of this module in a PERL5LIB directory, with the push_header step wrapped in an eval block, and the page then downloaded completely.
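The change can be sketched roughly as follows. This is not the actual patch: `push_meta_header` is a hypothetical stand-alone function that mirrors the name/http-equiv part of the branch quoted above and operates on a plain HTTP::Headers object, so that the effect of the eval can be tried in isolation:

```perl
use strict;
use warnings;
use HTTP::Headers;

# Hypothetical stand-in for the <meta> branch quoted above.
# Returns true if the header was stored, false if it was skipped.
sub push_meta_header {
    my ($headers, $attr) = @_;
    my $key = $attr->{'http-equiv'};
    if (!defined($key) || !length($key)) {
        return unless $attr->{name};
        $key = "X-Meta-\u$attr->{name}";
    }
    # eval traps the exception, so one unacceptable field name
    # no longer aborts the whole download.
    my $ok = eval { $headers->push_header($key => $attr->{content}); 1 };
    warn "Skipping <meta> header '$key': $@" unless $ok;
    return $ok;
}

my $headers = HTTP::Headers->new;
push_meta_header($headers, { name => 'description',  content => 'Wii U graphics' });
push_meta_header($headers, { name => 'twitter:card', content => 'summary' });
print $headers->as_string;
```

The second call is the one that corresponds to the failing tag: with the eval in place it produces a warning and is skipped, while the first header is stored as usual.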

Borodin
Axeman
  • Upgrading HTML::Parser might help, see: https://rt.cpan.org/Ticket/Display.html?id=85119 – Slaven Rezic Jul 19 '13 at 13:10
  • @Borodin, what you deleted *wasn't* "tribalism", but a simple acknowledgement that when Perl programs meet a condition they can't handle they *`die`*--it's what they *do*. The term "died" is appropriate for a Perl program. If it was caught within an `eval` block, a Perlish way to identify it *might* be the `X-Died` header. – Axeman Jul 19 '13 at 14:40
  • @SlavenRezic, indeed, it does appear that they handled it. My hack was only to demonstrate that it was mainly the inability to handle the http-equiv tag that answered: "Why can't `LWP::UserAgent` get this site entirely?" If it had still failed to get the site, my change would have been indicative of nothing. – Axeman Jul 19 '13 at 14:46
  • @Axeman: And what do other languages do when they meet an unhandled exception? – Borodin Jul 20 '13 at 09:03
  • @Borodin, they "die". You wouldn't get an X- *header* from an unhandled exception, though. Also "unhandled exception" and a "condition [the program] can't handle" can be two different things. One can be in a place where the programmer realizes he can't do anything from here and specifically calls `die "Error from subsystem: $@";` – Axeman Jul 22 '13 at 17:10

I had exactly the same problem...

I fixed it by disabling the 'parse_head' option, which controls whether HTML::HeadParser is run on the response.

    $self->{ua}->parse_head(0);

I know it is not a very good idea to disable this functionality, but I prefer availability to correctly decoded documents.
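Applied to a stand-alone script like the one in the question, the workaround might look like the sketch below. Taking the URL from the command line is an assumption made here so that nothing is fetched unless a URL is supplied:

```perl
#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->parse_head(0);    # do not run HTML::HeadParser on the response

# Fetch only when a URL is given as the first argument.
if (my $url = shift @ARGV) {
    my $response = $ua->get($url);
    die 'Request failed: ', $response->status_line, "\n"
        unless $response->is_success;
    print $response->decoded_content;
}
```

Since the head section is no longer parsed, a tag such as the twitter:card one never reaches push_header.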

reto