I'm using

my $ua = new LWP::UserAgent;
$ua->agent("Mozilla/5.0 (Windows NT 6.1; Intel Mac OS X 10.6; rv:7.0.1) Gecko/20100101 Firefox/7.0.1 ");
my $url = "http://somedomain.com/page/";
my $req = new HTTP::Request 'GET' => $url;
$req->header('Accept' => 'text/html');
my $response = $ua->request($req);
my $html = $response->decoded_content;

to get a web page. On this page, the title Abobo's Big Adventure appears. In both $response->content and $response->decoded_content, it shows up as Abobo&#39;s Big Adventure.

Is there something I can do to make this decode correctly?

Ivy

2 Answers

Why, that is completely valid HTML! However, you can decode the entities using HTML::Entities from CPAN.

use HTML::Entities;

...;
my $html = $response->decoded_content;
my $decoded_string = decode_entities($html);

The docs for HTTP::Response::decoded_content state that only the Content-Encoding and the charset are reversed, not HTML entities (which are an HTML/XML language feature, not really an encoding).
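To make the distinction concrete, here is a minimal sketch. The sample string is made up and stands in for what $response->decoded_content would return: the charset has already been handled, but the entities are still in place until decode_entities is applied.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Made-up sample standing in for $response->decoded_content:
# charset decoding is done, but the entities are untouched.
my $html = 'Abobo&#39;s Big Adventure &amp; more';

# Called in non-void context, decode_entities returns a decoded copy
# and leaves $html as it was.
my $text = decode_entities($html);

print "$text\n";
```

Note that calling decode_entities in void context modifies its argument in place, so assign the result if you still need the original markup.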

Edit:

However, as ikegami pointed out, decoding the entities immediately could render the HTML unparsable. Therefore, it might be best to parse the HTML first (e.g. using HTML::Tree), and then decode only the text nodes when needed.

use HTML::Entities;      # provides decode_entities
use HTML::TreeBuilder;

my $url = ...;
my $tree = HTML::TreeBuilder->new_from_url($url);    # invokes LWP automatically
my $decoded_text = decode_entities($tree->as_text);  # dumps the tree as flat text, then decodes.
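If you fetch the page yourself (for instance to set your own user agent string), the same idea works on a string via new_from_content. The sketch below uses made-up markup in place of a real response. As far as I can tell from the HTML::TreeBuilder docs, entities are expanded during parsing unless no_expand_entities is set, so as_text should already give decoded text here and a further decode_entities call may be redundant.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up markup standing in for $response->decoded_content.
my $html = '<html><body><h1>Abobo&#39;s Big Adventure</h1>'
         . '<p>Tips &amp; tricks</p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down in scalar context returns the first matching element.
my $title = $tree->look_down(_tag => 'h1')->as_text;
my $para  = $tree->look_down(_tag => 'p')->as_text;

$tree->delete;   # release the tree once the text has been extracted

print "$title\n$para\n";
```

Extracting text per element like this keeps the markup intact until the last moment, which avoids the problem of decoded text being mistaken for tags.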
amon
  • That doesn't produce HTML as your variable name implies. Take `<i>/foo<i>` for example. – ikegami Dec 19 '12 at 23:03
  • Did you mean `decode_entities` instead of `deparse_entities`? Other than that, the tree did what I needed and also cleaned up my code nicely. The only thing is that I had to get the HTML with LWP separately and use `HTML::TreeBuilder->new_from_content($html)` because I didn't see a way to set the UserAgent string. Thank you! – Ivy Dec 20 '12 at 19:05

I'm guessing there is probably an ampersand before the hash mark, which makes it the HTML entity &#39;. These aren't that hard to change. You can do something like this:

my $content =  $response->decoded_content;
$content    
    =~ s{(&#(\d{2,3});)}{
           # chr turns the code point into its character; non-ASCII stays as-is
           $2 < 128 ? chr( $2 ) : $1
        }gem
    ;

The range check ensures that only ASCII code points are converted; anything else is left as it was. If you want to get more complex, you could also put together a hash of values, and change it like so:

my %entity_lookup
    = ( 150 => '-'
      , 151 => '--' # m-dash
      , 160 => ' '
      ... 
    );
...
$content
    =~ s{(&#(\d+);)}{ 
           $2 < 128 ? chr( $2 ) : $entity_lookup{ $2 } // $1
        }gem
    ;

But that would be up to you.
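For a fallback that needs nothing outside core Perl, the substitution can be wrapped in a small helper. This is only a sketch and a variant of the code above, not the same thing: decode_numeric_entities is a made-up name, it also accepts hexadecimal references such as &#x27;, it drops the ASCII range check, and it leaves named entities like &amp; alone. It does no validation of the code point.

```perl
use strict;
use warnings;

# Hypothetical helper: decodes decimal (&#39;) and hexadecimal (&#x27;)
# numeric character references. Named entities are not touched.
sub decode_numeric_entities {
    my ($text) = @_;
    # $1 captures hex digits, $2 captures decimal digits; only one is set.
    $text =~ s{&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));}
              { chr(defined $1 ? hex($1) : $2) }ge;
    return $text;
}

print decode_numeric_entities('Abobo&#39;s Big &#x41;dventure &amp; more'), "\n";
```

If the page can contain references above 255, the result holds wide characters, so encode it (for example as UTF-8) before printing.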

Axeman
  • The correct way to do this is to use [HTML::Entities](https://metacpan.org/module/HTML::Entities) or a similar well-tested module. – friedo Dec 19 '12 at 21:49
  • I agree that a well-tested module would be the better approach, but I may not have the permissions I need to install additional modules from CPAN. So I'm happy to know there's a fallback. Thanks – Ivy Dec 20 '12 at 17:34