How can I avoid encoding errors using Net::OpenID::Consumer with Yahoo OpenIDs?

Question

I've written a Dancer web app that utilizes Net::OpenID::Consumer to consume OpenIDs for authentication. It works well with Google and MyOpenID, but not Yahoo. When a user tries to authenticate using their Yahoo account, HTML::Parser warns:

    Parsing of undecoded UTF-8 will give garbage when decoding entities

and this warning kills my app (rightfully so).

I don't see any existing bugs with Net::OpenID::Consumer (or Common) that relate to this.
The HTTP headers and the HTML meta tags both specify UTF-8 for the 'claimed id' URI.
Why would the response not be decoded for HTML::Parser? Am I missing something obvious?

Here's the relevant code:

    use Dancer;
    use Net::OpenID::Consumer;
    use LWP::UserAgent;

    # $secret is defined elsewhere in the app
    get '/openid_landing' => sub {
        my $params = params();
        my $csr = Net::OpenID::Consumer->new(
            ua => LWP::UserAgent->new(),
            consumer_secret => $secret,
            params => $params,
        );
        my $id = $params->{'openid.claimed_id'};

        if (my $setup_url = $csr->user_setup_url) {
            redirect $setup_url;

        } elsif ($csr->user_cancel) {
            redirect uri_for('/');

        } elsif (my $vident = $csr->verified_identity) {
            # verified identity, log in or register user
            ...

        } else {
            die "Error validating identity: " . $csr->err;
        }
    };

[Show your code](http://sscce.org) so that people may [reproduce the problem](http://www.chiark.greenend.org.uk/~sgtatham/bugs.html#showmehow). — daxim, Jun 25 '12 at 16:13
Sounds like you didn't decode the HTML before passing it to Parser, so decode it. If this was LWP, I'd say use `->decoded_content` instead of `->content`. — Ωmega, Jun 25 '12 at 16:25
user1215106, Net::OpenID::Common is grabbing and parsing the HTML, not my code. — kbosak, Jun 25 '12 at 16:26

score 1 · Accepted Answer · answered Jul 03 '12 at 18:46

The bug is in Net/OpenID/URIFetch.pm, on lines 122-128 of version 1.14 (the latest). It uses the raw content of the response object instead of the decoded content, so the UTF-8 bytes reach HTML::Parser undecoded. To fix it, remove the manual gzip decoding and call the response's decoded_content method, which undoes both the Content-Encoding and the charset encoding.

I haven't filed a bug report yet, so feel free to. :)

Here's a diff you can apply to fix it:

    122c122
    <         my $content = $res->content;
    ---
    >         my $content = $res->decoded_content;
    126,129d125
    <         if ($res->content_encoding && $res->content_encoding eq 'gzip') {
    <             $content = Compress::Zlib::memGunzip($content);
    <         }
    <
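
To see the difference between the two accessors, here is a minimal sketch that builds an HTTP::Response by hand and compares `content` with `decoded_content`. It assumes the HTTP-Message distribution (which LWP depends on) is installed; the HTML body is made up for illustration and is not what Yahoo actually returns.

```perl
use strict;
use warnings;
use HTTP::Response;
use Encode qw(encode);

# Hypothetical page body containing one non-ASCII character (e-acute).
my $html  = "<html><head><title>caf\x{e9}</title></head></html>";
my $bytes = encode('UTF-8', $html);    # the octets that travel over the wire

my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $bytes,
);

my $raw     = $res->content;            # still undecoded UTF-8 octets
my $decoded = $res->decoded_content;    # Perl character string

printf "raw:     %d units\n", length $raw;
printf "decoded: %d units\n", length $decoded;
print  "decoded matches the original text\n" if $decoded eq $html;
```

The raw string is one unit longer because the e-acute occupies two octets in UTF-8; only the decoded string is safe to hand to an HTML parser that decodes entities.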

Awesome, thanks! I'll file a bug report on this soon and will link this post. — kbosak, Jul 04 '12 at 14:54

score 0 · Answer 2 · edited Apr 27 '14 at 17:45

The warning comes from the HTML::Parser module, which HTML::TreeBuilder uses under the hood. Before passing the page contents to TreeBuilder, run them through decode_utf8:

    use HTML::TreeBuilder;
    use Encode;
    my $contents = ...;
    my $htree = HTML::TreeBuilder->new_from_content(decode_utf8 $contents);

For more information, see:

http://metacpan.org/pod/HTML::TreeBuilder#new-from-content

http://search.cpan.org/dist/HTML-Parser/Parser.pm
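
As a rough illustration of the kind of garbage the warning refers to, the sketch below uses only core Perl rather than HTML::Parser itself. It mixes undecoded UTF-8 bytes with a character such as a decoded entity would produce; the sample strings are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

my $bytes  = "caf\xC3\xA9 ";    # "cafe" with e-acute, as undecoded UTF-8 bytes
my $smiley = chr(0x263A);       # the character that &#x263A; decodes to

# Mixing undecoded bytes with a decoded entity: Perl treats each byte as a
# Latin-1 character, so the two bytes of the e-acute become two characters.
my $garbled = $bytes . $smiley;

# Decoding first keeps the e-acute as a single character.
my $clean = decode_utf8($bytes) . $smiley;

printf "garbled: %d characters\n", length $garbled;
printf "clean:   %d characters\n", length $clean;
```

The garbled string ends up one character longer than the clean one, which is why the input should be decoded before the parser starts expanding entities.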
