Can I ask Perl 6's LWP::Simple to handle malformed UTF-8?

Question

I'm using LWP::Simple to fetch a webpage that has a couple of malformed characters in it, and my call to .get blows up on them. Instead, I'd like the decoder to insert replacement characters in the confused parts and keep going.

It looks like the response is a Buf object and LWP::Simple calls .decode on it. I'm still investigating, but the lack of documentation is making this more difficult than it should be.

Where do you find the documentation is lacking? Maybe on handling utf8 strings in the Perl 6 documentation? — jjmerelo, May 13 '18 at 06:27
@jjmerelo Note that jnthn redid/improved the encoding API a few days/weeks after brian wrote this question and then updated the docs. On the other hand, it still doesn't look clear to me what does or doesn't happen based on the doc. On the gripping hand, actually trying it might well show that it all works fine now. See also my answer update. — raiph, May 13 '18 at 14:11

score 1 · Answer 1 · answered May 28 '17 at 08:05

If I understand LWP::Simple's example script and implementation correctly, I think you're meant to handle a case like this by either...

Setting .force_encoding to use a less strict encoding:

use LWP::Simple;
my $lwp = LWP::Simple.new;

$lwp.force_encoding = 'utf8-c8';
say $lwp.get('http://www.google.com');

utf8 (the default) = UTF8, with invalid bytes causing an exception.
utf8-c8 = UTF8 with pass-through for invalid bytes.
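
To see the difference between the two encodings without involving LWP::Simple at all, here is a minimal sketch that decodes a hand-made Buf directly. The byte sequence is made up for illustration (0xFF can never appear in valid UTF-8); it is not taken from the page in the question.

```raku
# Hypothetical sample data: "hi" followed by 0xFF, which is invalid in UTF-8.
my $bytes = Buf.new(0x68, 0x69, 0xFF);

# Strict decoding: the utf8 decoder throws on the invalid byte,
# so `try` yields an undefined value and leaves the exception in $!.
my $strict = try $bytes.decode('utf8');
if $strict.defined {
    say "utf8 decoded: $strict";
}
else {
    say "utf8 decode failed: { $!.message }";
}

# Lenient decoding: utf8-c8 passes the invalid byte through as a
# synthetic character instead of throwing.
my $lenient = $bytes.decode('utf8-c8');
say "utf8-c8 decoded { $lenient.chars } characters";

# Encoding with utf8-c8 again should give back the original bytes.
say $bytes.list;
say $lenient.encode('utf8-c8').list;
```

Note that utf8-c8 preserves the bad bytes so they survive a round trip; it does not substitute U+FFFD for them.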

Setting .force_no_encode to get the result as a Buf:

use LWP::Simple;
my $lwp = LWP::Simple.new;

$lwp.force_no_encode = True;
say $lwp.get('http://www.google.com');

I can't test it though, because LWP::Simple (installed with zef) doesn't work at all for me. (Not sure if the problem is with my Perl 6 set-up.)

My impression is that this module is not very polished right now. It's not just the lack of documentation – the API also appears to have been partially cargo-cult copied from the Perl 5 module (even parts that make less sense in Perl 6), and partially evolved by different committers adding features here and there without much design focus.

The utf8-c8 encoding doesn't work here because it works to preserve oddities in the decode, and force_no_encode returns a buffer that I'd still need to decode. I don't think it's a problem with LWP::Simple so much as Perl 6's limited ability to decode. — brian d foy, May 28 '17 at 14:14

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

May 2018 update

This is hopefully slightly better than no update at all. I hope to find out more and then replace this with a simpler update when there's something more useful to say.

jnthn committed a new encoder API a few weeks after brian wrote his question.
There have been subsequent commits mentioning "replacement" (mostly about Unicode replacement characters).
What looks to me like the relevant doc for built-in Perl 6 decoding control doesn't mention replacement characters, even though the doc for encoding control does ("Built-in encodings now all support ... either a Str replacement sequence or True to use a default replacement sequence for unencodable characters"), and even though what looks to me like the relevant Rakudo source code shows use of a :replacement adverb in both decoder and encoder methods.

In the meantime, I don't see any commit to LWP::Simple that relates to this. That said, perhaps the Buf and decode solution now works?

From #perl6 earlier today:

does the decoder API provide an option to choose whether to throw an error or insert � when it finds invalid bytes?

jnthn's answer was:

At the moment it always throws an error

Until now [it wasn't a good time to enable that option]

Whereas now [is a better time to improve the encoder]
