Can Encode::Guess tell utf-8 from iso-8859-1?

Question

I have a string $data that is encoded in utf-8, but suppose I don't know whether it is utf-8 or iso-8859-1. I want to use the Perl Encode::Guess module to find out which of the two it is, and I'm having trouble figuring out how the module works.

I have tried the following four methods (from http://perldoc.perl.org/Encode/Guess.html):

use Encode::Guess qw/utf8 latin1/;

my $decoder = guess_encoding($data);

print "$decoder\n";

Result: iso-8859-1 or utf8

use Encode::Guess qw/utf8 latin1/;

my $enc = guess_encoding($data, qw/utf8 latin1/);
ref($enc) or die "Can't guess: $enc";
my $utf8 = $enc->decode($data); 

print "$utf8\n";

Result: Can't guess: iso-8859-1 or utf8 at encodage-windows.pl line 25, line 18110.

use Encode::Guess qw/utf8 latin1/;

my $decoder = Encode::Guess->guess($data);
die $decoder unless ref($decoder);
my $utf8 = $decoder->decode($data);

print "$utf8\n";

Result: iso-8859-1 or utf8 at encodage-windows.pl line 30, line 18110.

use Encode::Guess qw/utf8 latin1/;

my $utf8 = Encode::decode("Guess", $data);

print "$utf8\n";

Result: iso-8859-1 or utf8 at /usr/local/lib/perl5/Encode.pm line 175.

My first question is: which one of these methods am I supposed to use (if any)? And my second question: what changes should I make to make this work?

Using Encode::Guess is overkill. See http://stackoverflow.com/a/22868803/589924 — ikegami, Apr 11 '14 at 14:30
@ikegami Is it overkill in the case of utf-8 versus latin1, or overkill in general? It seemed more straightforward to use a module than to try to decode it, but I could be mistaken. — kormak, Apr 11 '14 at 14:51
Text containing only ASCII characters (in the range 0..127) is valid ASCII, valid Latin-1, and valid UTF-8. — Keith Thompson, Apr 11 '14 at 15:24
@kormak, No, that won't work with arbitrary encodings, just with encodings where you don't have to guess based on content. — ikegami, Apr 11 '14 at 16:09
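
The answer linked in ikegami's comment is not reproduced on this page. One common way to avoid guessing between these two encodings, sketched here as an assumption rather than as that answer's actual code, is to attempt a strict UTF-8 decode and fall back to ISO-8859-1 if it fails. The helper name decode_utf8_or_latin1 and the sample byte strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: strict UTF-8 first, ISO-8859-1 as the fallback.
sub decode_utf8_or_latin1 {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops decode() from modifying $bytes in place.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return ($text, 'UTF-8') if defined $text;

    # Every byte sequence is valid ISO-8859-1, so this decode cannot fail.
    return (decode('iso-8859-1', $bytes), 'iso-8859-1');
}

# Made-up sample data: the same word as Latin-1 bytes and as UTF-8 bytes.
for my $bytes ("caf\xE9", "caf\xC3\xA9") {
    my ($text, $used) = decode_utf8_or_latin1($bytes);
    printf "%d characters, decoded as %s\n", length($text), $used;
}
```

This works for this pair only because ISO-8859-1 accepts any input, which is also why a guess between the two cannot rule it out.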

Borodin · Accepted Answer · 2014-04-11T14:46:11.547


I normally check the possible encodings one at a time, like this:

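use Encode::Guess;    # plain import: no suspect list on the use line

# Try UTF-8 first; fall back to ISO-8859-1 only if that guess fails.
my $decoder = guess_encoding($data, 'utf8');
$decoder = guess_encoding($data, 'iso-8859-1') unless ref $decoder;
die $decoder unless ref $decoder;    # on failure, $decoder holds an error string

printf "Decoding as %s\n\n", $decoder->name;
$data = $decoder->decode($data);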

If possible it chooses UTF-8; otherwise it tries ISO-8859-1, and either chooses that or fails. Each call is therefore a simple yes/no test for one encoding, so it can never come back with two possible results (which is the error you're getting).
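
As a rough, self-contained illustration of the two-step check described above, the sketch below runs it over three made-up byte strings. The helper name guess_utf8_then_latin1 is hypothetical. It assumes a plain `use Encode::Guess;`, since a suspect list on the `use` line would apply to every later call.

```perl
use strict;
use warnings;
use Encode::Guess;    # plain import: only the default suspects are active

# Hypothetical helper wrapping the two-step check.
sub guess_utf8_then_latin1 {
    my ($bytes) = @_;
    my $decoder = guess_encoding($bytes, 'utf8');
    $decoder = guess_encoding($bytes, 'iso-8859-1') unless ref $decoder;
    die "Can't guess: $decoder" unless ref $decoder;
    return $decoder;
}

# Made-up sample data: UTF-8 bytes, Latin-1 bytes, and plain ASCII.
for my $bytes ("caf\xC3\xA9", "caf\xE9", "cafe") {
    my $decoder = guess_utf8_then_latin1($bytes);
    printf "Decoding as %s\n", $decoder->name;
}
```

Pure ASCII input is reported as ascii rather than as either candidate, because ascii is one of the module's default suspects.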

edited Apr 11 '14 at 14:46

answered Apr 11 '14 at 14:34

Borodin


Thanks, that worked perfectly! Your explanation makes things clearer for me now. :) – kormak Apr 11 '14 at 14:46

@kormak: I'm glad to help. You may want to be careful though, as this way it just uses UTF-8 if the encoding is ambiguous. That may not be the right thing to do in your situation, and perhaps there are some other cues you could check. – Borodin Apr 11 '14 at 14:49
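
To make the caveat in the comment above concrete, here is a small sketch with made-up data: five Latin-1 characters whose bytes also happen to form valid UTF-8. The UTF-8-first check picks utf8 for them, even though the text was produced as ISO-8859-1.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Encode::Guess;

# Made-up example: "caf" followed by U+00C3 and U+00A9, encoded as Latin-1.
# The resulting bytes 63 61 66 C3 A9 are also the UTF-8 encoding of "caf" + U+00E9.
my $bytes = encode('iso-8859-1', "caf\x{C3}\x{A9}");

my $decoder = guess_encoding($bytes, 'utf8');
$decoder = guess_encoding($bytes, 'iso-8859-1') unless ref $decoder;
die $decoder unless ref $decoder;

my $guessed  = $decoder->decode($bytes);
my $intended = decode('iso-8859-1', $bytes);

printf "guessed %s: %d characters (intended text has %d)\n",
    $decoder->name, length($guessed), length($intended);
```

Nothing in the bytes themselves distinguishes the two readings, so any tie-breaker has to come from outside the data, such as a declared charset or knowledge of the source.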
