I have a string $data, encoded in UTF-8. Suppose I don't know whether this string is UTF-8 or ISO-8859-1. I want to use Perl's Encode::Guess module to tell which one it is, but I'm having trouble figuring out how the module works.

I have tried the following four methods (from http://perldoc.perl.org/Encode/Guess.html):

use Encode::Guess qw/utf8 latin1/;

my $decoder = guess_encoding($data);

print "$decoder\n";

Result: iso-8859-1 or utf8

use Encode::Guess qw/utf8 latin1/;

my $enc = guess_encoding($data, qw/utf8 latin1/);
ref($enc) or die "Can't guess: $enc";
my $utf8 = $enc->decode($data); 

print "$utf8\n";

Result: Can't guess: iso-8859-1 or utf8 at encodage-windows.pl line 25, line 18110.

use Encode::Guess qw/utf8 latin1/;

my $decoder = Encode::Guess->guess($data);
die $decoder unless ref($decoder);
my $utf8 = $decoder->decode($data);

print "$utf8\n";

Result: iso-8859-1 or utf8 at encodage-windows.pl line 30, line 18110.

use Encode::Guess qw/utf8 latin1/;

my $utf8 = Encode::decode("Guess", $data);

print "$utf8\n";

Result: iso-8859-1 or utf8 at /usr/local/lib/perl5/Encode.pm line 175.

My first question is: which one of these methods am I supposed to use (if any)? And my second question: what changes should I make to make this work?

kormak
  • Using Encode::Guess is overkill. See http://stackoverflow.com/a/22868803/589924 – ikegami Apr 11 '14 at 14:30
  • @ikegami Is it overkill in the case of utf-8 versus latin1, or overkill in general? It seemed more straightforward to use a module than to try to decode it, but I could be mistaken. – kormak Apr 11 '14 at 14:51
  • Text containing only ASCII characters (in the range 0..127) is valid ASCII, valid Latin-1, and valid UTF-8. – Keith Thompson Apr 11 '14 at 15:24
  • @kormak, No, that won't work with arbitrary encodings, just with encodings where you don't have to guess based on content. – ikegami Apr 11 '14 at 16:09
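The comments above suggest that, when UTF-8 and ISO-8859-1 are the only candidates, no guessing module is needed. A minimal sketch of that idea, assuming strict UTF-8 validation is an acceptable test (the helper name `decode_with_fallback` is hypothetical, not part of Encode):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode as UTF-8 when the bytes are valid UTF-8,
# otherwise treat them as ISO-8859-1 (which accepts any byte sequence).
sub decode_with_fallback {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops decode() from modifying $bytes in place.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;

    return decode('iso-8859-1', $bytes);
}

# "naïve" as UTF-8 bytes, as ISO-8859-1 bytes, and a pure-ASCII string.
for my $bytes ("na\xC3\xAFve", "na\xEFve", "naive") {
    my $text = decode_with_fallback($bytes);
    printf "%d bytes -> %d characters\n", length($bytes), length($text);
}
```

ASCII-only input decodes to the same text under either encoding, so the order of the two attempts only matters for strings containing bytes above 127.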

1 Answer

I normally check the possible encodings one at a time, like this:

use Encode::Guess;    # no import list: suspects named there are tried on every call

my $decoder = guess_encoding($data, 'utf8');
$decoder = guess_encoding($data, 'iso-8859-1') unless ref $decoder;
die $decoder unless ref $decoder;

printf "Decoding as %s\n\n", $decoder->name;
$data = $decoder->decode($data);

This chooses UTF-8 if the data is valid UTF-8; otherwise it tries ISO-8859-1 and either chooses that or dies. Each call is a simple yes/no test for one encoding, so it can never come back with two possible results, which is the error you're getting. (Any byte sequence is valid ISO-8859-1, so when both encodings are suspects at once, non-ASCII UTF-8 data matches both.)
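As a self-contained sketch of the same approach, the snippet below wraps the two-step guess in a helper and runs it on two sample byte strings. The helper name `decode_utf8_or_latin1` and the sample data are assumptions made for illustration, and `Encode::Guess` is loaded without an import list so that only its default suspects apply.

```perl
use strict;
use warnings;
use Encode::Guess;    # default suspects only; extra ones are passed per call

# Hypothetical helper: try UTF-8 first, then fall back to ISO-8859-1.
# Returns the decoded text and the name of the encoding that was chosen.
sub decode_utf8_or_latin1 {
    my ($bytes) = @_;

    # guess_encoding() returns an encoding object on success
    # and a plain error string on failure, hence the ref() checks.
    my $decoder = guess_encoding($bytes, 'utf8');
    $decoder = guess_encoding($bytes, 'iso-8859-1') unless ref $decoder;
    die "Can't guess: $decoder" unless ref $decoder;

    return ($decoder->decode($bytes), $decoder->name);
}

my $utf8_bytes   = "na\xC3\xAFve";    # "naïve" encoded as UTF-8
my $latin1_bytes = "na\xEFve";        # "naïve" encoded as ISO-8859-1

my ($text1, $name1) = decode_utf8_or_latin1($utf8_bytes);
my ($text2, $name2) = decode_utf8_or_latin1($latin1_bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "Decoded '$text1' as $name1\n";
print "Decoded '$text2' as $name2\n";
```

Both inputs should decode to the same five-character string; only the reported encoding name differs.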

Borodin
  • Thanks, that worked perfectly! Your explanation makes things clearer for me now. :) – kormak Apr 11 '14 at 14:46
  • @kormak: I'm glad to help. You may want to be careful though, as this way it just uses UTF-8 if the encoding is ambiguous. That may not be the right thing to do in your situation, and perhaps there are some other cues you could check. – Borodin Apr 11 '14 at 14:49