Proper handling of UTF-8 in Perl

Question

I have been given a file that is (probably) encoded in Latin-1 (ISO 8859-1), and I need to do some conversions and data mining on it. The output is supposed to be in UTF-8. I have tried just about everything I could find about encoding conversion in Perl, but none of it produced usable output.

I know that use utf8; does not help here, since it only declares the encoding of the source code itself. I have tried the Encode package, which looked promising:

open FILE, '<', $ARGV[0] or die $!;

my %tmp = ();
my $last_num = 0;

while (<FILE>) {
    $_ = decode('ISO-8859-1', encode('UTF-8', $_));

    chomp;
    next unless length;
    process($_);
}

I tried that in every combination I could think of, and also threw in binmode(STDOUT, ":utf8");, open FILE, '<:encoding(ISO-8859-1)', $ARGV[0] or die $!; and much more. The results were either scrambled umlauts, an error message like \xC3 is not a valid UTF-8 character, or mixed text (some in UTF-8, some in Latin-1).

All I want is a simple way to read a Latin-1 text file and produce UTF-8 output on the console via print. Is there a simple way to do that in Perl?

Perl doesn't know how to work with UTF normally :( – gaussblurinc Aug 03 '12 at 09:28

daxim · Accepted Answer · 2012-08-04T20:29:14.680

6

See Perl encoding introduction and the Unicode cookbook.

Easiest with piconv:

$ piconv -f Latin1 -t UTF-8 < input.file > output.file

Easy, with encoding layers:

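use autodie qw(:all);
# The input layer decodes Latin-1 bytes into characters as they are read.
open my $input, '<:encoding(Latin1)', $ARGV[0];
# The output layer encodes characters as UTF-8 as they are printed.
binmode STDOUT, ':encoding(UTF-8)';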

Moderately easy, with manual decoding and encoding:

use Encode qw(decode encode);
use autodie qw(:all);

open my $input, '<:raw', $ARGV[0];
binmode STDOUT, ':raw';
while (my $raw = <$input>) {
    my $line = decode 'Latin1', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC;
    my $result = process($line);
    print {STDOUT} encode 'UTF-8', $result, Encode::FB_CROAK | Encode::LEAVE_SRC;
}

edited Aug 04 '12 at 20:29

answered Aug 03 '12 at 09:31

daxim


The only problem you'll have with daxim's approach is if the file is not in fact in Latin1. Files in a mix of encodings are a nightmare to deal with no matter what you do, unfortunately. – Richard Huxton Aug 03 '12 at 09:44
@RichardHuxton Is there any chance of dealing with those? I suspect some of the data I have been given is in mixed encodings. – Lanbo Aug 03 '12 at 15:17

There's Encode::Guess, but I'm afraid it's almost impossible to tell many of the 8-bit character sets apart without knowing ahead of time what the content is. For example, 8859-15 has the Euro symbol, so financial information with lots of codepoint 0xA4 is probably that rather than 8859-1. Likewise, some Welsh accented characters are in 8859-14. Without knowing what the text means, though, it's very hard work. That's without getting on to Microsoft Word "smart quotes" cropping up where people have cut and pasted from Word. – Richard Huxton Aug 03 '12 at 15:36

If you are going to decode things yourself, you'd better be sure you're reading the raw byte stream. In this case, you've left the default decoding to whatever that file read decides to do, which can be influenced from far away. The same thing goes for the output. You have to be sure that there's not something set up on STDOUT to encode what you are giving it. – brian d foy Aug 03 '12 at 16:42
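For the mixed-encoding case raised in the comments above, one common heuristic (not something the commenters proposed, and not a replacement for knowing the real encoding) is to try a strict UTF-8 decode on the raw bytes and fall back to a single-byte encoding when that fails. The sketch below assumes the data is a mix of UTF-8 and one known 8-bit encoding; decode_guessing is a hypothetical helper name, and the input must be read through a ':raw' handle so that no layer has touched the bytes first.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode one chunk of raw bytes, preferring strict UTF-8
# and falling back to a single-byte encoding when the bytes are not valid UTF-8.
sub decode_guessing {
    my ($raw, $fallback) = @_;
    # Assumption: Latin-1 is the fallback; pass 'cp1252' if "smart quotes" show up.
    $fallback //= 'ISO-8859-1';
    # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps $raw intact.
    my $text = eval { decode('UTF-8', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC) };
    return defined $text ? $text : decode($fallback, $raw);
}

my $from_utf8   = decode_guessing("\xC3\xA4");        # a-umlaut as UTF-8 bytes
my $from_latin1 = decode_guessing("\xE4");            # a-umlaut as a Latin-1 byte
my $from_cp1252 = decode_guessing("\x93", 'cp1252');  # Word-style opening quote

printf "U+%04X U+%04X U+%04X\n", ord $from_utf8, ord $from_latin1, ord $from_cp1252;
```

The heuristic can misfire: a few Latin-1 byte sequences also happen to be valid UTF-8, and it cannot tell one 8-bit encoding from another, which is the limitation described in the comments.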

score 5 · Answer 2 · answered Aug 03 '12 at 08:50

Maybe like this:

$_ = encode('utf-8', decode('ISO-8859-1', $_));

The data below is encoded as GB2312, so this converts it to UTF-8:

#!/usr/bin/env perl

use Encode qw(encode decode);

while (<DATA>) {
    $_ = encode('utf-8', decode('gb2312', $_));
    print;
}

__DATA__
Â×¶Ø°ÂÔË»á

score 3 · Answer 3 · answered Aug 03 '12 at 10:39

$_ = decode('ISO-8859-1', encode('UTF-8', $_));

This line has two problems with it. Firstly you are encoding your input to UTF-8 and then decoding it from ISO-8859-1. These two operations are the wrong way round.
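To make the first problem concrete, here is a small sketch that runs both orders on a single Latin-1 byte (0xE4, "ä"). The variable names and the hex_of helper are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative helper: show each character (or byte) of a string as hex.
sub hex_of { return join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# One Latin-1 byte: 0xE4 is a-umlaut in ISO-8859-1.
my $latin1_bytes = "\xE4";

# Order from the question: encode() yields the bytes C3 A4, and decode()
# then turns each of those two bytes into a separate character.
my $wrong = decode('ISO-8859-1', encode('UTF-8', $latin1_bytes));

# What a UTF-8 output layer would write for those two characters.
my $wrong_on_output = encode('UTF-8', $wrong);

# Corrected order: decode the input bytes first, encode for output last.
my $right = encode('UTF-8', decode('ISO-8859-1', $latin1_bytes));

print 'wrong order, characters:      ', hex_of($wrong), "\n";            # C3 A4
print 'wrong order, bytes on output: ', hex_of($wrong_on_output), "\n";  # C3 83 C2 A4
print 'right order, bytes on output: ', hex_of($right), "\n";            # C3 A4
```

The four bytes in the middle line are the "scrambled umlaut": on a UTF-8 terminal they display as two characters instead of one.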

Secondly, you almost certainly don't want to decode and encode at the same time. The Golden Rule of handling character encodings in Perl is to follow this process:

1. Decode data as soon as you get it from the outside world. This takes your input bytestream and converts it into Perl's internal representation for character strings.
2. Process the data according to your requirements.
3. Encode the data just before sending it to the outside world. This takes Perl's internal representation for character strings and converts it to a correctly-encoded bytestream for your required output encoding.
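The three steps above can be sketched as follows. This is only an illustration: the process() here is a hypothetical stand-in for the question's function, and convert_line is a made-up name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the question's process(): it sees characters,
# so length() counts characters rather than bytes.
sub process {
    my ($text) = @_;
    return sprintf '%s (%d characters)', $text, length $text;
}

# Decode at the input boundary, work on characters, encode at the output boundary.
sub convert_line {
    my ($raw) = @_;
    my $line = decode('ISO-8859-1', $raw);   # 1. bytes in, characters out
    chomp $line;
    my $result = process($line);             # 2. work on characters only
    return encode('UTF-8', "$result\n");     # 3. characters in, bytes out
}

# The German word for "greetings" as Latin-1 bytes: 0xFC is u-umlaut, 0xDF is sharp s.
my $utf8 = convert_line("Gr\xFC\xDFe\n");

binmode STDOUT, ':raw';   # the data is already encoded, so no output layer is wanted
print $utf8;
```

Because the encoding is done by hand here, STDOUT is left without an encoding layer; adding one on top would encode the output a second time.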
