
I have a file with one phrase or term per line, which I read into Perl from STDIN. I have a list of stopwords (like "á", "são", "é"), and I want to compare each of them with each term and remove the ones that match. The problem is that I'm not certain of the file's encoding.

I get this from the file command:

words.txt: Non-ISO extended-ASCII English text

My Linux terminal is set to UTF-8, and it shows the right content for some words but not for others. Here is the output for some of them:

condi<E3>
conte<FA>dos
ajuda, mas não resolve
mo<E7>ambique
pedagógico são fenómenos

You can see that the 3rd and 5th lines correctly display words with accents and special characters, while the others don't. The correct output for the other lines would be: condiã, conteúdos and moçambique.

If I use binmode(STDOUT, ":utf8"), the "incorrect" lines now output correctly while the other ones don't. For example, the 3rd line:

ajuda, mas não resolve

What should I do?


2 Answers


I strongly suggest you create a filter that takes a file with lines in mixed encodings and translates them to pure UTF-8. Then instead of

open(INPUT, "< badstuff.txt") || die "open failed: $!";

you would open either the fixed version, or a pipe from the fixer, like:

open(INPUT, "fixit < badstuff.txt |") || die "open failed: $!";

In either event, you would then

binmode(INPUT, ":encoding(UTF-8)") || die "binmode failed";

Then the fixit program could just do this:

use strict;
use warnings;
use Encode qw(decode FB_CROAK);

binmode(STDIN,  ":raw")  || die "can't binmode STDIN";
binmode(STDOUT, ":utf8") || die "can't binmode STDOUT";

while (my $line = <STDIN>) {
    $line = eval { decode("UTF-8", $line, FB_CROAK()) };
    if ($@) { 
        $line = decode("CP1252", $line, FB_CROAK()); # no eval{}!
    }
    $line =~ s/\R\z/\n/;  # fix raw mode reads
    print STDOUT $line;    
}

close(STDIN)  || die "can't close STDIN: $!";
close(STDOUT) || die "can't close STDOUT: $!";
exit 0;

See how that works? Of course, you could change it to default to some other encoding, or have multiple fallbacks. It would probably be best to take a list of them in @ARGV.

  • Very good point to decode from a specific encoding when decoding from UTF-8 fails. So you don't end up with a mixture of Unicode and legacy strings, but homogenize everything to Unicode. – Lumi May 05 '11 at 21:39

It works like this:

C:\Dev\Perl :: chcp
Aktive Codepage: 1252.

C:\Dev\Perl :: type mixed-encoding.txt
eins zwei drei Käse vier fünf Wurst
eins zwei drei Käse vier fünf Wurst

C:\Dev\Perl :: perl mixed-encoding.pl < mixed-encoding.txt
eins zwei drei vier fünf
eins zwei drei vier fünf

Where mixed-encoding.pl goes like this:

use strict;
use warnings;
use utf8; # source in UTF-8
use Encode 'decode_utf8';
use List::MoreUtils 'any';

my @stopwords = qw( Käse Wurst );

while ( <> ) { # read octets
    chomp;
    my @tokens;
    for ( split /\s+/ ) {
        # Try UTF-8 first. If that fails, assume legacy Latin-1.
        my $token = eval { decode_utf8 $_, Encode::FB_CROAK };
        $token = $_ if $@;
        push @tokens, $token unless any { $token eq $_ } @stopwords;
    }
    print "@tokens\n";
}

Note that the script doesn't have to be encoded in UTF-8. The point is that if you have non-ASCII character data in your script, you have to make sure the declared encoding matches the actual one: use utf8 if your source is UTF-8, and don't if it isn't.

Update based on tchrist's sound advice:

use strict;
use warnings;
# source in Latin1
use Encode 'decode';
use List::MoreUtils 'any';

my @stopwords = qw( Käse Wurst );

while ( <> ) { # read octets
    chomp;
    my @tokens;
    for ( split /\s+/ ) {
        # Try UTF-8 first. If that fails, assume 8-bit encoding.
        my $token = eval { decode utf8 => $_, Encode::FB_CROAK };
        $token    = decode Windows1252 => $_, Encode::FB_CROAK if $@;
        push @tokens, uc $token unless any { $token eq $_ } @stopwords;
    }
    print "@tokens\n";
}
  • @michael Thanks, now it's outputting correctly ;) I realized that the majority of the file is in ISO-8859-1 and some parts are in UTF-8 (that's why some of them were outputting correctly). One more thing: I have to use the `lc` function because my stopwords are all lower-case, and I'm having problems when the phrases are not UTF-8. In those situations, if I have an upper-case letter with an accent, it won't be lower-cased. – Barata May 05 '11 at 18:58
  • 2
    @Barata: You still have to decode the non-UTF8 strings if you want `uc` etc to work on them. The Perl 5.12 (and above) `unicode_strings` feature may also help, in that it will assume ISO 8859-1 for byte strings. Compare: `perl -e 'print uc("\xB5\xE9\xDF")'` => `µéß`, **which is wrong,** with `perl -M5.012 -e 'print uc("\xB5\xE9\xDF")'` => `ΜÉSS` **which is right.** The last string is really `"\x{39C}\x{C9}SS"` or `"\N{GREEK CAPITAL LETTER MU}\N{LATIN CAPITAL LETTER E WITH ACUTE}SS"`. The original string is `"\N{MICRO SIGN}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER SHARP S}"`. – tchrist May 05 '11 at 19:09
  • @tchrist Using Michael's code, is checking `if $@` and decoding the string as ISO-8859-1 enough? – Barata May 05 '11 at 19:18
  • @Barata: Yes, probably. But if you are processing a file that came from a Microsoft system, you probably should assume CP1252, which is a superset of ISO-8859-1. See my solution. – tchrist May 05 '11 at 19:24
  • You can’t actually do it this way for general files of mixed encoding. That’s because you won’t even find whitespace correctly if you don’t know the encoding, let alone read lines in one at a time. Imagine what happens if parts are in UTF-16, for example. – tchrist May 05 '11 at 22:42
  • Thanks, I'm croaking all over the place now. - You're certainly correct about this not working with arbitrary input. Chances are, however, that the OP's situation is mostly a UTF-8/Latin1 salad. So it does the job. – Lumi May 05 '11 at 22:54