I am extracting strings from an XML file, and even though it should be pure UTF-8, it is not. My idea was to encode the strings and compare the results. This test script

#!/usr/bin/perl
use warnings;
use strict;
use Encode qw(decode encode);
use Data::Dumper;

my $x = "m\x{e6}gtig";
my $y = "m\x{c3}\x{a6}gtig";

my $a = encode('UTF-8', $x);
my $b = encode('UTF-8', $y);

print Dumper $x;
print Dumper $y;
print Dumper $a;
print Dumper $b;

if ($x eq $y) { print "1\n"; }
if ($x eq $a) { print "2\n"; }
if ($a eq $y) { print "3\n"; }
if ($a eq $b) { print "4\n"; }
if ($x eq $b) { print "5\n"; }
if ($y eq $b) { print "6\n"; }

outputs

$VAR1 = 'm�gtig';
$VAR1 = 'mægtig';
$VAR1 = 'mægtig';
$VAR1 = 'mægtig';
3

My assumption was that only a latin1 string would get longer when encoded, but encoding a string that is already UTF-8 also makes it longer. So I can't tell latin1 from UTF-8 that way.
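
To make the length observation concrete, here is a minimal sketch that reuses the two strings from the script above. The variable names are only illustrative. `encode` treats every element of its input as a code point, so any byte at or above 0x80 becomes two bytes no matter which encoding the input bytes were in.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $latin1_bytes = "m\x{e6}gtig";          # "mægtig" as iso-8859-1 bytes
my $utf8_bytes   = "m\x{c3}\x{a6}gtig";    # "mægtig" as UTF-8 bytes

# Both inputs contain bytes >= 0x80, so both grow when encoded.
my $from_latin1 = encode('UTF-8', $latin1_bytes);
my $from_utf8   = encode('UTF-8', $utf8_bytes);

printf "%d -> %d\n", length($latin1_bytes), length($from_latin1);  # 6 -> 7
printf "%d -> %d\n", length($utf8_bytes),   length($from_utf8);    # 7 -> 9
```

The second result is double-encoded text, which is why a length comparison cannot separate the two cases.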

Question

I would like to end up with always UTF-8 string, but how can I detect if it is latin1 or UTF-8, so I only convert the latin1 string?

Being able to get a yes/no if a string is UTF-8 would be just as useful.

Jasmine Lognnes

1 Answer

Due to some properties of UTF-8, it's very unlikely that text encoded using iso-8859-1 would be valid UTF-8 unless it decodes identically using both encodings[1].

As such, the solution is to try decoding it using UTF-8. If it fails, decode it using iso-8859-1 instead. Since decoding using iso-8859-1 is a no-op, I'll be skipping that step.

  • utf8:: implementation:

    my $decoded_text = $utf8_or_latin1;
    utf8::decode($decoded_text);
    
  • Encode:: implementation:

    use Encode qw( decode_utf8 );
    
    my $decoded_text =
       eval { decode_utf8($utf8_or_latin1, Encode::FB_CROAK|Encode::LEAVE_SRC) }
          // $utf8_or_latin1;
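
If a plain yes/no is all that is needed, the return value of `utf8::decode` can serve as the test. The following is a rough sketch; `looks_like_utf8` is a hypothetical helper name, and the check follows Perl's own (fairly permissive) idea of well-formed UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the bytes decode cleanly as UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # utf8::decode works in place, so test a copy
    return utf8::decode($copy) ? 1 : 0;
}

print looks_like_utf8("m\x{c3}\x{a6}gtig") ? "yes\n" : "no\n";   # UTF-8 bytes
print looks_like_utf8("m\x{e6}gtig")       ? "yes\n" : "no\n";   # iso-8859-1 bytes
```

The first call prints "yes" and the second "no", because E6 followed by "g" is not a valid UTF-8 sequence.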
    

Now, you say you want UTF-8. UTF-8 is obtained from encoding decoded text.

  • utf8:: implementation:

    my $utf8 = $decoded_text;
    utf8::encode($utf8);
    
  • Encode:: implementation:

    use Encode qw( encode_utf8 );
    
    my $utf8 = encode_utf8($decoded_text);
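
Putting the two steps together, a minimal sketch could look like this. `to_utf8` is a hypothetical name, and the sketch assumes the input is always either UTF-8 or iso-8859-1 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: takes UTF-8 or iso-8859-1 bytes, returns UTF-8 bytes.
sub to_utf8 {
    my ($bytes) = @_;
    my $text = $bytes;
    # On success $text now holds decoded code points. On failure it is left
    # untouched, which already equals its iso-8859-1 decoding.
    utf8::decode($text);
    utf8::encode($text);
    return $text;
}

my $latin1 = "m\x{e6}gtig";
my $utf8   = "m\x{c3}\x{a6}gtig";

print to_utf8($latin1) eq $utf8 ? "same\n" : "different\n";
print to_utf8($utf8)   eq $utf8 ? "same\n" : "different\n";
```

Both lines print "same": the latin1 input is converted, and the input that was already UTF-8 comes back unchanged instead of being double-encoded.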
    

Notes

  1. Assuming the text is either valid UTF-8 or valid iso-8859-1, my solution would only guess wrong if all of the following are true:

    • The text is encoded using iso-8859-1 (as opposed to UTF-8),
    • At least one of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿
      ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
      àáâãäåæçèéêëìíîïðñòóôõö÷
      ] is present,
    • All instances of [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß] are followed by one of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • All instances of [àáâãäåæçèéêëìíîï] are followed by two of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • All instances of [ðñòóôõö÷] are followed by three of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • None of [øùúûüýþÿ] are present, and
    • None of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿
      ] are present except where previously mentioned.

    (<80>..<9F> are unprintable C1 control characters in the IANA iso-8859-1 charset; the ISO/IEC 8859-1 standard itself leaves them unassigned.)

    In other words, that code is very reliable.
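
For illustration, here is a rough sketch of one input that does satisfy all of those conditions: the iso-8859-1 text "Ã©" is the byte pair C3 A9, which is also the UTF-8 encoding of "é". The variable names are only illustrative.

```perl
use strict;
use warnings;

# iso-8859-1 "Ã©" and UTF-8 "é" are the same two bytes.
my $latin1_bytes = "\x{c3}\x{a9}";

my $decoded = $latin1_bytes;
my $ok      = utf8::decode($decoded);

# Prints: decoded as UTF-8: yes, length 1, code point U+00E9
printf "decoded as UTF-8: %s, length %d, code point U+%04X\n",
    ($ok ? "yes" : "no"), length($decoded), ord($decoded);
```

Real iso-8859-1 text rarely contains such sequences, which is what makes the guess dependable in practice.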

ikegami
  • How come encoding utf8 to utf8 doesn't trash it? It does in my OP. – Jasmine Lognnes Apr 04 '14 at 17:24
  • In your example it works, and I don't understand why, as it fails in mine. – Jasmine Lognnes Apr 04 '14 at 17:32
  • I don't encode UTF-8 bytes using UTF-8. I don't encode UTF-8 bytes, period. I encode decoded text (Unicode code points) using UTF-8. – ikegami Apr 04 '14 at 17:40
  • @ikegami, hi, could you please elaborate on why decoding using iso-8859-1 is a no-op? Why not simply add the following to your Encode:: implementation: `my $decoded_text = eval { ... } // decode("iso-8859-1", $utf8_or_latin1);`? Thanks – n.r. Aug 07 '22 at 12:27
  • @n.r. Re "*why decoding using iso-8859-1 is a no-op?*", Because Unicode is an extension of iso-8859-1. Specifically, iso-8859-1 0 is Code Point 0, 1 is 1, 2 is 2, ..., and FF is FF. – ikegami Aug 07 '22 at 16:14
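
The no-op claim can be checked with a short sketch that decodes every possible byte value as iso-8859-1 and compares the result with the input. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A string holding every byte value from 00 to FF.
my $bytes = join '', map { chr } 0x00 .. 0xFF;

# Byte N decodes to code point N, so the text compares equal to the bytes.
my $text = decode('iso-8859-1', $bytes);

print $text eq $bytes ? "identical\n" : "different\n";
```

This prints "identical", which is why falling back to the undecoded string gives the same text as an explicit iso-8859-1 decode.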