How to detect latin1 and UTF-8?

Question

I am extracting strings from an XML file, and even though it should be pure UTF-8, it is not. My idea was to encode each string as UTF-8 and compare the results:

#!/usr/bin/perl
use warnings;
use strict;
use Encode qw(decode encode);
use Data::Dumper;

my $x = "m\x{e6}gtig";
my $y = "m\x{c3}\x{a6}gtig";

my $a = encode('UTF-8', $x);
my $b = encode('UTF-8', $y);

print Dumper $x;
print Dumper $y;
print Dumper $a;
print Dumper $b;

if ($x eq $y) { print "1\n"; }
if ($x eq $a) { print "2\n"; }
if ($a eq $y) { print "3\n"; }
if ($a eq $b) { print "4\n"; }
if ($x eq $b) { print "5\n"; }
if ($y eq $b) { print "6\n"; }

outputs

$VAR1 = 'm�gtig';
$VAR1 = 'mægtig';
$VAR1 = 'mægtig';
$VAR1 = 'mÃ¦gtig';
3

I assumed that only a latin1 string would get longer when encoded, but encoding a string that is already UTF-8 also makes it longer. So I can't tell latin1 from UTF-8 that way.

Question

I would like to always end up with a UTF-8 string, but how can I detect whether it is latin1 or UTF-8, so that I only convert the latin1 strings?

Being able to get a yes/no if a string is UTF-8 would be just as useful.

Do you want a solution that guesses the correct charset, or do you want something accurate? Because the latter is not possible. — deviantfan, Apr 04 '14 at 16:36
If it is not possible to do it accurately, then guessing is better than nothing =) — Jasmine Lognnes, Apr 04 '14 at 16:37
@deviantfan, Guessing is very accurate. See the footnote in my answer. — ikegami, Apr 04 '14 at 18:07
@ikegami: It's still guessing. I'm not saying this is bad, but that won't change the fact. — deviantfan, Apr 04 '14 at 19:14
@deviantfan, You seem to have misread something. I never said it wasn't guessing. — ikegami, Apr 04 '14 at 19:16
@ikegami: I'm not pretending anything? I didn't mean it in any bad way, if you understood it so. — deviantfan, Apr 04 '14 at 19:16
Can't you avoid all this by going back to whoever is supplying you with this data and asking them to provide valid UTF8? — Dave Cross, Apr 05 '14 at 08:46

ikegami · Accepted Answer · 2014-04-04T19:44:11.517

Due to some properties of UTF-8, it's very unlikely that text encoded using iso-8859-1 would be valid UTF-8 unless it decodes identically using both encodings^[1].

As such, the solution is to try decoding it using UTF-8. If it fails, decode it using iso-8859-1 instead. Since decoding using iso-8859-1 is a no-op, I'll be skipping that step.

utf8:: implementation:

my $decoded_text = $utf8_or_latin1;
utf8::decode($decoded_text);

Encode:: implementation:

use Encode qw( decode_utf8 );

my $decoded_text =
   eval { decode_utf8($utf8_or_latin1, Encode::FB_CROAK|Encode::LEAVE_SRC) }
      // $utf8_or_latin1;

Now, you say you want UTF-8. UTF-8 is obtained from encoding decoded text.

utf8:: implementation:

my $utf8 = $decoded_text;
utf8::encode($utf8);

Encode:: implementation:

use Encode qw( encode_utf8 );

my $utf8 = encode_utf8($decoded_text);
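Putting the two steps together, a minimal sketch might look like the following. The name to_utf8 is a hypothetical helper, not part of Encode, and the explicit iso-8859-1 decode is included only to make the fallback visible; it should be equivalent to using the bytes as they are.

```perl
use strict;
use warnings;
use Encode qw( decode encode_utf8 );

# Hypothetical helper: takes bytes that are assumed to be either UTF-8
# or iso-8859-1 and returns UTF-8 bytes.
sub to_utf8 {
    my ($bytes) = @_;

    # Try strict UTF-8 first. FB_CROAK makes malformed input die,
    # LEAVE_SRC keeps $bytes unmodified.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };

    # Fall back to iso-8859-1, where every byte maps to the code point
    # with the same number.
    $text = decode('iso-8859-1', $bytes) if !defined $text;

    return encode_utf8($text);
}

my $from_latin1 = to_utf8("m\x{e6}gtig");          # latin1 input
my $from_utf8   = to_utf8("m\x{c3}\x{a6}gtig");    # UTF-8 input
print $from_latin1 eq $from_utf8 ? "same\n" : "different\n";
```

Both inputs from the question should come out as the same UTF-8 bytes, because the already valid UTF-8 string is decoded before it is encoded again.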

Notes

Assuming the text is either valid UTF-8 or valid iso-8859-1, my solution would only guess wrong if all of the following are true:
- The text is encoded using iso-8859-1 (as opposed to UTF-8),
- At least one of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿
  ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
  àáâãäåæçèéêëìíîïðñòóôõö÷
  ] is present,
- All instances of [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß] are followed by one of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- All instances of [àáâãäåæçèéêëìíîï] are followed by two of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- All instances of [ðñòóôõö÷] are followed by three of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- None of [øùúûüýþÿ] are present, and
- None of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿
  ] are present except where previously mentioned.
(<80>..<9F> are unassigned or unprintable control characters, not sure which.)

In other words, that code is very reliable.
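As a rough illustration of the conditions above, the sketch below shows one input that is detected correctly and one of the rare inputs that would be guessed wrong. looks_like_utf8 is a hypothetical name; it relies on utf8::decode returning false when the bytes are not well-formed UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the bytes are well-formed UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;    # a copy, because utf8::decode modifies in place
    return utf8::decode($bytes) ? 1 : 0;
}

my @samples = (
    # E6 would need two continuation bytes, but "g" follows: not UTF-8.
    [ 'latin1 m<E6>gtig',      "m\x{e6}gtig"       ],
    # C3 A6 is a well-formed two-byte sequence.
    [ 'UTF-8  m<C3><A6>gtig',  "m\x{c3}\x{a6}gtig" ],
    # Latin1 text consisting of C3 followed by A6 is byte-for-byte
    # identical to UTF-8, so it is guessed to be UTF-8.
    [ 'latin1 <C3><A6> alone', "\x{c3}\x{a6}"      ],
);

for my $sample (@samples) {
    my ($label, $bytes) = @$sample;
    printf "%-24s %s\n", $label,
        looks_like_utf8($bytes) ? 'treated as UTF-8' : 'treated as iso-8859-1';
}
```

The third sample is the kind of input the list describes: a character from the À..ß range followed by one from the <80>..¿ range, with nothing else that would break UTF-8 validity.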

How come encoding utf8 to utf8 doesn't trash it? It does in my OP. — Jasmine Lognnes, Apr 04 '14 at 17:24
In your example it works, and I don't understand why, since it fails in mine. — Jasmine Lognnes, Apr 04 '14 at 17:32
I don't encode UTF-8 bytes using UTF-8. I don't encode UTF-8 bytes, period. I encode decoded text (Unicode code points) using UTF-8. — ikegami, Apr 04 '14 at 17:40
@ikegami, hi, could you please elaborate on why decoding using iso-8859-1 is a no-op? Why not simply add the following to your Encode:: implementation: `my $decoded_text = eval { ... } // decode ("iso-8859-1", $utf8_or_latin1);`? Thanks — n.r., Aug 07 '22 at 12:27
@n.r. Re "*why decoding using iso-8859-1 is a no-op?*", Because Unicode is an extension of iso-8859-1. Specifically, iso-8859-1 0 is Code Point 0, 1 is 1, 2 is 2, ..., and FF is FF. — ikegami, Aug 07 '22 at 16:14
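
The two points made in the comments above can be checked with a small sketch: decoding iso-8859-1 leaves every byte value unchanged, and encoding bytes that are already UTF-8 (rather than decoded text) is what makes the string longer in the question. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Every possible byte value, 00 through FF.
my $bytes = join '', map { chr } 0x00 .. 0xFF;

# Decoding as iso-8859-1 maps byte N to code point N, so the decoded
# string compares equal to the original bytes.
my $text = decode('iso-8859-1', $bytes);
print $text eq $bytes ? "identical\n" : "different\n";

# Encoding UTF-8 bytes a second time treats C3 and A6 as the code points
# U+00C3 and U+00A6, each of which takes two bytes in UTF-8.
my $utf8_bytes = "m\x{c3}\x{a6}gtig";
my $twice      = encode('UTF-8', $utf8_bytes);
printf "%d bytes became %d bytes\n", length $utf8_bytes, length $twice;
```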
