In the Date::Holidays::DK module, the names of certain Danish holidays are written in Latin1 encoding. For example, January 1st is 'Nytårsdag'. What should I do to $x below in order to get a proper utf8-encoded string?

use Date::Holidays::DK;
my $x = is_dk_holiday(2011,1,1);

I tried various combinations of use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect. I also tried to use Encode's decode, with no luck. More specifically,

use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1", 
           is_dk_holiday(2011,1,1)
          );
Dump($x);
print "January 1st is '$x'\n";

gives the output

SV = PV(0x15eabe8) at 0x1492a10
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1593710 "Nyt\303\245rsdag"\0 [UTF8 "Nyt\x{e5}rsdag"]
  CUR = 10
  LEN = 16
January 1st is 'Nyt sdag'

(with an invalid character between t and s).

Villemoes

2 Answers

> use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect.

Correct. The utf8 pragma only indicates that the source code of the program is written in UTF-8.

> I also tried to use Encode's decode, with no luck.

You misjudged the result: decoding was in fact the right thing to do. You now have a string of Perl characters and can manipulate it.

> with an invalid character between t and s

You are misinterpreting this as well: it is in fact the å character. It was printed as the single Latin-1 byte 0xE5, which a UTF-8 terminal cannot display.


You want to output UTF-8, so you are lacking the encoding step.

my $octets = encode 'UTF-8', $x;
print $octets;

Please read http://p3rl.org/UNI for an introduction to the topic of encoding. You must always decode and encode, either explicitly or implicitly.
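The decode, manipulate, encode cycle described above can be sketched as follows. This is only a minimal illustration: the hard-coded octet string is a stand-in for the return value of is_dk_holiday(2011,1,1), assumed here so that the sketch runs without Date::Holidays::DK installed, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: stand-in for is_dk_holiday(2011,1,1), i.e. the holiday
# name as Latin-1 octets (0xE5 is the a-ring).
my $latin1_octets = "Nyt\xE5rsdag";

# Decode on input: octets become Perl characters.
my $name = decode('ISO-8859-1', $latin1_octets);

# Inside the program, work with characters only.
my $char_count = length $name;              # 9 characters

# Encode on output: characters become UTF-8 octets.
my $name_octets = encode('UTF-8', $name);   # 10 octets, the a-ring takes two
print encode('UTF-8', "January 1st is '$name'\n");
```

The 10 octets match the CUR = 10 shown in the Devel::Peek dump in the question.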

daxim

use utf8 is only a hint to the perl interpreter/compiler that your source file is UTF-8 encoded. String literals containing bytes with the high bit set are then automatically decoded into Unicode characters.

If you have a variable that is encoded in iso-8859-1, you must decode it. Your variable is then in the internal Unicode format. That happens to be utf8, but you shouldn't care which encoding perl uses internally.

Now, if you want to print such a string, you need to convert the Unicode string back to a byte string by calling encode on it. If you don't encode it yourself, perl will write it out as iso-8859-1, which is the default as long as every character fits into a single byte.

Before you print your variable $x, you need to do a $x = encode('UTF-8', $x) on it.

For correct handling of UTF-8 you always need to decode() every external input over I/O. And you always need to encode() everything that leaves your program.
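The difference between printing with and without an encoding step can be observed with a small sketch like the one below. It is illustrative only: it writes to in-memory filehandles instead of STDOUT so that the resulting bytes can be inspected, and the hard-coded Latin-1 string again stands in for the value returned by is_dk_holiday.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: stand-in for the decoded holiday name.
my $name = decode('ISO-8859-1', "Nyt\xE5rsdag");

# Handle without any encoding layer: perl falls back to one byte
# per character (Latin-1) for this string.
my $raw_out = '';
open(my $fh_raw, '>', \$raw_out) or die "open: $!";
print {$fh_raw} $name;
close $fh_raw;

# Handle with a UTF-8 layer: characters are encoded on the way out.
my $utf8_out = '';
open(my $fh_utf8, '>:encoding(UTF-8)', \$utf8_out) or die "open: $!";
print {$fh_utf8} $name;
close $fh_utf8;

printf "no layer: %d bytes, UTF-8 layer: %d bytes\n",
    length $raw_out, length $utf8_out;
```

The first handle reproduces what happened in the question: the å goes out as the lone byte 0xE5, which is not valid UTF-8.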

To change the default input/output encoding you can use something like this.

use utf8;
use open ':encoding(UTF-8)';
use open ':std';

The first line says that your source code is encoded in utf8. The second line says that every input/output handle opened in this scope should automatically be encoded in utf8. It is important to notice that a plain open() then also opens the file in utf8 mode. If you work with binary files, you need to call binmode() on the handle.

However, the second line does not change the handling of STDIN, STDOUT, or STDERR. The third line takes care of that.
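The effect of the default layer on open(), and of binmode() for binary data, can be sketched roughly as follows. The temporary file and the variable names are assumptions made for the illustration, and the sketch deliberately leaves the standard handles alone.

```perl
use strict;
use warnings;
use open ':encoding(UTF-8)';      # default layer for open() in this scope
use File::Temp qw(tempfile);

# Assumption: a scratch file used only for this demonstration.
my (undef, $path) = tempfile(UNLINK => 1);

# Text file: the default layer encodes characters to UTF-8 on write.
open(my $text_fh, '>', $path) or die "open: $!";
print {$text_fh} "Nyt\x{E5}rsdag";
close $text_fh;
my $text_size = -s $path;         # 10 bytes, the a-ring takes two

# Binary file: binmode() removes the encoding layer again,
# so the byte is written unchanged.
open(my $bin_fh, '>', $path) or die "open: $!";
binmode($bin_fh);
print {$bin_fh} "\xE5";
close $bin_fh;
my $bin_size = -s $path;          # 1 byte

print "text: $text_size bytes, binary: $bin_size bytes\n";
```

Forgetting the binmode() call in the second half would silently turn the single byte into two, which is why it matters for binary files.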

You can probably use the module utf8::all, which makes this process easier. But it is always good to understand how all this works behind the scenes.

One possible way to correct your example is this:

#!/usr/bin/env perl
use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1", 
           is_dk_holiday(2011,1,1)
          );
Dump($x);
print encode("UTF-8", "January 1st is '$x'\n");
David Raab
  • I wish you'd remove the paragraph about is_utf8. – daxim Jul 14 '11 at 15:09
  • Do you know a better way to check if a string is internally encoded in unicode? Then I will replace it. – David Raab Jul 14 '11 at 15:12
  • ITYM to say "internally encoded in UTF-8 *encoding*" because something *encoded in a character set* such as Unicode does not make any sense. To answer: You shouldn't care, and the SvUTF8 flag or its absence cannot tell you (that's what is_utf8 actually checks). The programmer must only keep track of: Have I already decoded incoming octets? Have I already encoded outgoing character data? How Perl internally encodes character data is its own business (it's more complicated than you realise), and you are not supposed to mess with the functions from the utf8 module. Its documentation says so. – daxim Jul 14 '11 at 15:39
  • If you want to write a module that handles unicode strings correctly and talks to the outside world, then you need to know if a string is encoded in unicode or not (yes, unicode is not an encoding and internally it is utf-8, but a user should not care what the internal representation is; the user should only care if it is unicode or not). But sure, you can also not care about unicode and let the user of your module handle it by himself, but I don't like that. Perl has unicode strings and a module author should consider them. I'm always open to a better way. "Don't do it" is not a better way. – David Raab Jul 14 '11 at 16:01
  • Sorry, but that's not true at all. `is_utf8` does not indicate whether something needs to be encoded. In fact, Perl has no way of knowing whether a string needs to be encoded or not. If it did, it could do it itself. (I'd debunk your claims in detail, but this box is really not appropriate for explaining anything.) As for what to do instead, you should decode everything on input and encode everything on output. If you want to deal with both encoded and decoded strings, you'll need to manually keep track of which is which. – ikegami Jul 14 '11 at 16:44
  • Because you both complain, I will remove the sentence and create a question from it a little bit later (tomorrow). – David Raab Jul 14 '11 at 16:55
  • @Sid Burn, For example, In `$name = "\x{C9}ric";`, `$name` contains a *text* string of an alternate spelling of my name. Since it's text, it needs to be encoded. In `$control_seq = "\xC9\x72\x69\x63";`, `$control_seq` contains a *byte* string for controlling some device. Since it's bytes, it must not be encoded. Both strings are indistinguishable. – ikegami Jul 14 '11 at 19:26
  • If you don't decode it, then it is not text, then it is just bytes. Your assumption is that "\x{c9}ric" is in ISO-8859-1 encoding. If that is the case, you need to decode it. If you do that, the resulting variable holds a unicode string, and `utf8::is_utf8()` returns true. If `utf8::is_utf8()` returns true, you need to encode() it before printing. For example, "\x{c9}ric" will print some invalid string on a UTF-8 terminal. The correct way to print it is `print encode("UTF-8", decode("iso-8859-1", "\x{c9}ric"))`. Without decoding, your string is just a byte string. – David Raab Jul 14 '11 at 20:06
  • The problem is that if you use for example HTML::Entities to decode ascii text with ' ', you'll get a character string internally encoded in Latin1 and without the UTF8 flag. If you relied on what the is_utf8 predicate tells you you'd think it is bytes - but it is not and if you tried to decode it you'd get a crash. See [my blogpost for more details](http://perlalchemy.blogspot.com/2011/08/isutf8-is-useless-can-we-have.html) – zby Aug 29 '11 at 13:20
  • I left new comments on your blog. And Latin1 is also "just bytes". Text only exists in the human mind. The computer only sees bytes. As long as you don't decode() it, for perl it is only bytes, and you need to know which encoding it is, or whether it is text or an image. – David Raab Aug 31 '11 at 14:43
  • Sid - nobody disputes that - the problem was with using is_utf8 to answer the question if you need to decode() or not. – zby Sep 01 '11 at 11:16
  • If is_utf8() returns true, you don't need to decode() something. It is already decoded. – David Raab Sep 01 '11 at 20:37