When I use functions like toupper() in AWK, they are automatically locale-aware and process text in the user's current locale.

I would like to do the same in a Perl script, but have failed so far.

To test this, I wrote the following pure-ASCII shell script, which exercises both Perl and AWK:

$ unexpand -t 2 << 'END_SCRIPT' | tee case3 && chmod +x case3
#! /bin/sh
{
  iconv -cf UTF-7 \
  | case $1 in
  awk)
    awk '{
      print "original", $0
      print "to lower", tolower($0)
      print "to upper", toupper($0)
    }'
    ;;
  perl)
    perl -e '
      use locale;
      while (defined($_= <>)) {
        print "original ", $_;
        print "to lower ", lc;
        print "to upper ", uc;
      }
    '
  esac \
  | iconv -ct UTF-7 | iconv -cf UTF-7
} << 'EOF'
+AMQ-gypten
S+APw-d
+APY-stlich
EOF
END_SCRIPT

Note the iconv round trip through UTF-7 at the end of the script: it is only there to drop any characters from the output that the current locale cannot represent.

Here is the output when I run the script for testing AWK:

$ ./case3 awk
original Ägypten
to lower ägypten
to upper ÄGYPTEN
original Süd
to lower süd
to upper SÜD
original östlich
to lower östlich
to upper ÖSTLICH

This is exactly the output I expect.

Now the same for Perl:

$ ./case3 perl
original Ägypten
to lower gypten
to upper ÄGYPTEN
original Süd
to lower sd
to upper SüD
original östlich
to lower stlich
to upper öSTLICH

Obviously, this output is different and wrong: in the lowercase lines the umlauts disappear entirely, and in the uppercase lines they are left unconverted.

I would appreciate knowing what I did wrong in the "perl" case of the script.

Note: I do not want my script to require a UTF-8 locale; it should work with any locale that can represent the German umlauts used in my test input.

In case you are curious, the above results were generated with the following locale settings:

$ locale
LANG=de_AT.UTF-8
LANGUAGE=de_AT.UTF-8:de.UTF-8:en_US.UTF-8:de_AT:de:en_US:en
LC_CTYPE="de_AT.UTF-8"
LC_NUMERIC="de_AT.UTF-8"
LC_TIME="de_AT.UTF-8"
LC_COLLATE="de_AT.UTF-8"
LC_MONETARY="de_AT.UTF-8"
LC_MESSAGES="de_AT.UTF-8"
LC_PAPER="de_AT.UTF-8"
LC_NAME="de_AT.UTF-8"
LC_ADDRESS="de_AT.UTF-8"
LC_TELEPHONE="de_AT.UTF-8"
LC_MEASUREMENT="de_AT.UTF-8"
LC_IDENTIFICATION="de_AT.UTF-8"
LC_ALL=
  • What's your version of Perl? "*Unfortunately, there are quite a few deficiencies with the design (and often, the implementations) of locales. Unicode was invented (see perlunitut for an introduction to that) in part to address these design deficiencies, and nowadays, there is a series of "UTF-8 locales", based on Unicode. These are locales whose character set is Unicode, encoded in UTF-8. Starting in v5.20, Perl fully supports UTF-8 locales, except for sorting and string comparisons like `lt` and `ge`.*" – ikegami Feb 11 '19 at 06:39
  • "*Starting in v5.26, Perl can handle these reasonably as well, depending on the platform's implementation. However, for earlier releases or for better control, use Unicode::Collate. Perl continues to support the old non UTF-8 locales as well. There are currently no UTF-8 locales for EBCDIC platforms.*" – ikegami Feb 11 '19 at 06:39
  • @ikegami I have perl 5, version 24, subversion 1 (v5.24.1) built for i686-linux-gnu-thread-multi-64int - this is the current Debian-9 version. – Guenther Brunthaler Feb 11 '19 at 06:48
  • @ikegami Can you give me a hint how Unicode::Collate could be integrated within my above script? As far as I noticed from a quick glance, Unicode::Collate does not care about the current locale. And I do not want my script to require a UNICODE locale either. – Guenther Brunthaler Feb 11 '19 at 06:53
  • I wasn't suggesting you use U::C; I was just pointing out the issues with different versions of Perl. If you're trying to sort (rather than change case), [Unicode::Collate::Locale](https://metacpan.org/pod/Unicode::Collate::Locale) would be more up your alley. – ikegami Feb 11 '19 at 06:55
  • @ikegami Do I understand you correctly then, that "use locale" is buggy in Perl versions before 5.26, and just does not work correctly with all locales, not even with UTF-8 based locales? – Guenther Brunthaler Feb 11 '19 at 06:59
  • Well, according to the passage, it should have worked with 5.20+. But yeah, UTF-8 locales were largely ignored by Perl for a while (because they were so broken in so many distros, as I understand things). – ikegami Feb 11 '19 at 07:00
  • Oh, turns out my host does have `de_AT.UTF-8`. Testing... – ikegami Feb 11 '19 at 07:03
  • I wonder if I might have to do more than just "use locale" in the Perl script? But as far as I have understood the documentation, this should not be the case provided the locale has been set correctly. Which seems to be the case, or the AWK could not work correctly either. – Guenther Brunthaler Feb 11 '19 at 07:04
  • `LC_ALL=de_AT.UTF-8 LANG=de_AT.UTF-8 LANGUAGE=de_AT.UTF-8 perl -le'use locale; $_ = "Ägypten"; printf "%vX\n", $_; $_ = lc($_); printf "%vX\n", $_;'` converts `C3.84` into `E3.84` instead of `C3.A4`, suggesting the input should be decoded before being passed to `lc`. But while that seems required for UTF-8 locales, my understanding of perllocale leads me to believe that's bad for non-UTF-8 locales because it can lead to the following situation: ("Greek locale" refers to ISO8859-7) – ikegami Feb 11 '19 at 07:38
  • "*Still another problem is that this approach can lead to two code points meaning the same character. Thus in a Greek locale, both U+03A7 and U+00D7 are GREEK CAPITAL LETTER CHI. Because of all these problems, starting in v5.22, Perl will raise a warning if a multi-byte (hence Unicode) code point is used when a single-byte locale is in effect. (Although it doesn't check for this if doing so would unreasonably slow execution down.)*" – ikegami Feb 11 '19 at 07:38

1 Answer

This is not quite what you asked since it determines casing based on Unicode rules instead of the locale's rules, but it will work for all locales (UTF-8 and otherwise):

use open ':std', ':locale';
while (<>) {
    print "original ", $_;
    print "to lower ", lc;
    print "to upper ", uc;
}
ikegami