
I'm (belatedly) testing the Unicode waters for the first time, and I fail to understand why encoding and then decoding an Arabic string separates the word into its individual characters.

In the example below, the word "ﻟﻠﺒﻴﻊ" consists of 5 individual letters: "ع","ي","ب","ل","ل", written right to left. Depending on the surrounding context (adjacent letters), the letters change form.


use strict;
use warnings;
use utf8;

binmode( STDOUT, ':utf8' );

use Encode qw< encode decode >;

my $str = 'ﻟﻠﺒﻴﻊ';                 # "For sale" 
my $enc = encode( 'UTF-8', $str );
my $dec = decode( 'UTF-8', $enc );

my $decoded = pack 'U0W*', map +ord, split //, $enc;

print "Original string : $str\n";     #  ل ل ب ي ع   
print "Decoded string 1: $dec\n"      #  ل ل ب ي ع
print "Decoded string 2: $decoded\n"; #  ل ل ب ي ع

ADDITIONAL INFO

  • When pasting the string to this post, the rendering is reversed so it looks like "ﻊﻴﺒﻠﻟ". I'm reversing it manually to get it to look 'right'. The correct hexdump is given below:

    $ echo "ﻟﻠﺒﻴﻊ" | hexdump
    0000000 bbef ef8a b4bb baef ef92 a0bb bbef 0a9f
    0000010
    
  • The output of the Perl script (per ikegami's request):

    $ perl unicode.pl | od -t x1
    0000000 4f 72 69 67 69 6e 61 6c 20 73 74 72 69 6e 67 20
    0000020 3a 20 d8 b9 d9 8a d8 a8 d9 84 d9 84 0a 44 65 63
    0000040 6f 64 65 64 20 73 74 72 69 6e 67 20 31 3a 20 d8
    0000060 b9 d9 8a d8 a8 d9 84 d9 84 0a 44 65 63 6f 64 65
    0000100 64 20 73 74 72 69 6e 67 20 32 3a 20 d8 b9 d9 8a
    0000120 d8 a8 d9 84 d9 84 0a
    0000127
    

    And if I just print $str:

    $ perl unicode.pl | od -t x1
    0000000 4f 72 69 67 69 6e 61 6c 20 73 74 72 69 6e 67 20
    0000020 3a 20 d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
    0000035
    

    Finally (per ikegami's request):

    $ grep 'For sale' unicode.pl | od -t x1
    0000000 6d 79 20 24 73 74 72 20 3d 20 27 d8 b9 d9 8a d8
    0000020 a8 d9 84 d9 84 27 3b 20 20 23 20 22 46 6f 72 20
    0000040 73 61 6c 65 22 20 0a
    0000047
    
  • Perl details

    $ perl -v
    
    This is perl, v5.10.1 (*) built for x86_64-linux-gnu-thread-multi
    (with 53 registered patches, see perl -V for more detail)
    
  • Outputting to file reverses the string: "ﻊﻴﺒﻠﻟ"


QUESTIONS

I have several:

  • How can I preserve the context of each character while printing?

  • Why is the original string printed out to screen as individual letters, even though it hasn't been 'processed'?

  • When printing to file, the word is reversed (I'm guessing this is due to the script's right-to-left nature). Is there a way I can prevent this from happening?

  • Why does the following not hold true: $str !~ /\P{Bidi_Class: Right_To_Left}/;
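On the last question, a minimal sketch of one likely explanation (an assumption based on the Unicode character database, not something this thread confirms): Arabic letters carry the Bidi class `Arabic_Letter` (AL), whereas `Right_To_Left` (R) covers scripts such as Hebrew. Every letter of the word would therefore match `\P{Bidi_Class: Right_To_Left}`. The string below is written with escapes so the example does not depend on how the source file was saved.

```perl
use strict;
use warnings;

# The word as base Arabic letters in logical order: lam lam beh yeh ain
my $str = "\x{0644}\x{0644}\x{0628}\x{064A}\x{0639}";

# Each test is true only if no character falls outside the given Bidi class
my $all_r  = $str !~ /\P{Bidi_Class: Right_To_Left}/ ? 1 : 0;
my $all_al = $str !~ /\P{Bidi_Class: Arabic_Letter}/ ? 1 : 0;

print "All Right_To_Left: $all_r\n";
print "All Arabic_Letter: $all_al\n";
```

If that reading is right, testing against `Arabic_Letter` (or against both classes) is closer to what the question intends.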

Zaid
  • It's likely that monospace characters don't merge together like they do normally. – Niet the Dark Absol Jan 30 '13 at 20:38
  • @Kolink : In my terminal, `echo "ﻟﻠﺒﻴﻊ"` happily returns `ﻟﻠﺒﻴﻊ` – Zaid Jan 30 '13 at 20:40
  • Is "monospace character" a thing now? – Kerrek SB Jan 30 '13 at 20:42
  • As another point of information - the literal string is reversed when I cut and paste it into a terminal (LANG=en_GB.UTF-8). After that it looks the same with all 3 print statements. In the comment the characters are reversed too. There are "spaces" between the characters in the comment but not in the source or the print-out. – Richard Huxton Jan 30 '13 at 21:23
  • Sadly, my terminal (`gnome-terminal` in `en_US.utf-8`) displays the characters left-to-right in each case, never correctly; but it is identical, each time. Strangely, it does have them in proper ligature form (e.g. the ل have a flat bottom, but are at the far left, the z is in its final form). I wonder, however, if printing the Unicode directionality bytes to your output file might help? – BRPocock Jan 30 '13 at 22:37
  • The example code above works for me on Perl v5.14.2 (after fixing the missing semicolon), and outputs "ﻟﻠﺒﻴﻊ" (`ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a` in hex) on all three lines. – Ilmari Karonen Jan 30 '13 at 23:54
  • All three prints ﻟﻠﺒﻴﻊ (unseparated) for me. Sounds like an issue with your terminal. Perl 5.14.2 on x86_64-linux accessed using `putty` from a Windows machine. – ikegami Jan 31 '13 at 03:47
  • @ikegami : I would've thought the same, but see my reply to Kolink – Zaid Jan 31 '13 at 06:20
  • @RichardHuxton : I left the spaces in to prevent the HTML from rendering it as "ﻟﻠﺒﻴﻊ" – Zaid Jan 31 '13 at 06:22
  • @BRPocock : This is what happens when I print it to file: "عيبلل" – Zaid Jan 31 '13 at 06:32
  • @Zaid, Then `echo "ﻟﻠﺒﻴﻊ"` doesn't send the same thing as Perl does. (Actually, it's visibly quite different. Perl sent a hundred more characters.) If you think it's a Perl problem, it's easy to check: pipe the output to `od -t x1` – ikegami Jan 31 '13 at 06:37
  • Re update, And what's your output from Perl? – ikegami Jan 31 '13 at 06:48
  • Finally, what's `grep 'For sale' script.pl | od -t x1` – ikegami Jan 31 '13 at 08:03

2 Answers

  • Source code returned by StackOverflow (as fetched using wget):

    ... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a ...
    
    U+FEDF ARABIC LETTER LAM INITIAL FORM
    U+FEE0 ARABIC LETTER LAM MEDIAL FORM
    U+FE92 ARABIC LETTER BEH MEDIAL FORM
    U+FEF4 ARABIC LETTER YEH MEDIAL FORM
    U+FECA ARABIC LETTER AIN FINAL FORM
    
  • perl output I get from the source code returned by StackOverflow:

    ... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a
    ... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a
    ... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a
    
    U+FEDF ARABIC LETTER LAM INITIAL FORM
    U+FEE0 ARABIC LETTER LAM MEDIAL FORM
    U+FE92 ARABIC LETTER BEH MEDIAL FORM
    U+FEF4 ARABIC LETTER YEH MEDIAL FORM
    U+FECA ARABIC LETTER AIN FINAL FORM
    U+000A LINE FEED
    

    So I get exactly what's in the source, as I should.

  • perl output you got:

    ... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
    ... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
    ... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
    
    U+0639 ARABIC LETTER AIN
    U+064A ARABIC LETTER YEH
    U+0628 ARABIC LETTER BEH
    U+0644 ARABIC LETTER LAM
    U+0644 ARABIC LETTER LAM
    U+000A LINE FEED
    

    OK, so you could have a buggy Perl (one that reverses and changes Arabic characters, and only those), but it's far more likely that your source doesn't contain what you think it does. You need to check which bytes make up your source file.

  • echo output you got:

    ef bb 8a ef bb b4 ef ba 92 ef bb a0 ef bb 9f 0a
    
    U+FECA ARABIC LETTER AIN FINAL FORM
    U+FEF4 ARABIC LETTER YEH MEDIAL FORM
    U+FE92 ARABIC LETTER BEH MEDIAL FORM
    U+FEE0 ARABIC LETTER LAM MEDIAL FORM
    U+FEDF ARABIC LETTER LAM INITIAL FORM
    U+000A LINE FEED
    

    There are significant differences in what you got from perl and from echo, so it's no surprise they show up differently.
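The comparison above can be reproduced with a short script. This is a sketch only: the helper name `code_points` is hypothetical, and the hex strings are copied from the dumps in this answer.

```perl
use strict;
use warnings;
use Encode qw< decode >;

# Turn a string of hex byte pairs into a list of "U+XXXX" code points
sub code_points {
    my ($hex) = @_;
    $hex =~ s/\s+//g;
    my $chars = decode( 'UTF-8', pack( 'H*', $hex ) );
    return map { sprintf( 'U+%04X', ord $_ ) } split //, $chars;
}

my @source = code_points('ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a');
my @echo   = code_points('ef bb 8a ef bb b4 ef ba 92 ef bb a0 ef bb 9f');
my @perl   = code_points('d8 b9 d9 8a d8 a8 d9 84 d9 84');

print "source: @source\n";
print "echo  : @echo\n";
print "perl  : @perl\n";

# The echo output holds the same code points as the fetched source, reversed
my $echo_is_reversed = "@echo" eq join( ' ', reverse @source ) ? 'yes' : 'no';
print "echo is the source reversed: $echo_is_reversed\n";
```

The third list shares no code points with the other two, which is why the terminal renders it differently.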


Output inspected using:

$ perl -Mcharnames=:full -MEncode=decode_utf8 -E'
   say sprintf("U+%04X %s", $_, charnames::viacode($_))
      for unpack "W*", decode_utf8 pack "H*", $ARGV[0] =~ s/\s//gr;
' '...'

(Don't forget to swap the two bytes of each 16-bit word in hexdump's default output.)
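A minimal sketch of that byte swap, assuming plain `hexdump` output with little-endian 16-bit words (as on x86) and an even number of bytes. The function name `hexdump_words_to_bytes` is made up for illustration.

```perl
use strict;
use warnings;

# Convert the 16-bit words printed by plain `hexdump` back into byte order
sub hexdump_words_to_bytes {
    my ($words) = @_;
    my @bytes;
    for my $word ( split ' ', $words ) {
        my ( $hi, $lo ) = $word =~ /^([0-9a-f]{2})([0-9a-f]{2})$/i
            or next;    # skip anything that is not a 4-digit hex word
        push @bytes, $lo, $hi;    # the low byte comes first in the file
    }
    return join ' ', @bytes;
}

my $bytes = hexdump_words_to_bytes('bbef ef8a b4bb baef ef92 a0bb bbef 0a9f');
print "$bytes\n";
```

The result can be passed as the argument to the inspection command above. Using `od -t x1` or `hexdump -C` instead avoids the swap entirely.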

ikegami
  • I'll need a few hours to get back to my machine. Thanks for your analysis so far.. very insightful – Zaid Jan 31 '13 at 12:10
  • You were right, my source string itself was messed up, even though vim renders it correctly, just as it's shown in the question. I'm not sure why that is the case, but the inspection tool you provided puts it beyond doubt. Thanks once again. – Zaid Jan 31 '13 at 20:57
  • AFAIK, presentation forms (e.g: the blocks U+FB50..U+FDFB, U+FE70..U+FEFE) are only provided for compatibility with legacy encodings (e.g: CP864), and are not supposed to be used in new text. The rendering system should take care of displaying the proper positional shapes. – ninjalj May 26 '14 at 09:49
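The relationship between presentation forms and plain letters can be illustrated with compatibility normalization. This is a sketch, not a recommended fix: it uses `NFKC` from the core `Unicode::Normalize` module, and the two strings are the code points reported in the answer above, written as escapes.

```perl
use strict;
use warnings;
use Unicode::Normalize qw< NFKC >;

# The echo output from the question: presentation forms, ain first
my $forms = "\x{FECA}\x{FEF4}\x{FE92}\x{FEE0}\x{FEDF}";

# The letters the Perl script actually printed
my $plain = "\x{0639}\x{064A}\x{0628}\x{0644}\x{0644}";

# Count characters in the presentation-form ranges named in the comment
my $count = () = $forms =~ /[\x{FB50}-\x{FDFB}\x{FE70}-\x{FEFE}]/g;

# Compatibility normalization replaces each form with its base letter
my $same = NFKC($forms) eq $plain ? 'yes' : 'no';

print "presentation forms found: $count\n";
print "NFKC(forms) equals plain letters: $same\n";
```

After normalization, choosing the positional shape of each letter is left to the renderer, as the comment describes.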

Maybe there is something odd with your shell? If I redirect the output to a file, the result is the same. Please try this out:

use strict;
use warnings;
use utf8;

binmode( STDOUT, ':utf8' );

use Encode qw< encode decode >;

my $str = 'ﻟﻠﺒﻴﻊ';                 # "For sale" 
my $enc = encode( 'UTF-8', $str );
my $dec = decode( 'UTF-8', $enc );

my $decoded = pack 'U0W*', map +ord, split //, $enc;

open(F1,'>',"original.txt") or die;
open(F2,'>',"decoded.txt") or die;
open(F3,'>',"decoded2.txt") or die;

binmode(F1, ':utf8');binmode(F2, ':utf8');binmode(F3, ':utf8');

print F1 "$str\n";     #  ل ل ب ي ع   
print F2 "$dec\n";     #  ل ل ب ي ع
print F3 "$decoded\n";
user1126070
  • When I do `$ perl unicode.pl > test.txt`, the string reverses – Zaid Jan 31 '13 at 12:12
  • Did you try running this script and checking the files with diff? Maybe your shell or editor has issues with utf8. – user1126070 Jan 31 '13 at 14:03
  • I'm not at my machine at the moment.. when I get there I'll try it out. However, I don't think this is the issue since my script already has `binmode( STDOUT,':utf8' );` set – Zaid Jan 31 '13 at 14:15