
simple question:

this is the final display string I am looking for

لعبة ديدة

now below is each of the separate characters, before being 'glued' together (so I've put a space between each of them to stop the joining)

ل ع ب ة د ي د ة

note how they are NOT the same characters, there is some magical transform that melds them together and converts them to new Unicode characters.

and in the example above, the characters actually appear right to left (in memory, they are left to right)

so my simple question is this: where do I get a platform-independent C/C++ function that will take my source 16-bit Unicode string and transform it into the Unicode string that produces the first string quoted above, doing both the RTL conversion and the joining?

that's all I want, one function that does that.

UPDATE:

ok, yes, I know that the 'characters' are the same in the two examples above: they are the same 'letters', but (viewing in Chrome, or the latest IE) anyone can CLEARLY see that the glyphs are different. now I'm fairly confident that this transform can be done at the Unicode level, because my font file and the Unicode standard both seem to specify different glyphs for the separate and the various joined versions of the characters/letters. (unicode.org/charts/PDF/UFB50.pdf unicode.org/charts/PDF/UFE70.pdf)

so, can I just put my unicode into a function and get the transformed unicode out?

matt
  • For those not fluent in Arabic, could you point out the differences? The two strings appear quite the same, except for joiners in the first and the spaces in the second string. And those are expected. Also, strings in memory are stored low-to-high address, not left-to-right. LTR is just how you render Latin fonts. – MSalters Oct 18 '11 at 07:55
  • Memory doesn't have left/right. Just lower/higher, or if you prefer, before/after. – rodrigo Oct 18 '11 at 08:03
  • I don't know of any standard libraries that do this (although I am sure some exist) but the phrase you need to google for is "logical to visual conversion". The codepoints are stored as "logical" characters but you need to convert them to "visual" for display. – Vicky Oct 18 '11 at 08:39
  • You need to run your text through the Unicode bidirectional and glyph reordering algorithm. This is a very complex beast, so you'd best leave this to a library. `libicu` is the only free one I can think of right now. – Kerrek SB Oct 18 '11 at 10:38

5 Answers


The joining and RTL conversion don't happen at the level of Unicode characters.

In other words: the order of the characters and the actual unicode codepoints are not changed during this process.

In fact, the merging and handling RTL/LTR transitions is handled by the text rendering engine.

This quote from the Wikipedia article on the Arabic alphabet explains it quite nicely:

Finally, the Unicode encoding of Arabic is in logical order, that is, the characters are entered, and stored in computer memory, in the order that they are written and pronounced without worrying about the direction in which they will be displayed on paper or on the screen. Again, it is left to the rendering engine to present the characters in the correct direction, using Unicode's bi-directional text features. In this regard, if the Arabic words on this page are written left to right, it is an indication that the Unicode rendering engine used to display them is out-of-date.
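To make "logical order" concrete, here is a minimal C++ sketch. It stores the word لعبة as UTF-16 code units in the order they are read, then produces a naive visual order by reversing the string. The function names `logical_word` and `naive_visual_order` are made up for this illustration. The reversal is only an assumption that holds for a run made up entirely of right-to-left letters; it is not the Unicode Bidirectional Algorithm, and it breaks as soon as digits, Latin text, or combining marks are involved.

```cpp
#include <string>

// The word "لعبة" in logical (reading) order:
// LAM (U+0644), AIN (U+0639), BEH (U+0628), TEH MARBUTA (U+0629).
// Index 0 holds LAM, which a renderer draws at the far right.
std::u16string logical_word() {
    return u"\u0644\u0639\u0628\u0629";
}

// Hypothetical helper: reverse a run that is purely right-to-left.
// This is NOT the Unicode Bidirectional Algorithm, only a toy
// that shows what "visual order" means for the simplest case.
std::u16string naive_visual_order(const std::u16string& logical) {
    return std::u16string(logical.rbegin(), logical.rend());
}
```

A rendering engine does this reordering internally at display time, which is why the stored string itself never changes.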

Joachim Sauer
  • Perhaps he's looking for canonicalization/normalization as well, in which case that isn't just a visual property but would actually combine/split/reorder the code points. – edA-qa mort-ora-y Oct 18 '11 at 11:10
  • hmm, well certainly the reordering could happen on the text level, it's just rearranging characters. as for the merging into different glyphs, well, I thought that's what these characters were for? http://unicode.org/charts/PDF/UFB50.pdf http://unicode.org/charts/PDF/UFE70.pdf – matt Oct 19 '11 at 00:35
  • @matt: those ranges are just for roundtrip conversion, e.g. from/to cp864, which has positional variants. Note that other legacy encodings (cp720, cp868, iso-8859-6, cp1256) don't have positional variants. – ninjalj May 22 '14 at 15:51

The processing you're looking for is called ligature substitution (more generally, contextual shaping). Unlike many Latin-based languages, where you can simply put one character after another to render the text, ligatures are fundamental in Arabic. The substitution is done in the text rendering engine, and the ligature information is generally stored in font files.

note how they are NOT the same characters

They are the same for an Arabic reader. It is still readable. There is no transform to do on your UTF-16 source text. You must provide the whole string to your text renderer. In C/C++, and as you are going the platform-independent way, you can use Pango for rendering.

Note: Perhaps you wanted to write لعبة جديدة (i.e. "new game")? What you give as an example has no meaning in Arabic.

overcoder

I realise this is an old question, but what you're looking for is FriBidi, the GNU implementation of the Unicode bidirectional algorithm.

This program does the glyph selection that was asked about in the question, as well as handling bidirectional text (a mixture of right-to-left and left-to-right text).

Aky

What you are looking for is an Arabic script synthesis algorithm. I'm not aware of one that exists as open source. If you find one, please post it.

Some points:

At the storage level, there is no Unicode transform. There is an abstract representation of the string as pointed out by other answers.

At the rendering level, you could choose to use Unicode Presentation Forms, but you could also choose to use other forms. Unicode Presentation Forms are not a standard for what presentation output encoding should be - rather they are just one example of presentation codes that can be output by the rendering engine using script synthesis.

To make it clearer: there is no single standard transform (i.e. synthesis algorithm) from A to B, where A is the standard Unicode Arabic page and B is the standard Unicode Arabic Presentation Forms. Rather, there are different transformations that vary in complexity and can use different encoding systems for B; the Unicode Presentation Forms are just one of the encodings that can be used for B. For example, a simple typewriter style would need only a simple rendering algorithm that does not require Presentation Forms. Indeed, there do exist modern writing styles (not in common usage, though) where A and B are actually identical, and only a different font page is used to do the rendering. On the other hand, the transform to render typesetting or traditional calligraphic forms would be more complex and require something similar to the Unicode Presentation Forms.
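As an illustration of one such transform, here is a minimal C++ sketch that maps the letters from the question's example to code points in the Arabic Presentation Forms-B block (U+FE70 to U+FEFF), choosing the isolated, final, initial, or medial form from the neighbouring letters. The names `Forms`, `form_table`, and `shape` are made up for this example, and the table deliberately covers only the six letters in the question. The sketch assumes plain letters only: it ignores diacritics, ZWJ/ZWNJ, and lam-alef ligatures, and it leaves the text in logical order, so it is a toy and not a substitute for a shaping library.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Presentation forms for one letter. A value of 0 means the letter
// has no such form (right-joining letters have no initial/medial form).
struct Forms {
    char16_t isolated, final_form, initial, medial;
};

// Illustrative table: only the six letters used in the question.
static const std::map<char16_t, Forms>& form_table() {
    static const std::map<char16_t, Forms> table = {
        {0x0628, {0xFE8F, 0xFE90, 0xFE91, 0xFE92}},  // BEH
        {0x0629, {0xFE93, 0xFE94, 0,      0}},       // TEH MARBUTA
        {0x062F, {0xFEA9, 0xFEAA, 0,      0}},       // DAL
        {0x0639, {0xFEC9, 0xFECA, 0xFECB, 0xFECC}},  // AIN
        {0x0644, {0xFEDD, 0xFEDE, 0xFEDF, 0xFEE0}},  // LAM
        {0x064A, {0xFEF1, 0xFEF2, 0xFEF3, 0xFEF4}},  // YEH
    };
    return table;
}

// Replace each known letter with its contextual presentation form.
// Input and output are both in logical order; unknown characters
// (spaces, Latin text, unlisted letters) pass through and break joining.
std::u16string shape(const std::u16string& logical) {
    const auto& table = form_table();
    std::u16string out;
    out.reserve(logical.size());
    for (std::size_t i = 0; i < logical.size(); ++i) {
        const auto it = table.find(logical[i]);
        if (it == table.end()) {
            out.push_back(logical[i]);
            continue;
        }
        const Forms& f = it->second;
        // The previous letter connects forward only if it is dual-joining.
        bool prev_joins = false;
        if (i > 0) {
            const auto p = table.find(logical[i - 1]);
            prev_joins = (p != table.end() && p->second.initial != 0);
        }
        // This letter connects forward only if it is dual-joining
        // and the next character is a joining letter.
        const bool next_joins = f.initial != 0 && i + 1 < logical.size() &&
                                table.count(logical[i + 1]) != 0;
        out.push_back(prev_joins ? (next_joins ? f.medial : f.final_form)
                                 : (next_joins ? f.initial : f.isolated));
    }
    return out;
}
```

Called on the question's string, `shape` turns the first word into initial LAM, medial AIN, medial BEH, final TEH MARBUTA, and the second word into isolated DAL, initial YEH, final DAL, isolated TEH MARBUTA. Drawing the result with a font that only has the Presentation Forms glyphs would still require reordering for display.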

Here are a couple of pointers for more information on the topic:

Basel Shishani