
I have the following task: mixed Latin/Arabic text in UTF-8 needs to be converted for printing on a POS printer that uses the ancient single-byte code page 864.

`text.getBytes("ibm-864")` produces many question marks instead of Arabic characters. After digging through the code I found that the conversion table for ibm-864 maps different versions of the Arabic characters (somewhere in the FExx range rather than the 06xx range that my text uses).

I'm looking for code or a library that can convert Arabic Unicode text to cp864, preferably mapping each character to its corresponding form (cp864 has isolated, initial, medial and final forms for some characters), and maybe even handling the reversal for RTL, because I doubt the hardware does that automatically.

I understand that this is a very specific task, but why not give it a try? I know how to implement this myself, but I would rather find something ready to use than reinvent the wheel :)

Anyone?

Another possible solution: a library that can translate Unicode Arabic from the range U+0600 - U+06FF (Arabic) to the range U+FE70 - U+FEFF (Arabic Presentation Forms-B). Then I can safely get my bytes in cp864. Has anyone seen anything like that?

dmitry
  • Do you have to use Java as the tags suggest? Otherwise, have you tried the standard `iconv` utility to see if it handles this conversion correctly? – DUman Mar 11 '15 at 09:31
  • Yes, I have to use Java. I will try iconv out of curiosity, but chances are high that it uses the same conversion table, because it is a spec: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP864.TXT – dmitry Mar 11 '15 at 09:34
  • As expected, iconv cannot handle this – dmitry Mar 11 '15 at 09:58
  • Well, after some research and torture I solved my task, but in Python, which I used to preprocess my data, so I cannot post it as an answer to my question as is. If someone stumbles on this later with no clue where to go, contact me. What I did: manual translation of generic Unicode to contextual forms, plus explicit BiDi. – dmitry Mar 16 '15 at 14:32
  • ICU supports that kind of translation. You have to tell it to use a non-roundtrip conversion, so it does convert the characters to their presentation forms. – ninjalj Jun 09 '15 at 19:06

1 Answer

To output Arabic text to a relatively dumb output device, you'll need to do several things:

  • Divide the text into blocks of different directionality using the Unicode Bidirectional Algorithm (UBA), better known as Bidi.
  • Mirror characters that need to be mirrored (e.g. an opening parenthesis points in different directions depending on whether it is inside an LTR or an RTL block).
  • Since the output device is dumb, you'll need to change characters into their positional forms, and apply ligatures where needed (there is a ligature for LAM + ALEF). This is done by a piece of software called an Arabic Shaper.
  • You'll need to reorder the text according to its directionality.
  • Since CP864 doesn't have all the positional forms for all characters, you'll need to convert to fallback forms, converting some final forms to isolated forms, some medial forms to initial forms, and some initial forms to isolated forms. The text will not ligate as nicely as if there were proper forms, but it will come relatively close.
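
As a rough illustration of the Bidi, mirroring and reordering steps above (shaping is not covered), here is a minimal sketch that uses only the JDK's `java.text.Bidi`. The class `VisualOrder`, the method `toVisual` and the tiny bracket table are hypothetical names made up for this example, and a complete mirror table would have to come from Unicode's BidiMirroring.txt. It handles one paragraph at a time and does nothing special for combining marks.

```java
import java.text.Bidi;

final class VisualOrder {
    // Hypothetical, deliberately tiny mirror table (assumption for this sketch);
    // a complete one would be generated from Unicode's BidiMirroring.txt.
    private static final String MIRROR_FROM = "()[]{}<>";
    private static final String MIRROR_TO   = ")(][}{><";

    /** Converts one paragraph from logical order to left-to-right visual order. */
    static String toVisual(String logical) {
        // Paragraph direction comes from the first strong character (LTR if none).
        Bidi bidi = new Bidi(logical, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        int runCount = bidi.getRunCount();
        byte[] levels = new byte[runCount];
        String[] runs = new String[runCount];
        for (int i = 0; i < runCount; i++) {
            levels[i] = (byte) bidi.getRunLevel(i);
            String run = logical.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            if ((levels[i] & 1) != 0) {
                // Odd level means an RTL run: reverse its characters and mirror brackets.
                run = mirror(new StringBuilder(run).reverse().toString());
            }
            runs[i] = run;
        }
        // Put the runs themselves into visual order.
        Bidi.reorderVisually(levels, 0, runs, 0, runCount);
        return String.join("", runs);
    }

    private static String mirror(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int idx = MIRROR_FROM.indexOf(c);
            out.append(idx >= 0 ? MIRROR_TO.charAt(idx) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Latin text with an embedded Arabic word (ALEF, BEH, TEH in logical order).
        System.out.println(toVisual("abc \u0627\u0628\u062A def"));
    }
}
```

The result is a string whose characters can be sent to the device left to right. If shaping is done on this visual-order text, the shaper has to be told so, because "previous" and "next" neighbours are swapped inside RTL runs.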

In Java, the ICU library (ICU4J) can do all of this:

  • ICU's Bidi can take care of dividing the text into blocks, mirroring, and reordering. Reordering can be done before shaping, since ICU's ArabicShaping supports working with text in both logical (pre-reordering) and visual (post-reordering) order.
  • ICU's ArabicShaping can take care of shaping the text, mapping it into the appropriate presentation forms. This is the FExx range you talked about; it is not meant to be used normally, only to interface with legacy software/hardware, in this case the printer that understands CP864 but not Unicode.
  • ICU's CharsetProvider and CharsetEncoder can be used to convert to CP864 using a fallback (non-roundtrip) conversion for characters that are not in the output charset, in this case the final→isolated, medial→initial, ... forms.
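
To make the fallback idea in the last bullet concrete, here is a small JDK-only sketch. It is not ICU's fallback table: the class `FormFallback` and its methods are hypothetical, and the code relies on two assumptions, that `Character.getName` reports the Unicode names ending in "FINAL FORM", "MEDIAL FORM" or "INITIAL FORM", and that within U+FE80..U+FEFC each letter's forms are laid out in the order isolated, final, initial, medial. The caller supplies a predicate saying which characters the target code page can encode.

```java
import java.util.function.IntPredicate;

final class FormFallback {
    /**
     * Returns c if the target can encode it, otherwise the nearest simpler
     * presentation form that can be encoded, or '?' if none can.
     */
    static char fallback(char c, IntPredicate encodable) {
        char current = c;
        // At most two useful hops exist: medial -> initial -> isolated.
        for (int i = 0; i < 3 && !encodable.test(current); i++) {
            char next = step(current);
            if (next == current) {
                break; // already isolated, or not a letter presentation form
            }
            current = next;
        }
        return encodable.test(current) ? current : '?';
    }

    /** One fallback hop, based on the character's Unicode name (assumption). */
    static char step(char c) {
        if (c < 0xFE80 || c > 0xFEFC) {
            return c; // outside the letter part of Arabic Presentation Forms-B
        }
        String name = Character.getName(c);
        if (name == null) {
            return c;
        }
        if (name.endsWith("FINAL FORM")) {
            return (char) (c - 1); // final -> isolated
        }
        if (name.endsWith("MEDIAL FORM")) {
            return (char) (c - 1); // medial -> initial
        }
        if (name.endsWith("INITIAL FORM")) {
            return (char) (c - 2); // initial -> isolated
        }
        return c;
    }
}
```

With a real encoder the predicate could be something like `cp -> encoder.canEncode((char) cp)`, where `encoder` comes from `Charset.forName("IBM864").newEncoder()`, assuming the JDK build in use ships that charset.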
ninjalj
  • Thanks for extended answer, will use ICU in future. – dmitry Jun 10 '15 at 20:19
  • Would you mind providing a code sample of how to do that with the ICU library? because I tried searching about that but couldn't find a sample to achieve this. – blueware Feb 01 '21 at 10:36