How to get the proper Arabic char (proper form) from a string in Java?

Question

I'm trying to calculate the width of the chars in an Arabic string in Java, and at times I haven't been able to get the proper char. In case you don't know, Arabic chars can connect depending on their position in the word, and each variation has a different hex code: https://en.wikipedia.org/wiki/Arabic_script_in_Unicode#:~:text=External%20links-,Contextual%20forms,-%5Bedit%5D

What is happening is this: say I read in the char ت (hex value FE95). Depending on where it comes in the word, it can take different forms (refer to the link for context; this is the third letter in the table). When it comes at the beginning of a word, it should look like ﺗ (hex code FE97). In the string I read in, it is shown as ﺗ (hex code FE97), but when I read it using string.charAt(index) I get ت (hex value FE95). Does anyone know of a way to read in the proper character with the right hex value?

EDIT: For reference, here is a string: تشخيصك. None of the middle letters are read in the right form:

The first letter is the one I described above.
The second letter is ش (hex FEB5). In the word it is represented as ﺸ (hex FEB8), but when it is read via string.charAt it holds the value ش (hex FEB5).
The third letter is خ (hex FEA5). In the word it is represented as ﺨ (hex FEA8), but when it is read via string.charAt it holds the value خ (hex FEA5).

and so on... Any help would be much appreciated!

Holger · Answer 1 · 2022-03-10T09:40:56.230

From the Wikipedia article you’ve linked:

The presentation forms are present only for compatibility with older standards, and are not currently needed for coding text.

In other words, you should just use the generic form (i.e. the 0600–06FF range codepoints), instead of the presentation forms of the FE70–FEFF codepoint range. Note that your example string, تشخيصك, consists of 06xx characters only, at least as given to me by the browser.
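To check which range a string's characters actually come from, one option is to print each code point together with its Unicode block. The following is only a minimal sketch: the class name CodePointDump and the helper describe are made up for illustration, and the word from the question is written with \u escapes so that the result does not depend on the source file encoding.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    // Hypothetical helper: lists each code point of s as "U+XXXX (BLOCK)".
    static String describe(String s) {
        return s.codePoints()
            .mapToObj(cp -> String.format("U+%04X (%s)", cp, Character.UnicodeBlock.of(cp)))
            .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        // The word from the question, written with escapes.
        String word = "\u062A\u0634\u062E\u064A\u0635\u0643";
        System.out.println(describe(word));
        // A presentation-form character (initial form of TEH), for comparison.
        System.out.println(describe("\uFE97"));
    }
}
```

For the word above, every entry should report the ARABIC block, while the comparison character reports ARABIC_PRESENTATION_FORMS_B.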

If you have a legacy source of FE70–FEFF characters, don’t try to fix their order, but just drop the presentation information by normalizing the string(s) with a compatibility form such as NFKD. E.g.

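import java.text.Normalizer;
import java.util.stream.Collectors;

// Presentation-form characters in a deliberately wrong order
String s = "\uFE97\uFE98\uFE97\uFE96\uFE98";
System.out.println(s);
// Compatibility decomposition maps each presentation form to its generic letter
s = Normalizer.normalize(s, Normalizer.Form.NFKD);
System.out.println(s);
// Print the resulting code units as \uXXXX escapes
System.out.println(s.chars().mapToObj(i -> String.format("\\u%04X", i))
    .collect(Collectors.joining("", "\"", "\"")));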

which prints

ﺗﺘﺗﺖﺘ
تتتتت
"\u062A\u062A\u062A\u062A\u062A"

The example’s source string intentionally uses presentation form characters in the wrong positions, to show that simply obtaining a string without presentation information, consisting only of repetitions of the generic U+062A character, solves the issue. In other words, the generic string is rendered correctly.
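If only some inputs come from a legacy source, it may be worth normalizing only when presentation forms are actually present. The sketch below assumes that checking the two Arabic Presentation Forms blocks is enough for the data at hand; the class name PresentationForms and both helper names are hypothetical. Note that NFKD applies to the whole string, so it also decomposes other compatibility characters, and ligatures such as U+FEFB expand to two characters, which changes the string length.

```java
import java.text.Normalizer;

public class PresentationForms {
    // True if s contains a code point from Arabic Presentation Forms-A or -B.
    static boolean hasPresentationForms(String s) {
        return s.codePoints().anyMatch(cp -> {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            return block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
        });
    }

    // Hypothetical helper: normalizes only when legacy forms are present,
    // otherwise returns the string unchanged.
    static String toGenericForm(String s) {
        return hasPresentationForms(s)
            ? Normalizer.normalize(s, Normalizer.Form.NFKD)
            : s;
    }

    public static void main(String[] args) {
        String legacy = "\uFE97\uFE98\uFE96";  // initial, medial, final TEH
        String generic = toGenericForm(legacy);
        // Print the code points of the result for inspection.
        generic.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Because the length can change, indices computed on the original string should not be reused on the normalized one.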

Selecting the right form for each position is done at rendering time with the help of the font, a feature known as glyph shaping. We can see it in action, e.g. with

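import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;
import java.awt.image.BufferedImage;

Font font = new Font("DejaVu Sans", Font.PLAIN, 48); // or Droid Sans Arabic
FontRenderContext frc = new FontRenderContext(null, true, false);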

String s = "\u062A\u062A\u062A";
System.out.println(s);

if(font.canDisplayUpTo(s) >= 0) {
    System.out.println("can't display string");
    System.exit(0);
}
GlyphVector g = font.layoutGlyphVector(frc,
    s.toCharArray(), 0, s.length(), Font.LAYOUT_RIGHT_TO_LEFT);

Rectangle r = g.getPixelBounds(frc, 0, 0);
System.out.println("Total width " + r.width);
for(int i = 0, n = g.getNumGlyphs(); i < n; i++) {
    int chPos = g.getGlyphCharIndex(i);
    System.out.printf("%2d (U+%04X) glyph code %4d, width %.0f%n",
        chPos,(int)s.charAt(chPos),g.getGlyphCode(i),g.getGlyphMetrics(i).getAdvance());
}
BufferedImage bi = new BufferedImage(r.width, r.height, BufferedImage.TYPE_BYTE_BINARY);
Graphics2D gfx = bi.createGraphics();
// Shift by the bounds' origin so the glyphs land inside the image
gfx.drawGlyphVector(g, -r.x, -r.y);
gfx.dispose();
for(int line = 0, nLines = r.height; line < nLines; line++) {
    for(int ch = 0, nChars = r.width; ch < nChars; ch++) {
        System.out.print((bi.getRGB(ch, line) & 0xff) > 0? 'X': ' ');
    }
    System.out.println();
}
System.out.println();

which prints (the exact numbers depend on the font)

تتت
Total width 70
 2 (U+062A) glyph code 5261, width 47
 1 (U+062A) glyph code 5263, width 14
 0 (U+062A) glyph code 5262, width 13
                                               XX    XX      XX    XX 
                                              XXXX  XXXX    XXXX  XXXX
                                              XXXX  XXXX    XXXX  XXXX
                                              XXXX  XXXX    XXXX  XXXX
                                                                      
              XX    XX                                                
             XXXX  XXXX                                               
             XXXX  XXXX                                               
             XXXX  XXXX           XXXX                                
                                  XXXX                                
 XXXX                             XXXXX          XXXX          XXXX   
 XXXX                             XXXXX          XXXX          XXXX   
XXXXX                             XXXXX          XXXX          XXXX   
XXXX                              XXXXX          XXXX          XXXX   
XXXX                              XXXXX          XXXX          XXXX   
XXXX                             XXXXXX          XXXX          XXXX   
XXXXX                          XXXXXXXX          XXXX          XXXX   
XXXXX                        XXXXXXXXXXX        XXXXXX        XXXXX   
 XXXXXX                    XXXXXXXXXXXXX        XXXXXX        XXXXX   
 XXXXXXXX              XXXXXXXXXXX  XXXXXX    XXXXXXXXXX    XXXXXX    
  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX    
   XXXXXXXXXXXXXXXXXXXXXXXXXXXX      XXXXXXXXXXXXXXXXXXXXXXXXXXXX     
     XXXXXXXXXXXXXXXXXXXXXXX          XXXXXXXXXXXX  XXXXXXXXXXXX      
        XXXXXXXXXXXXXXXX                XXXXXXXXX    XXXXXXXXXX

Note that the actual glyph numbers are entirely up to the specific font. E.g., some fonts map the middle ت character to the same glyph as the last one, a purely stylistic choice.

The program is only meant to demonstrate that the same codepoint (in this example, a char is sufficient to hold it) may get mapped to different font-specific glyphs.

Thank you! I was put on to something else for today but I will try your solution when I come back to this! — zaidabuhijleh, Mar 10 '22 at 18:37
