How to use Java to read the Unicode range from a font file

Question

I have a TTF file that contains Unicode code points and their corresponding glyphs, as the figure shows:

The red box is the Unicode code point, and the text above it is the corresponding glyph. How can I extract the Unicode code points from the font file?

What's a `tff` file? Do you mean a `tiff` file (an image file) or a `ttf` file (a TrueType Font file)? – Erwin Bolwidt, Jan 31 '18 at 02:35
Not exactly sure what you mean by "extract Unicode". However, if you install the font in Java, you can get its `java.awt.Font` object and call [`Font.canDisplay(char)`](https://docs.oracle.com/javase/7/docs/api/java/awt/Font.html#canDisplay(char)) or [`Font.canDisplay(int)`](https://docs.oracle.com/javase/7/docs/api/java/awt/Font.html#canDisplay(int)) to check whether it can render a Java character or a Unicode codepoint, respectively. Is that what you mean? – Erwin Bolwidt, Jan 31 '18 at 02:41
Thanks a lot. A website uses this font file to display its text. For example, the original text is "high", but the HTML contains "$EDBC", and the browser shows the normal word "high" on the page. My crawler, however, gets the code "$EDBC". I have the font file and am trying to get the code $EDBC from it. – DuFei, Jan 31 '18 at 02:47
For my understanding, do you mean it's one of the letters of the English word "high", or do you mean a simplified Chinese character? I assume it's the latter: that it's showing `高` (U+9AD8) but with codepoint U+EDBC? – Erwin Bolwidt, Jan 31 '18 at 03:15

score 1 · Accepted Answer · answered Jan 31 '18 at 03:33

A Unicode font maps characters to glyphs. The process is described in this SO question: How does a Unicode character get mapped to a glyph in a font?

If a font maps a character to a glyph that doesn't look like what the character should be, there is no way to find out which other character the displayed glyph actually represents (short of doing OCR on a rendered bitmap of the character).

In your case, the Java character (and Unicode codepoint) U+EDBC is in a Unicode Private Use Area:

In Unicode, a Private Use Area (PUA) is a range of code points that, by definition, will not be assigned characters by the Unicode Consortium. [...] The code points in these areas cannot be considered as standardized characters in Unicode itself. They are intentionally left undefined so that third parties may define their own characters without conflicting with Unicode Consortium assignments.

That means that there is not even an intended standard meaning for these characters. It is possible that there is some documentation for this font where you may find the meaning of the codepoints.
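
As a small illustration, Java can report whether a code point falls in a Private Use Area through its Unicode general category. This is only a sketch; the class name `PuaCheck` is made up for the example, while `Character.getType(int)` and `Character.PRIVATE_USE` are standard library members.

```java
class PuaCheck {
    // True if the code point's general category is "Co" (private use).
    // This covers the BMP range U+E000..U+F8FF and the two
    // supplementary private-use planes.
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        System.out.println(isPrivateUse(0xEDBC)); // code point from the question
        System.out.println(isPrivateUse(0x9AD8)); // an ordinary CJK ideograph
    }
}
```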

If not, your only option is to create your own mapping table from the characters used on the web page to the standard Unicode codepoints that you believe are the closest representation of the glyphs that the font shows.
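
Such a hand-made mapping table could look roughly like the sketch below. The single entry (U+EDBC to U+9AD8) is only the guess raised in the comments under the question, not something read from the font, and the names `PuaDecoder`, `TABLE`, and `decode` are invented for this example.

```java
import java.util.Map;

class PuaDecoder {
    // Hypothetical, hand-maintained table: private-use code point -> standard text.
    // The entry below is an assumed example; real entries must be found by
    // looking at the rendered glyphs (or from the font's documentation).
    static final Map<Integer, String> TABLE = Map.of(
            0xEDBC, "\u9AD8"
    );

    // Replaces every code point that has a table entry; all others pass through.
    static String decode(String scraped) {
        StringBuilder out = new StringBuilder();
        scraped.codePoints().forEach(cp -> {
            String replacement = TABLE.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("a\uEDBCb"));
    }
}
```

A plain map keeps the guesswork visible: each entry is a human judgment about one glyph, so it can be reviewed and corrected independently of the crawler code.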

Erwin Bolwidt

Thanks a lot. I understand it. What I am trying to do is find a way to extract the Unicode codepoint U+EDBC from the font file. Is there any way to do it? – DuFei Jan 31 '18 at 06:26
What do you mean with "extract the codepoint"? I don't understand. – Erwin Bolwidt Jan 31 '18 at 07:27
Just like the code U+EDBC you mentioned. It is contained in the font file. I am trying to figure out which PUA codepoints are used in a font file. – DuFei Jan 31 '18 at 07:29
I still don't understand. What does what you want to extract look like? A number? An image? – Erwin Bolwidt Jan 31 '18 at 07:31
For example, if the creator of a font file uses $EDBC to map the letter "A", I want to know which codes (such as $EDBC) are used in the font file. – DuFei Jan 31 '18 at 07:37
No, that's what my answer tries to explain. $EDBC doesn't map the letter "A". The font maps $EDBC to a glyph that has two diagonal lines and one horizontal line. There is nothing in there about the character "A". The only automated way to figure out that it's the character "A" is to render it to an image and then perform OCR (optical character recognition) on it. – Erwin Bolwidt Jan 31 '18 at 07:42
I know that. I do not care about "A"; I only want to know the code $EDBC. For example, assume that a font file contains only two mappings, $EDBC -> a and $ECB1 -> B. I only want to know $EDBC and $ECB1. I don't need to know "a" or "B". I want to output ['$EDBC','$ECB1']. – DuFei Jan 31 '18 at 07:55
That goes back to my first comment about `canDisplay`: `Font f = Font.createFont(...); for (int i = 0xE000; i <= 0xF8FF; i++) { if (f.canDisplay(i)) { System.out.printf("%4x, ", i); } }` – Erwin Bolwidt Jan 31 '18 at 08:14
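
The loop in the last comment can be expanded into a small program along these lines. This is a sketch under a few assumptions: the font path is taken from the first command-line argument, only the BMP Private Use Area (U+E000 to U+F8FF) is scanned, and the class and method names (`PuaScanner`, `scan`) are invented for the example. The check is passed in as an `IntPredicate` so that the scanning logic can be exercised without a font file.

```java
import java.awt.Font;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class PuaScanner {
    // Collects every code point in [first, last] accepted by the predicate,
    // formatted as U+XXXX.
    static List<String> scan(IntPredicate canDisplay, int first, int last) {
        List<String> found = new ArrayList<>();
        for (int cp = first; cp <= last; cp++) {
            if (canDisplay.test(cp)) {
                found.add(String.format("U+%04X", cp));
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path to the .ttf file.
        Font font = Font.createFont(Font.TRUETYPE_FONT, new File(args[0]));
        // Font.canDisplay(int) reports whether the font has a glyph for the code point.
        System.out.println(scan(cp -> font.canDisplay(cp), 0xE000, 0xF8FF));
    }
}
```

This lists the private-use codes for which the font has a glyph, which is what the asker wants to output. It says nothing about what those glyphs look like.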