Unicode - how to determine the right font for drawing a grapheme cluster consisting of multiple unicode codepoints?

Question

I need a very simple user interface for my Linux C++ program. The UI should just present a list of strings that the user can navigate with the arrow keys or the j/k/h/l keys. I am aware that low-level libraries as complicated as Xlib and Xft are generally not recommended, but since I only need basic functionality (rectangles with text), I don't want huge toolkits like Gtk/Qt as dependencies, so I use plain old Xlib for drawing the UI and Xft for drawing text. Although I don't need a complicated UI, I want my program to correctly display complicated text that may include icons, emojis, and mixed languages, and that may be left-to-right, right-to-left, or even bidirectional. I also need to count and process "real characters" in the text, not just plain bytes (chars). So I decided to use the ICU library for text processing. It is quite a complicated beast, but after a couple of days of reading I think I now understand the very basics of Unicode and ICU.

What I am interested in is the process of rendering what the user perceives as a single character. ICU provides the icu::UnicodeString class for storing Unicode strings. The underlying data of this class is stored as a sequence of 16-bit code units (UTF-16), and one Unicode code point consists of either one code unit or two (a surrogate pair). With the help of icu::CharacterIterator it is possible to iterate over such strings in terms of either code units or code points.
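To make the code unit versus code point distinction concrete, here is a minimal sketch that decodes UTF-16 by hand using only the standard library. The function name decode_utf16 is made up for illustration; in a real program ICU already does this work (for example through icu::UnicodeString::char32At or the U16_NEXT macro), so treat this only as a picture of what happens underneath.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-16 code units into code points.
// A lead surrogate (0xD800..0xDBFF) followed by a trail surrogate
// (0xDC00..0xDFFF) forms one code point above U+FFFF.
// Unpaired surrogates are passed through unchanged.
std::vector<char32_t> decode_utf16(const std::u16string& text) {
    std::vector<char32_t> code_points;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char32_t unit = text[i];
        bool is_lead = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_lead && i + 1 < text.size()) {
            char32_t trail = text[i + 1];
            if (trail >= 0xDC00 && trail <= 0xDFFF) {
                // Combine the two 10-bit halves and add the 0x10000 offset.
                unit = 0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00);
                ++i;  // the trail unit has been consumed
            }
        }
        code_points.push_back(unit);
    }
    // Every code point consumes at least one code unit.
    assert(code_points.size() <= text.size());
    return code_points;
}
```

For example, the string u"a\U0001F600" holds three code units but decodes to two code points: U+0061 and U+1F600.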

What I want to implement in my code is the following:
The user sets the main font that will always be preferred for drawing text. But if the text to be displayed mixes several languages or contains emojis, it may be impossible to display the entire text with one font, and ugly rectangles will appear in place of some characters. To solve this problem I use some FontConfig functions to create a logically sorted array of system fonts, where the main font always has index 0. Now I can iterate over the string using icu::CharacterIterator methods to extract every code point and call the FcCharSetHasChar() function from the FontConfig library to check whether the main font can display the character. If not, I iterate further over the array of fonts until I find the closest font that has a glyph for the character (otherwise the fallback font is used). This is not great for performance, but now every character of the string can be displayed with an appropriate font using the Xft text-drawing functions.
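The per-code-point fallback described above can be sketched as follows. This is only an illustration under stated assumptions: the Font struct with a std::set is a hypothetical stand-in for a FontConfig pattern and its FcCharSet (has_char plays the role of FcCharSetHasChar()), and pick_font, Run and font_runs are invented names. Using index 0 as the fallback when no font matches is also an assumption. Grouping consecutive code points that resolve to the same font into runs means the font list is searched once per code point, but each run could then be drawn with a single text-drawing call.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a FontConfig font and its character set.
struct Font {
    std::string name;
    std::set<char32_t> coverage;  // code points the font has glyphs for

    bool has_char(char32_t cp) const { return coverage.count(cp) != 0; }
};

// Return the index of the first font (in priority order) that covers cp.
// Index 0 is the user's main font; it is also used as the fallback here.
std::size_t pick_font(const std::vector<Font>& fonts, char32_t cp) {
    assert(!fonts.empty());
    for (std::size_t i = 0; i < fonts.size(); ++i) {
        if (fonts[i].has_char(cp)) return i;
    }
    return 0;  // assumption: nothing covers cp, so fall back to the main font
}

// A maximal stretch of consecutive code points drawn with one font.
struct Run {
    std::size_t font_index;
    std::size_t start;   // index of the first code point in the run
    std::size_t length;  // number of code points in the run
};

// Split the text into runs of code points that share the same font.
std::vector<Run> font_runs(const std::vector<Font>& fonts,
                           const std::u32string& text) {
    std::vector<Run> runs;
    for (std::size_t i = 0; i < text.size(); ++i) {
        std::size_t font = pick_font(fonts, text[i]);
        if (!runs.empty() && runs.back().font_index == font) {
            ++runs.back().length;  // extend the current run
        } else {
            runs.push_back(Run{font, i, 1});
        }
    }
    return runs;
}
```

Note that this works on single code points, so it can split a grapheme cluster across two fonts.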

The problem is that in reality things are even more complicated :) What the user perceives as a "character" is not a code point but a so-called "grapheme cluster", which may consist of multiple code points. And this is where I am not sure how to handle them correctly. The good news is that ICU has a class, icu::BreakIterator, that can detect the correct boundaries of every grapheme cluster in the text, regardless of how many code points it consists of. Nice! But how can I correctly decide which font to use for drawing a block of multiple logically related code points that compose one grapheme cluster?

In theory, I can do this:

1. Detect the boundaries of a grapheme cluster using icu::BreakIterator.
2. Extract the first code point in the cluster using icu::UnicodeString::char32At(grapheme_cluster_start).
3. Iterate over the array of fonts and use FcCharSetHasChar() to find the nearest font capable of representing this code point.
4. Use this font to draw the entire grapheme cluster.

But is this a reliable approach? If I find a font that has a glyph for the first code point in a grapheme cluster, can I assume that the same font can correctly display the entire cluster? Or should I iterate over every code point in the cluster and look for a single font that can draw all of them?
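The two strategies in question can be compared side by side in a small sketch. Everything here is illustrative: Font is again a hypothetical stand-in for a FontConfig character set, the cluster is assumed to be already extracted as a sequence of code points, and font_for_first, font_for_all and is_skipped are invented names. The choice to skip joiners and variation selectors in the coverage check is an assumption, not an established rule; some fonts may not list those code points in their character map, so a strict check might reject a font that could otherwise shape the cluster. The sketch does not settle which strategy is right; it only shows that they can disagree.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a FontConfig font and its character set.
struct Font {
    std::string name;
    std::set<char32_t> coverage;

    bool has_char(char32_t cp) const { return coverage.count(cp) != 0; }
};

// Assumption: ZWNJ, ZWJ and variation selectors VS1..VS16 are not
// required to be present in the font's character set.
bool is_skipped(char32_t cp) {
    return cp == 0x200C || cp == 0x200D || (cp >= 0xFE00 && cp <= 0xFE0F);
}

// True if the font covers every non-skipped code point in the cluster.
bool covers_cluster(const Font& font, const std::u32string& cluster) {
    for (char32_t cp : cluster) {
        if (!is_skipped(cp) && !font.has_char(cp)) return false;
    }
    return true;
}

// Strategy A: choose by the first code point only.
std::size_t font_for_first(const std::vector<Font>& fonts,
                           const std::u32string& cluster) {
    assert(!fonts.empty());
    if (cluster.empty()) return 0;
    for (std::size_t i = 0; i < fonts.size(); ++i) {
        if (fonts[i].has_char(cluster[0])) return i;
    }
    return 0;  // fall back to the main font
}

// Strategy B: choose the first font that covers the whole cluster.
std::size_t font_for_all(const std::vector<Font>& fonts,
                         const std::u32string& cluster) {
    assert(!fonts.empty());
    for (std::size_t i = 0; i < fonts.size(); ++i) {
        if (covers_cluster(fonts[i], cluster)) return i;
    }
    return 0;  // fall back to the main font
}
```

With a main font that covers "e" but not U+0301 (combining acute accent) and a second font that covers both, strategy A picks the main font for the cluster "e" + U+0301 while strategy B picks the second font. Coverage alone still says nothing about whether the font can shape the sequence into a proper glyph, which is the part a shaping library handles.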

You are reinventing Pango and HarfBuzz. Really: reconsider and try to use those libraries (I think you can use them without installing all of GNOME). Such libraries have a lot of bugs, but they also contain a lot of insight into many languages and their particularities. - Giacomo Catenazzi, Feb 24 '23 at 12:36
Thank you for the suggestion, I am just looking through the HarfBuzz documentation right now and starting to think this is what I need. It cannot handle bidi though, but I think I can use some ICU functions for text processing and higher-level libraries for displaying. - Alexey104, Feb 24 '23 at 13:15
HarfBuzz is only for shaping (that is, selecting the character and moving the "cursor"). Pango is for bidi and more "high-level" language support. Usually both (plus FreeType) are used together (unless the programmer wants to link one of these to the system libraries on Windows or Mac). - Giacomo Catenazzi, Feb 24 '23 at 13:55
