I need a very simple user interface for my Linux C++ program. The UI should just present a list of strings that the user can move through with the arrow keys or the j/k/h/l keys. I am aware that low-level libraries as complicated as Xlib and Xft are generally not recommended, but since my UI needs only basic functionality (rectangles with text), I don't want huge toolkits like Gtk/Qt as dependencies, so I use plain old Xlib for drawing the UI and Xft for drawing text. Although I don't need a complicated UI, I want my program to correctly display complicated text that may include icons, emojis, and mixed languages, and that may be left-to-right, right-to-left, or even bidirectional. I also need to count and process "real characters" in the text, not just plain bytes (chars). So I decided to use the ICU library for text processing. It is quite a complicated beast, but after a couple of days of reading I think I now understand the very basics of Unicode and ICU.
What I am interested in is the process of rendering what the user perceives as a single character. ICU provides the icu::UnicodeString class for storing Unicode strings. The underlying data of this class is stored as a sequence of 16-bit code units, and one Unicode code point consists of either one such code unit or two (a surrogate pair). With the help of icu::CharacterIterator it is possible to iterate over such strings in terms of either code units or code points.
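To make the code unit vs. code point distinction concrete, here is a minimal sketch that decodes UTF-16 by hand using only the standard library. It is not how ICU does it internally; the function name decode_utf16 is made up for illustration, and passing unpaired surrogates through unchanged is my own simplification.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode a sequence of UTF-16 code units into Unicode code points.
// A high surrogate (0xD800..0xDBFF) followed by a low surrogate
// (0xDC00..0xDFFF) forms a single code point above U+FFFF.
// Assumption for this sketch: an unpaired surrogate is passed through as-is.
std::vector<char32_t> decode_utf16(const std::u16string& units) {
    std::vector<char32_t> code_points;
    for (std::size_t i = 0; i < units.size(); ++i) {
        char32_t u = units[i];
        const bool is_high = u >= 0xD800 && u <= 0xDBFF;
        if (is_high && i + 1 < units.size()) {
            const char32_t next = units[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Combine the two 10-bit halves and add the 0x10000 offset.
                u = 0x10000 + ((u - 0xD800) << 10) + (next - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        code_points.push_back(u);
    }
    return code_points;
}
```

For example, the string u"a\U0001F600" holds three code units but decodes to two code points, U+0061 and U+1F600.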
What I want to implement in my code is the following:
The user sets the main font, which will always be preferred for drawing text. But if the text to be displayed mixes multiple languages or contains emojis, it may be impossible to display the entire text with one font, and ugly rectangles will appear in place of some characters. To solve this problem I use some FontConfig functions to create a logically sorted array of system fonts, where the main font always has index 0. Now I can iterate over the string using icu::CharacterIterator methods to extract every code point and call the FcCharSetHasChar() function from the FontConfig library to check whether the main font can display the character. If not, I iterate further over the array of fonts until I find the closest font that has a glyph for the character (otherwise the fallback font is used). This is not great for performance, but now every character of the string can be displayed with an appropriate font using the Xft functions for drawing text.
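The per-code-point lookup described above can be sketched roughly like this. To keep the example self-contained I replaced the real FontConfig check with a hypothetical has_char predicate; in my program that predicate would wrap FcCharSetHasChar() on the font's charset. The Font struct, pick_font, and the fallback_index parameter are names invented for this sketch.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of the sorted font array.
// has_char plays the role of FcCharSetHasChar() for that font's charset.
struct Font {
    std::string name;
    std::function<bool(char32_t)> has_char;
};

// Return the index of the first font that covers `cp`.
// fonts[0] is the main font, so it is always tried first.
// If no font covers the code point, return `fallback_index`.
std::size_t pick_font(const std::vector<Font>& fonts, char32_t cp,
                      std::size_t fallback_index) {
    for (std::size_t i = 0; i < fonts.size(); ++i) {
        if (fonts[i].has_char(cp)) {
            return i;
        }
    }
    return fallback_index;
}
```

Since the scan is linear in the number of fonts for every code point, caching the result per code point (or per script) would probably help with the performance concern.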
The problem is that in reality things are even more complicated :) What the user perceives as a "character" is not a code point but a "grapheme cluster", which may consist of multiple code points. And here is where I am not sure how to handle them correctly. The good news is that ICU has a class, icu::BreakIterator, that can detect the correct boundaries of every grapheme cluster in the text, regardless of how many code points it consists of. Nice! But how can I correctly decide which font to use for drawing a block of multiple logically related code points that compose one grapheme cluster?
In theory, I can do this:
- Detect the boundaries of a grapheme cluster using icu::BreakIterator.
- Extract the first code point in the cluster using icu::UnicodeString::char32At(grapheme_cluster_start).
- Iterate over the array of fonts and use FcCharSetHasChar() to find the nearest font capable of representing this code point.
- Use this font to draw the entire grapheme cluster.
But is this a reliable approach? If I find a font that has a glyph for the first code point in a grapheme cluster, can I assume that the same font can correctly display the entire cluster? Or should I iterate over every code point in the cluster and find a single font that can draw all of them?
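To show exactly which two strategies I am comparing, here is a rough sketch of both. It assumes the cluster has already been extracted into a vector of code points (in my program, from the icu::BreakIterator boundaries), and it again uses a hypothetical Coverage predicate in place of FcCharSetHasChar(). The names font_for_first and font_for_all are invented, and returning fonts.size() for "no font found" is just a convention of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical per-font coverage check, standing in for FcCharSetHasChar().
using Coverage = std::function<bool(char32_t)>;

// Strategy A: choose the first font that covers the cluster's first code point.
// Returns fonts.size() if the cluster is empty or no font matches.
std::size_t font_for_first(const std::vector<Coverage>& fonts,
                           const std::vector<char32_t>& cluster) {
    if (cluster.empty()) return fonts.size();
    for (std::size_t i = 0; i < fonts.size(); ++i) {
        if (fonts[i](cluster.front())) return i;
    }
    return fonts.size();
}

// Strategy B: choose the first font that covers every code point in the cluster.
// Returns fonts.size() if the cluster is empty or no single font covers it all.
std::size_t font_for_all(const std::vector<Coverage>& fonts,
                         const std::vector<char32_t>& cluster) {
    if (cluster.empty()) return fonts.size();
    for (std::size_t i = 0; i < fonts.size(); ++i) {
        const bool covers_all = std::all_of(
            cluster.begin(), cluster.end(),
            [&](char32_t cp) { return fonts[i](cp); });
        if (covers_all) return i;
    }
    return fonts.size();
}
```

The two can disagree: for the cluster "e" + U+0301 (combining acute accent), a main font that only covers ASCII is chosen by strategy A but rejected by strategy B, which moves on to the next font that has both code points.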