Finding "actual" characters (graphemes) in a QString

Question

Let's say I have a QString that may contain any Unicode characters, and I want to iterate through its characters or count them. By "characters" I mean what the user perceives as such (so roughly equivalent to "glyphs"), not simply QChars (16-bit UTF-16 code units). Some "actual" characters are built from several QChars (surrogate pairs; a base character plus combining marks). For some combining characters I might get away with normalizing the string to create precomposed characters, but that does not always help.

Have I overlooked a built-in function that splits a QString into "actual" characters?

Or if I have to parse it myself, is this the structure (in EBNF) or am I missing something?

character = ((high_surrogate, low_surrogate) | base_character), {combining_mark}

(with base_character being every QChar that is not a surrogate or combining character)
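To make the proposed grammar concrete, here is a minimal sketch that applies it to a UTF-16 string using only the standard library. The function names are made up for illustration, and `isCombiningMark` is a deliberately simplified assumption that only recognizes the Combining Diacritical Marks block (U+0300 to U+036F); a real check would need the Unicode character database.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// UTF-16 surrogate ranges, as fixed by the Unicode standard.
bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// ASSUMPTION: simplified stand-in for "combining mark".
// Only covers the Combining Diacritical Marks block, U+0300..U+036F.
bool isCombiningMark(char16_t c) { return c >= 0x0300 && c <= 0x036F; }

// Hypothetical helper that splits text according to the grammar
//   character = ((high_surrogate, low_surrogate) | base_character), {combining_mark}
std::vector<std::u16string> splitByGrammar(const std::u16string &text)
{
    std::vector<std::u16string> clusters;
    std::size_t i = 0;
    while (i < text.size()) {
        const std::size_t start = i;
        if (isHighSurrogate(text[i]) && i + 1 < text.size()
            && isLowSurrogate(text[i + 1])) {
            i += 2;  // well-formed surrogate pair
        } else {
            i += 1;  // base character (a stray surrogate or leading mark also lands here)
        }
        // Attach any combining marks that follow.
        while (i < text.size() && isCombiningMark(text[i]))
            ++i;
        clusters.push_back(text.substr(start, i - start));
    }
    return clusters;
}
```

For example, `splitByGrammar(u"e\u0301")` yields a single element holding both code units. Note that this grammar does not cover every case in Unicode's grapheme cluster rules (UAX #29), such as Hangul syllable sequences or emoji joined with a zero-width joiner.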

score 5 · Accepted Answer · answered Nov 04 '11 at 19:25

After more research I found the term for "actual character", grapheme, and with it the Qt class for finding grapheme boundaries: QTextBoundaryFinder.

Sebastian Negraszus

Here is a code sample that uses `QTextBoundaryFinder`: https://stackoverflow.com/a/49558718/257299 – kevinarpe Apr 19 '20 at 12:47

score 2 · Answer 2 · answered Nov 04 '11 at 15:11

I am not sure about the combining marks, but for the surrogate pairs, I think you can use QString::toUcs4(), which should return a 32-bit (UCS-4) representation of your string.

Steffen

QString::toUcs4() returns 2 values for a surrogate pair – Ivan Romanov Nov 11 '16 at 19:50
QString::toUcs4() returns 2 values for a surrogate pair in Qt4 but only 1 value in Qt5. So your answer is partially correct. – Ivan Romanov Nov 11 '16 at 20:03
My bad. Qt4 returns the wrong vector size. So for the string "😀😀😀😀" it returns 8 values: "1f600" "1f600" "1f600" "1f600" "0" "0" "0" "0" – Ivan Romanov Nov 11 '16 at 20:19