
Say I have a QString that may contain any Unicode characters, and I want to iterate over its characters or count them. By "characters" I mean what the user perceives as such (roughly equivalent to "glyphs"), not simply QChars (16-bit UTF-16 code units). Some "actual" characters consist of several QChars (surrogate pairs, or a base character plus combining marks). For some combining characters I might get away with normalizing the string into precomposed characters, but that does not always help.

Have I overlooked a built-in function that splits a QString into "actual" characters?

Or, if I have to parse it myself, is this the right structure (in EBNF), or am I missing something?

character = ((high_surrogate, low_surrogate) | base_character), {combining_mark}

(with base_character being every QChar that is not a surrogate or combining character)
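
For illustration, the grammar above could be sketched roughly as follows. This is a minimal sketch in plain standard C++ over `std::u16string` rather than QString, and `isCombiningMark`, `isHighSurrogate`, `isLowSurrogate` and `countCharacters` are hypothetical helpers, not Qt functions. The combining-mark test covers only a few dedicated Unicode blocks, which is an assumption made for brevity; it is not the full Unicode "Mark" category.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for code units in a few dedicated
// combining-mark blocks. Simplified on purpose; the real Unicode
// "Mark" category contains many more characters.
bool isCombiningMark(char16_t c) {
    return (c >= 0x0300 && c <= 0x036F)   // Combining Diacritical Marks
        || (c >= 0x1AB0 && c <= 0x1AFF)   // Combining Diacritical Marks Extended
        || (c >= 0x1DC0 && c <= 0x1DFF)   // Combining Diacritical Marks Supplement
        || (c >= 0x20D0 && c <= 0x20FF)   // Combining Diacritical Marks for Symbols
        || (c >= 0xFE20 && c <= 0xFE2F);  // Combining Half Marks
}

bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Counts "characters" according to the grammar:
//   character = ((high_surrogate, low_surrogate) | base_character), {combining_mark}
// Lenient by design: an unpaired surrogate or a leading combining mark
// is treated as a base character of its own.
std::size_t countCharacters(const std::u16string &s) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        // (high_surrogate, low_surrogate) | base_character
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1]))
            i += 2;
        else
            i += 1;
        // {combining_mark}
        while (i < s.size() && isCombiningMark(s[i]))
            ++i;
        ++count;
    }
    return count;
}
```

With these assumptions, `countCharacters(u"e\u0301")` treats "e" plus a combining acute accent as one character. The grammar itself still ignores cases such as Hangul jamo sequences or emoji joined with zero-width joiners.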

Sebastian Negraszus

2 Answers


After more research I found the term for an "actual character": grapheme (more precisely, grapheme cluster). With that term I also found the Qt class for finding grapheme boundaries: QTextBoundaryFinder.

Sebastian Negraszus

I am not sure about the combining marks, but for surrogate pairs I think you can use QString::toUcs4(), which returns a UCS-4 (UTF-32) representation of your string with one 32-bit value per code point, so each surrogate pair becomes a single element.

Steffen