0

I'd like to know whether there is any way to get the number of characters (in terms of Unicode code points) stored in a CFString object in the CoreFoundation framework.

There is a function for this, CFStringGetLength, but it does not do what its name suggests.

Example: I am trying to get the length of a string containing one character (the letter "peep" of the Shavian alphabet, U+10450), which lies in the second Unicode plane (the SMP).

UInt8 arr[] = {0xf0, 0x90, 0x91, 0x90}; //UTF8
CFStringRef r = CFStringCreateWithBytes(0, arr, sizeof(arr),
                                        kCFStringEncodingUTF8, false);
CFIndex length = CFStringGetLength(r);

The documentation states that it returns:

The number (in terms of UTF-16 code pairs) of characters stored in theString.

As you can see, this sentence is contradictory: the number of characters is not always equal to the number of UTF-16 code units. However, the part in parentheses is more accurate: the actual result of the function is the number of UTF-16 code units. In my example, the result is 2 (the number of code units required to encode the character in UTF-16), while the function name suggests, in my opinion, that the result would be 1.
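
The value 2 follows from how UTF-16 represents code points above U+FFFF. As a minimal sketch in plain C, independent of CoreFoundation, the arithmetic looks roughly like this; the helper name `encode_utf16` is hypothetical and only serves as illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a CoreFoundation API): encodes one Unicode
   code point as UTF-16 and returns the number of code units written,
   which is either 1 or 2. */
static size_t encode_utf16(uint32_t cp, uint16_t out[2]) {
    assert(cp <= 0x10FFFF);               /* inside the Unicode range */
    assert(cp < 0xD800 || cp > 0xDFFF);   /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;            /* BMP: a single code unit */
        return 1;
    }
    cp -= 0x10000;                        /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For U+10450 this yields the two code units 0xD801 and 0xDC50, which matches the length of 2 reported for the example string.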

I'd like to find a way to get the number of characters in terms of Unicode code points. Is there any way to do this in CoreFoundation?

Sean
notsurewhattodo
  • `CFStringGetLength()` does actually return the number of characters "in terms of Unicode code points." The UTF-16 character sequence required to render '𐑐' is [0xD801,0xDC50]. What you **appear** to be looking for, to be more precise, is to return the number of graphemes (or "glyphs" if you're speaking about fonts) *represented* by the underlying UTF-16 sequences. I'll cook up an example that does what you want soon using correct CF API. – Sean Sep 20 '16 at 22:04
  • @Sean I couldn't find any definition of what `CFStringGetLength` returns except "UTF-16 code pairs". And I couldn't find any clue as to what "UTF-16 code pairs" means. Could you explain how "code pairs" are actually "code points"? – eonil Feb 11 '20 at 11:42

3 Answers

1

I've found a workaround. It is not perfect, as it probably requires an additional conversion to UTF-32.

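UInt8 arr[] = {0xf0, 0x90, 0x91, 0x90}; // U+10450 encoded as UTF-8
CFStringRef r = CFStringCreateWithBytes(kCFAllocatorDefault,
                                        arr,
                                        sizeof(arr),
                                        kCFStringEncodingUTF8,
                                        false);
CFIndex length = CFStringGetLength(r); // 2 here: UTF-16 code units
CFRange range = CFRangeMake(0, length);
CFIndex bytes = 0;
// A null buffer only asks how many bytes the conversion would need.
CFStringGetBytes(r, range, kCFStringEncodingUTF32, 0, false, nullptr,
                 0, &bytes);
CFIndex characterCount = bytes / 4; // four bytes per UTF-32 code unit
CFRelease(r); // the string was obtained from a Create function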

The workaround relies on the fact that, in contrast to UTF-16, UTF-32 by definition stores each code point in a single code unit. Since that unit is defined to be four bytes wide, and CFStringGetBytes can report the number of bytes required to store the string after conversion, the number of code points can be obtained by dividing that byte count by 4.

However, the main purpose of CFStringGetBytes is to perform the actual conversion, so even when nullptr is passed as the buffer argument, it is possible that at least the main part of the conversion still takes place. For this reason, it would be great to hear another solution to the problem.

notsurewhattodo
  • Did you normalize the string? – uchuugaka Mar 23 '13 at 02:36
  • No, as far as I know, CoreFoundation does not have a routine for that. However, I don't see a direct connection between Unicode normalization and my problem. Could you explain the reason for your question/suggestion? – notsurewhattodo Mar 27 '13 at 16:36
  • Well, the count may be different with composed or decomposed characters, especially in strings that contain both. Converting to UTF-32 does not change this at all. You still need to decide whether and how normalization is needed. – uchuugaka Mar 27 '13 at 16:44
  • I know that normalization can change the string length, but this is outside the scope of my problem. I do not want to modify the string, just get the count of _code points_ that exist in a given CFString. – notsurewhattodo Mar 27 '13 at 16:52
1

If you want to know the number of “characters” as a user sees them, regardless of normalization, loop over the composed character sequences using the range returned by CFStringGetRangeOfComposedCharactersAtIndex and count the iterations.

Martin Winter
  • This is the correct answer. At time of writing this comment, CF does not export any symbols that do this for you. – Sean Sep 20 '16 at 21:48
  • This link provides a snippet for that loop to count graphemes: https://www.objc.io/issues/9-strings/unicode/#looping – Avitzur Sep 16 '18 at 18:56
0

(This is my guess...)

I could find no definition of what CFStringGetLength returns. All the Apple documentation just says "UTF-16 code pairs"(?), and honestly, I can't figure out what that means. Unicode is complex, with many subtly different concepts, and without precise terms we cannot tell what is meant.

My guess is that it is the same as [NSString length], since CFString and NSString are toll-free bridged and should store the same data for best performance. [NSString length] returns the number of UTF-16 code units; this is strictly defined in the Apple documentation. Please note the difference in terms: "code unit" is a well-defined Unicode term, but "code pair" is one I don't know. (Does anyone know about this?) Also, a "code unit" is not the same as a "code point".

So I assume it returns the number of UTF-16 code units, but I won't bet on my guess. I would convert the string to an NSString and call [NSString length] to get a strictly defined number.
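
To illustrate the difference between code units and code points, here is a minimal sketch in plain C that counts code points in a buffer of UTF-16 code units. The helper name `count_code_points` is hypothetical and not part of CoreFoundation, and the choice to count an unpaired surrogate as one code point is an assumption of this sketch. With a CFString, the code units could presumably be copied into such a buffer first, for example with CFStringGetCharacters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a CoreFoundation API): counts Unicode code
   points in a buffer of n UTF-16 code units. A well-formed surrogate
   pair counts as one code point. An unpaired surrogate is also counted
   as one, which is an assumption made for this sketch. */
static size_t count_code_points(const uint16_t *units, size_t n) {
    assert(units != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int is_high = units[i] >= 0xD800 && units[i] <= 0xDBFF;
        if (is_high && i + 1 < n &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            i++;  /* skip the low surrogate that completes the pair */
        }
        count++;
    }
    return count;
}
```

For the buffer {0xD801, 0xDC50}, the number of code units is 2 while this function returns 1.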


To get Unicode grapheme clusters, it's best to use Swift strings. Swift's String has a native interface for accessing grapheme clusters: convert the string to a Swift String and iterate over it.

eonil