For historical reasons, Cocoa's Unicode implementation is 16-bit: NSString deals in UTF-16 code units (unichars), and codepoints above U+FFFF are represented as "surrogate pairs" of two units each. This means that the following code is not going to work:
NSString *myString = @"𠬠"; // a single codepoint, U+20B20, outside the BMP
uint32_t codepoint = [myString characterAtIndex:0]; // unichar is only 16 bits
printf("%04x\n", codepoint); // incorrectly prints "d842", the high surrogate
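(I can recombine the pair by hand, of course; Core Foundation even ships an inline CFStringGetLongCharacterForSurrogatePair() for exactly this arithmetic. A minimal sketch, assuming index 0 really starts a valid high/low pair:)
unichar high = [myString characterAtIndex:0]; // 0xd842
unichar low = [myString characterAtIndex:1]; // 0xdf20
uint32_t cp = 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
printf("%04x\n", cp); // prints "20b20"
But that's not really what I'm after either: it needs an explicit is-this-a-surrogate check before it's safe on BMP characters.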
Now, this code works 100% of the time, but it's ridiculously verbose:
NSString *myString = @"𠬠"; // U+20B20 again
uint32_t codepoint;
[myString getBytes:&codepoint maxLength:4 usedLength:NULL
          encoding:NSUTF32StringEncoding options:0
             range:NSMakeRange(0,2) remainingRange:NULL];
printf("%04x\n", codepoint); // prints "20b20"
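(If there's no built-in idiom, I suppose that dance could at least be buried in a helper; codepointAtIndex below is my own hypothetical name, not a Foundation API. It uses rangeOfComposedCharacterSequenceAtIndex: so the converted range is one UTF-16 unit for a BMP character and two for a surrogate pair:)
// Hypothetical helper, not part of Foundation. maxLength:sizeof(cp) means
// at most the first UTF-32 unit of the sequence is written, which should
// leave just the base codepoint in cp.
uint32_t codepointAtIndex(NSString *s, NSUInteger index) {
    uint32_t cp = 0;
    NSRange r = [s rangeOfComposedCharacterSequenceAtIndex:index];
    [s getBytes:&cp maxLength:sizeof(cp) usedLength:NULL
       encoding:NSUTF32StringEncoding options:0
          range:r remainingRange:NULL];
    return cp;
}
printf("%04x\n", codepointAtIndex(@"𠬠", 0)); // prints "20b20"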
And this code, using mbtowc, also works, but it's still pretty verbose, it mutates global state (the C locale), it isn't thread-safe, and -UTF8String probably fills up the autorelease pool on top of all that:
setlocale(LC_CTYPE, "UTF-8");
wchar_t codepoint;
mbtowc(&codepoint, [@"𠬠" UTF8String], 16);
printf("%04x\n", (unsigned)codepoint); // prints "20b20"
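(For completeness: the restartable mbrtowc takes an explicit mbstate_t, which at least gets rid of mbtowc's hidden global conversion state, though it still depends on the setlocale call above and is no shorter:)
#include <string.h>
#include <wchar.h>
mbstate_t state = {0}; // explicit shift state instead of mbtowc's global one
wchar_t wc;
const char *utf8 = [@"𠬠" UTF8String]; // still autoreleased
mbrtowc(&wc, utf8, strlen(utf8), &state);
printf("%04x\n", (unsigned)wc); // prints "20b20"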
Is there any simple Cocoa/Foundation idiom for extracting the first (or Nth) Unicode codepoint from an NSString? Preferably a one-liner that just returns the codepoint?
The answer given in this otherwise excellent summary of Cocoa Unicode support (near the end of the article) is simply "Don't try it. If your input contains surrogate pairs, filter them out or something, because there's no sane way to handle them properly."