For historical reasons, Cocoa's Unicode implementation is 16-bit: NSString deals in UTF-16 code units (unichars), and codepoints above U+FFFF are represented as "surrogate pairs" of two units each. This means that the following code is not going to work:
NSString *myString = @"𠬠"; // a single codepoint, U+20B20, outside the BMP
uint32_t codepoint = [myString characterAtIndex:0]; // unichar is only 16 bits
printf("%04x\n", codepoint); // incorrectly prints "d842", the high surrogate
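(I can recombine the pair by hand, of course; Core Foundation even ships an inline CFStringGetLongCharacterForSurrogatePair() for exactly this arithmetic. A minimal sketch, assuming index 0 really starts a valid high/low pair:)
unichar high = [myString characterAtIndex:0]; // 0xd842
unichar low = [myString characterAtIndex:1]; // 0xdf20
uint32_t cp = 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
printf("%04x\n", cp); // prints "20b20"
But that's not really what I'm after either: it needs an explicit is-this-a-surrogate check before it's safe on BMP characters.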
Now, this code works 100% of the time, but it's ridiculously verbose:
NSString *myString = @"𠬠"; // U+20B20 again
uint32_t codepoint;
[myString getBytes:&codepoint maxLength:4 usedLength:NULL
          encoding:NSUTF32StringEncoding options:0
             range:NSMakeRange(0,2) remainingRange:NULL];
printf("%04x\n", codepoint); // prints "20b20"
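(If there's no built-in idiom, I suppose that dance could at least be buried in a helper; codepointAtIndex below is my own hypothetical name, not a Foundation API. It uses rangeOfComposedCharacterSequenceAtIndex: so the converted range is one UTF-16 unit for a BMP character and two for a surrogate pair:)
// Hypothetical helper, not part of Foundation. maxLength:sizeof(cp) means
// at most the first UTF-32 unit of the sequence is written, which should
// leave just the base codepoint in cp.
uint32_t codepointAtIndex(NSString *s, NSUInteger index) {
    uint32_t cp = 0;
    NSRange r = [s rangeOfComposedCharacterSequenceAtIndex:index];
    [s getBytes:&cp maxLength:sizeof(cp) usedLength:NULL
       encoding:NSUTF32StringEncoding options:0
          range:r remainingRange:NULL];
    return cp;
}
printf("%04x\n", codepointAtIndex(@"𠬠", 0)); // prints "20b20"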
And this code, using mbtowc, also works, but it's still pretty verbose, it mutates global state (the C locale), it isn't thread-safe, and -UTF8String probably fills up the autorelease pool on top of all that:
setlocale(LC_CTYPE, "UTF-8");
wchar_t codepoint;
mbtowc(&codepoint, [@"𠬠" UTF8String], 16);
printf("%04x\n", (unsigned)codepoint); // prints "20b20"
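(For completeness: the restartable mbrtowc takes an explicit mbstate_t, which at least gets rid of mbtowc's hidden global conversion state, though it still depends on the setlocale call above and is no shorter:)
#include <string.h>
#include <wchar.h>
mbstate_t state = {0}; // explicit shift state instead of mbtowc's global one
wchar_t wc;
const char *utf8 = [@"𠬠" UTF8String]; // still autoreleased
mbrtowc(&wc, utf8, strlen(utf8), &state);
printf("%04x\n", (unsigned)wc); // prints "20b20"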
Is there any simple Cocoa/Foundation idiom for extracting the first (or Nth) Unicode codepoint from an NSString? Preferably a one-liner that just returns the codepoint?
The answer given in this otherwise excellent summary of Cocoa Unicode support (near the end of the article) is simply "Don't try it. If your input contains surrogate pairs, filter them out or something, because there's no sane way to handle them properly."