4

In Objective-C...

If I have a character like "∆", how can I get its Unicode value and then determine whether it falls in a certain range of values?

For example, I want to know whether a given character is in the Unicode range U+1F300 to U+1F6FF.

Albert Renshaw
  • 1
    Good question. It's trivial if the char is <= `U+FFFF`. Just use `unichar`. I haven't seen a good method for chars >= `U+10000`. – rmaddy Feb 11 '13 at 23:23
  • @rmaddy Is `unichar` a method for determining what the unicode value of a character is under `U+FFFF` or is it a method for determining what range a given unicode value is in? – Albert Renshaw Feb 11 '13 at 23:34
  • 1
    `unichar` is a data type. See the `NSString characterAtIndex:` method. – rmaddy Feb 11 '13 at 23:35
  • @rmaddy Worked good so far... when I tried NSLogging it I used `%hu` and it worked all the way up to `55357` ... then every unichar after that returned the value `55357` no matter how much higher I went... what to use other than `%hu`? – Albert Renshaw Feb 11 '13 at 23:48
  • What do you mean by "have a character"? Where do you have it? How? Is it in a variable? Of what type? How is the character represented (e.g. UTF-8, UTF-16)? – Ken Thomases Feb 12 '13 at 05:27
  • Nothing specific yet... but for the sake of example we can just say it's in a standard `NSString` – Albert Renshaw Feb 12 '13 at 05:47

1 Answer

2

NSString uses UTF-16 to store code points internally, so those in the range you're looking for (U+1F300 to U+1F6FF) are each stored as a surrogate pair: two 16-bit code units, four bytes in total. Despite its name, characterAtIndex: (like the unichar type) knows nothing about code points; it returns whichever single 16-bit code unit sits at the index you give it. The 55357 you're seeing is 0xD83D, the lead (high) surrogate of the code point in UTF-16.
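To make the surrogate-pair layout concrete, here is a minimal sketch in plain C (which also compiles as Objective-C) that splits a code point above U+FFFF into its two UTF-16 code units. The function name `encode_surrogate_pair` is made up for this illustration and is not a Foundation API; the arithmetic is the standard UTF-16 encoding rule.

```c
#include <stdint.h>

/* Hypothetical helper: splits a supplementary code point
   (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
   Returns 1 on success, 0 if cp is outside that range. */
static int encode_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low) {
    if (cp < 0x10000 || cp > 0x10FFFF) {
        return 0;                                  /* fits in one unit, or invalid */
    }
    uint32_t v = cp - 0x10000;                     /* 20 significant bits remain */
    *high = (uint16_t)(0xD800 + (v >> 10));        /* top 10 bits: lead surrogate */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));      /* bottom 10 bits: trail surrogate */
    return 1;
}
```

For U+1F600 this yields the pair 0xD83D, 0xDE00. The first value is 55357 in decimal, which is the number that showed up when logging with `%hu`. Every code point from U+1F400 to U+1F7FF shares that same lead surrogate, which is why the logged value stopped changing.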

To examine the raw codepoints, you'll want to convert the string/characters into UTF-32 (which encodes them directly). To do this, you have a few options:

  1. Get both UTF-16 code units that make up the code point, and use either this algorithm or CFStringGetLongCharacterForSurrogatePair to convert the surrogate pair to UTF-32.

  2. Use either dataUsingEncoding: or getBytes:maxLength:usedLength:encoding:options:range:remainingRange: to convert the NSString to UTF-32, and interpret the raw bytes as uint32_t values, one per code point.

  3. Use a library like ICU.
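As a rough sketch of the first option, the following plain C (usable as-is from Objective-C, where `unichar` is a 16-bit unsigned type) decodes a surrogate pair by hand and then does the range check from the question. The names `code_point_at` and `in_range` are hypothetical helpers written for this example. The code units are assumed to have been copied out of the string already, for instance with `characterAtIndex:`.

```c
#include <stddef.h>
#include <stdint.h>

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Hypothetical helper: returns the code point that starts at units[i]
   and stores the number of code units it occupied (1 or 2) in *consumed.
   An unpaired surrogate is returned unchanged as a single unit. */
static uint32_t code_point_at(const uint16_t *units, size_t len,
                              size_t i, size_t *consumed) {
    uint16_t u = units[i];
    if (is_high_surrogate(u) && i + 1 < len && is_low_surrogate(units[i + 1])) {
        *consumed = 2;
        return 0x10000
             + ((uint32_t)(u - 0xD800) << 10)      /* top 10 bits */
             + (uint32_t)(units[i + 1] - 0xDC00);  /* bottom 10 bits */
    }
    *consumed = 1;
    return u;
}

/* Inclusive range test on a decoded code point. */
static int in_range(uint32_t cp, uint32_t first, uint32_t last) {
    return cp >= first && cp <= last;
}
```

With the units `{ 0x2206, 0xD83D, 0xDE00 }` (that is, "∆" followed by U+1F600), index 0 decodes to U+2206 using one unit and falls outside U+1F300 to U+1F6FF, while index 1 decodes to U+1F600 using two units and falls inside it. Advancing the index by `*consumed` each time walks the string one code point at a time.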

一二三